[SPARK-50992][SQL] OOMs and performance issues with AQE in large plans #49724
Conversation
Thank you for making a PR, @SauronShepherd. I have a few comments.
- Please revise the PR title to describe what this PR's code provides. The current title is proper for a JIRA issue report, but looks improper as a PR title.
- Although I understand the intention, please don't change the default in this PR, because it's a behavior change.
- Please minimize the change by adding the new value `off` at the end. For example:

```diff
- or 'formatted'.
+ , 'formatted', or 'off'.
```
```diff
@@ -218,7 +216,7 @@ case class CachedRDDBuilder(
   private val materializedPartitions = cachedPlan.session.sparkContext.longAccumulator

   val cachedName = tableName.map(n => s"In-memory table $n")
-    .getOrElse(StringUtils.abbreviate(cachedPlan.toString, 1024))
+    .getOrElse(cachedPlan.simpleStringWithNodeId())
```
Is this equivalent logically?
I was expecting your first point, since it's a behavioral change. However, given that keeping the default value as-is has a major impact on memory, I thought it was worth discussing. I still believe that converting plans to strings shouldn't be Spark's default behavior unless strictly necessary. Besides, I'm not sure many Spark users would even notice it (well, actually, maybe they would, because of the performance improvement in their applications). I've been working with Spark for over eight years, and whenever I needed to inspect a plan, I explicitly ran the explain method rather than relying on the Spark UI. However, I've often seen teams resort to checkpointing their DataFrames simply because Spark took what felt like an eternity to begin execution, and this month I've observed the same while looking at GraphFrames code. I still don't see the need for Spark to be so verbose internally; in my view, performance should take precedence over verbosity in logging.

Regarding the new "off" value being placed at the end: the changes are minimal either way. However, to me it makes more sense for "off" to come before the value that produces the least content (i.e., "simple"), following a logical progression from lower to higher verbosity.

As for the change in cachedName, it doesn't yield exactly the same result, but since it's merely a literal identifier (a description) and not a key, the difference shouldn't matter. Like the previous changes, this modification is crucial to avoid traversing the entire plan tree each time a CachedRDDBuilder is created. Even with the new explain mode off, which avoids the OOM, generating the string still has a significant impact on performance, even for plans that aren't especially large.

That said, I have no problem making these changes in my PR. I was just hoping to address an internal behavior that I'm not convinced Spark should have by default.

Thanks for your feedback, @dongjoon-hyun

Ángel
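To make the difference concrete, here is a minimal spark-shell sketch contrasting the two expressions from the diff. It is illustrative only, not the PR's code; the DataFrame used to obtain a physical plan is made up.

```scala
import org.apache.commons.lang3.StringUtils
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Any physical plan will do for illustration.
val cachedPlan = spark.range(100).toDF("id").queryExecution.executedPlan

// Before: renders the ENTIRE plan tree to a string, then discards
// everything past the first 1024 characters.
val before = StringUtils.abbreviate(cachedPlan.toString, 1024)

// After: renders only the root node's one-line summary with its node id,
// without traversing the whole tree.
val after = cachedPlan.simpleStringWithNodeId()
```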
What changes were proposed in this pull request?
This PR introduces a new explain mode `off` to disable the generation of physical plan strings. It also modifies the internal attribute `cachedName` of `CachedRDDBuilder` objects.
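For context, explain modes are modeled as case objects extending the sealed `ExplainMode` trait in `org.apache.spark.sql.execution` and resolved via `ExplainMode.fromString`. A hedged sketch of what the addition could look like (`OffMode` is an assumed name; the PR's actual code may differ):

```scala
// Sketch only: added alongside SimpleMode, FormattedMode, etc. in the
// same file, since ExplainMode is sealed.
case object OffMode extends ExplainMode {
  val name = "off" // the string users would set in spark.sql.ui.explainMode
}
```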
Why are the changes needed?

Whenever a plan changes (which happens frequently once AQE kicks in), the physical plan's explain output is generated as a plain string. This process is highly expensive for large plans. Moreover, these strings are stored in the `ListenerBus` of `SparkContext`, consuming heap memory and potentially leading to OutOfMemory errors.

Due to its potential negative impact on Spark applications, this information should be available only on demand, for debugging purposes. This PR introduces a new explain mode `off`, which is set as the default to prevent unnecessary string generation. Explicit explanations of a DataFrame remain accessible even when this mode is active.

Additionally, when a `CachedRDDBuilder` object is created without a defined `tableName`, the full string representation of the plan is computed, only for the first 1024 characters to be kept. This expensive operation has been replaced with a call to `simpleStringWithNodeId`, which avoids the unnecessary computation.

IMPORTANT NOTE: This issue is causing an OutOfMemory (OOM) error in certain unit tests within GraphFrames, as reported in "Connected Components gives wrong results". It may also be a contributing factor to the frequent overuse of checkpoints, not only in GraphFrames but also among many Spark users.
Does this PR introduce any user-facing change?
Yes. By default, plan descriptions will no longer be available in the Spark UI. If users require this information, they must explicitly enable it by setting the `spark.sql.ui.explainMode` Spark configuration.
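For instance, a user who still wants plan descriptions in the UI could opt back in by overriding the proposed default (a sketch assuming the pre-existing `formatted` mode remains a valid value):

```scala
import org.apache.spark.sql.SparkSession

// Re-enable UI plan descriptions despite the proposed "off" default.
val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.ui.explainMode", "formatted")
  .getOrCreate()
```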
How was this patch tested?

Unit tests from sql/core and sql/catalyst, along with the test attached to the SPARK-50992 ticket.
Was this patch authored or co-authored using generative AI tooling?
No.