[SPARK-51185][Core][3.5] Revert simplifications to PartitionedFileUtil API to reduce memory requirements #49995

Closed

LukasRupprecht
Contributor

What changes were proposed in this pull request?

This PR reverts an earlier change (#41632) that converted FileStatusWithMetadata.getPath from a def to a lazy val in order to simplify the PartitionedFileUtil helpers.

This is the 3.5 PR. The main PR for 4.0 is #49915.

Why are the changes needed?

Converting getPath from a def to a lazy val increases memory requirements, because each path is now cached and must be kept in memory for as long as its FileStatusWithMetadata exists. Because paths are expensive to store, this can lead to higher memory utilization and increase the risk of OOMs.
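
To illustrate the difference, here is a minimal sketch of the two variants. The names `FakePath`, `LazyPathStatus`, and `DefPathStatus` are hypothetical stand-ins invented for this example. They are not Spark's or Hadoop's classes, and the sketch assumes the path is built on demand from a raw string, which may not match the real implementation.

```scala
// Hypothetical stand-in for a path object; not the real Hadoop Path class.
final class FakePath(val uri: String)

// Variant A: `lazy val` builds the path once and stores it in a field,
// so the path stays reachable for as long as the wrapper does.
class LazyPathStatus(rawPath: String) {
  lazy val getPath: FakePath = new FakePath(rawPath)
}

// Variant B: `def` builds a fresh path on each call and retains nothing,
// so the result can be garbage-collected once the caller drops it.
class DefPathStatus(rawPath: String) {
  def getPath: FakePath = new FakePath(rawPath)
}

object LazyValVsDef {
  def main(args: Array[String]): Unit = {
    val cached = new LazyPathStatus("s3://bucket/table/part-00000.parquet")
    val fresh = new DefPathStatus("s3://bucket/table/part-00000.parquet")

    // Same instance both times: the wrapper is holding on to it.
    println(cached.getPath eq cached.getPath) // prints "true"
    // Different instances: nothing is cached in the wrapper.
    println(fresh.getPath eq fresh.getPath) // prints "false"
  }
}
```

With many file statuses alive at once, variant A keeps one extra path object per status after the first access, while variant B trades that retained memory for recomputation on each call.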

Does this PR introduce any user-facing change?

No

How was this patch tested?

This is a small revert to code that already existed before, so the existing tests are sufficient.

Was this patch authored or co-authored using generative AI tooling?

No

@LukasRupprecht
Contributor Author

cc @cloud-fan @dongjoon-hyun

@cloud-fan
Contributor

Compilation failed, probably due to 3.5 using Scala 2.12

LukasRupprecht changed the title from "[SPARK-51185][Core] Revert simplifications to PartitionedFileUtil API to reduce memory requirements" to "[SPARK-51185][Core][3.5] Revert simplifications to PartitionedFileUtil API to reduce memory requirements" on Feb 20, 2025

@LukasRupprecht
Contributor Author

Hmm, looks like the build failures are in ShowTablesExec:

[error] /home/runner/work/spark/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ShowTablesExec.scala:28:30: object ArrayImplicits is not a member of package org.apache.spark.util
[error] import org.apache.spark.util.ArrayImplicits._
[error]                              ^
[info] done compiling
[info] compiling 11 Scala sources to /home/runner/work/spark/spark/mllib-local/target/scala-2.12/test-classes ...
[error] /home/runner/work/spark/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/ShowTablesExec.scala:54:57: value toImmutableArraySeq is not a member of Array[String]
[error]         .isTempView((ident.namespace() :+ ident.name()).toImmutableArraySeq)
[error]

which this PR is not touching. I just rebased and checks are running again.

@LukasRupprecht
Contributor Author

Hmm, looks like it's still failing compilation with the same error. @cloud-fan Do you have an idea why the build would fail with a supposedly unrelated error?

@cloud-fan
Contributor

oh it's broken by other commits and #50008 is fixing it

@cloud-fan
Contributor

@LukasRupprecht can you rebase your PR and try again? The issue should have been resolved.

@dongjoon-hyun
Member

Thank you for making a PR, @LukasRupprecht .

Is this targeted for Apache Spark 3.5.5, @HyukjinKwon and @cloud-fan ?

@cloud-fan
Contributor

@dongjoon-hyun yes it is, let me merge it now, thanks all!

cloud-fan pushed a commit that referenced this pull request Feb 21, 2025
…l API to reduce memory requirements

Closes #49995 from LukasRupprecht/def_get-path_3.5.

Authored-by: Lukas Rupprecht
Signed-off-by: Wenchen Fan

cloud-fan closed this on Feb 21, 2025

@dongjoon-hyun
Member

Got it. Thank you, @cloud-fan .
