[SPARK-51185][Core] Revert simplifications to PartitionedFileUtil API to reduce memory requirements #49915

Closed
wants to merge 1 commit into from

LukasRupprecht
Contributor

What changes were proposed in this pull request?

This PR reverts an earlier change (#41632) that converted FileStatusWithMetadata.getPath from a def to a lazy val in order to simplify the PartitionedFileUtil helpers.
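
As a rough illustration of the trade-off named above, the sketch below uses simplified stand-in types rather than the actual Spark classes. `FakePath`, `StatusWithDef`, and `describeSplit` are hypothetical names, and the explicit `filePath` parameter is only an assumption about how a helper might avoid recomputing a `def`-based path; it is not the real `PartitionedFileUtil` signature.

```scala
// Hypothetical stand-in for a path type; not Hadoop's Path.
final class FakePath(val value: String)

// Stand-in wrapper where getPath is a def: every call builds a new FakePath.
final class StatusWithDef(raw: String, val length: Long) {
  def getPath: FakePath = new FakePath(raw)
}

object HelperSketch {
  // Hypothetical helper (not the real PartitionedFileUtil signature):
  // the caller computes the path once and passes it in, so the helper
  // does not need to call getPath again for every split.
  def describeSplit(file: StatusWithDef, filePath: FakePath, start: Long, maxLength: Long): String = {
    val end = math.min(start + maxLength, file.length)
    s"${filePath.value}[$start, $end)"
  }

  def main(args: Array[String]): Unit = {
    val file = new StatusWithDef("/data/part-00000.parquet", 200L)
    val path = file.getPath // computed once by the caller
    Seq(0L, 128L).map(start => describeSplit(file, path, start, 128L)).foreach(println)
  }
}
```

With a `lazy val` the extra parameter could be dropped, since repeated `getPath` calls would return the cached object; the price is that the object stays attached to the wrapper.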

Why are the changes needed?

Converting getPath from a def to a lazy val increases memory requirements: a lazy val caches its result in the instance, so each path must stay in memory for as long as its FileStatusWithMetadata exists. Because paths are expensive to store, this can raise memory utilization and increase the risk of OOMs.
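
The retention effect described above can be seen in a minimal sketch. It assumes simplified stand-in classes (`FakePath`, `WithDef`, `WithLazyVal`), all hypothetical and not Spark or Hadoop types; it only demonstrates the general Scala semantics that a `lazy val` stores its result in the enclosing instance while a `def` recomputes it on each call.

```scala
// Hypothetical stand-in for an expensive-to-store path object.
final class FakePath(val value: String)

// def: each call builds a fresh FakePath; the wrapper keeps no reference to it.
final class WithDef(raw: String) {
  def getPath: FakePath = new FakePath(raw)
}

// lazy val: the first call builds the FakePath and caches it in a field,
// so it stays reachable for as long as the wrapper itself is reachable.
final class WithLazyVal(raw: String) {
  lazy val getPath: FakePath = new FakePath(raw)
}

object RetentionSketch {
  def main(args: Array[String]): Unit = {
    val viaDef = new WithDef("/data/part-00000.parquet")
    val viaLazy = new WithLazyVal("/data/part-00000.parquet")

    // Reference equality (eq) shows whether the same object is handed back.
    println(viaDef.getPath eq viaDef.getPath)   // false: two separate objects
    println(viaLazy.getPath eq viaLazy.getPath) // true: the cached object
  }
}
```

In this toy setup, every live `WithLazyVal` that has been queried holds one extra `FakePath`, which mirrors the per-instance cost the revert is meant to avoid.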

Does this PR introduce any user-facing change?

No

How was this patch tested?

This is a small revert to code that existed before, so the existing tests are sufficient.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Feb 13, 2025
@dongjoon-hyun
Member

dongjoon-hyun commented Feb 13, 2025

cc @gengliangwang from

Also, cc @cloud-fan as a release manager of Apache Spark 4.0.0 (although this PR targets all live branches: master, branch-4.0, and branch-3.5).

@cloud-fan
Contributor

cloud-fan commented Feb 13, 2025

thanks, merging to master/4.0!

@cloud-fan cloud-fan closed this in 74d88b6 Feb 13, 2025
cloud-fan pushed a commit that referenced this pull request Feb 13, 2025
… to reduce memory requirements

Closes #49915 from LukasRupprecht/def_get-path.

Authored-by: Lukas Rupprecht <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 74d88b6)
Signed-off-by: Wenchen Fan <[email protected]>
@cloud-fan
Contributor

It conflicts with 3.5. @LukasRupprecht, can you open a new 3.5 PR? Thanks!

@yaooqinn
Member

Late +1

@dongjoon-hyun
Member

I resolved the issue with the Fix Version, 4.0.0, for now.

@LukasRupprecht
Contributor Author

Thanks @cloud-fan for merging this! Will prepare a separate PR for 3.5.

@LukasRupprecht
Contributor Author

@cloud-fan @dongjoon-hyun Here is the 3.5 version of this PR: #49995.

cloud-fan pushed a commit that referenced this pull request Feb 21, 2025
…l API to reduce memory requirements

This is the 3.5 PR. The main PR for 4.0 is #49915.

Closes #49995 from LukasRupprecht/def_get-path_3.5.

Authored-by: Lukas Rupprecht <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>