[pull] master from apache:master #1235
Merged
### What changes were proposed in this pull request?
Introduce the offline state repartition API and checkpoint manager in PySpark.
### Why are the changes needed?
For offline state repartitioning.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
New test suite added.
### Was this patch authored or co-authored using generative AI tooling?
Claude-4.5-opus
Closes #53931 from micheal-o/micheal-okutubo_data/repart_py_api.
Authored-by: micheal-o <[email protected]>
Signed-off-by: Anish Shrigondekar <[email protected]>
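The idea behind offline state repartitioning can be illustrated with a toy, pure-Python sketch. This is not the actual PySpark API from the PR; the function names here (`partition_state`, `repartition_state`) are hypothetical. The point is that state rows live in hash-partitioned buckets, and an offline rewrite moves every row into a layout with a different partition count without losing or duplicating any row.

```python
# Toy illustration of offline state repartitioning (hypothetical helpers,
# not the real PySpark API): state rows are hash-partitioned by key, and
# an offline rewrite re-buckets them under a new partition count.

def partition_state(rows, num_partitions):
    """Hash-partition (key, value) rows into num_partitions buckets."""
    buckets = [[] for _ in range(num_partitions)]
    for key, value in rows:
        buckets[hash(key) % num_partitions].append((key, value))
    return buckets

def repartition_state(buckets, new_num_partitions):
    """Offline rewrite: flatten existing buckets, re-bucket under the new count."""
    all_rows = [row for bucket in buckets for row in bucket]
    return partition_state(all_rows, new_num_partitions)

state = partition_state([("a", 1), ("b", 2), ("c", 3)], num_partitions=4)
resized = repartition_state(state, new_num_partitions=2)

# Every row survives the rewrite exactly once.
assert sorted(r for b in resized for r in b) == [("a", 1), ("b", 2), ("c", 3)]
```

In the real feature the rewrite operates on checkpointed state store files rather than in-memory lists, but the invariant is the same: the repartitioned state must contain exactly the rows of the original.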
### What changes were proposed in this pull request?
This PR adds validation when accessing the checkpoint to detect this inconsistent state and throw an error before the query can start with a new query ID.
### Why are the changes needed?
When a streaming checkpoint directory has non-empty offset and commit logs but is missing the metadata file (containing the streaming query ID), the query will generate a new UUID on restart. This breaks the deduplication mechanism of exactly-once sinks, which rely on the streaming query ID to skip already-processed batches, leading to data duplication.
### Does this PR introduce _any_ user-facing change?
Yes. There is a new error condition, MISSING_METADATA_FILE, which occurs when a streaming checkpoint directory has non-empty offset and commit logs but is missing the metadata file.
### How was this patch tested?
Unit tests and integration tests are added.
### Was this patch authored or co-authored using generative AI tooling?
Used to assist in writing the test suite.
Generated-by: Claude Sonnet 4.5
Closes #53844 from jerrytq/jerry-zheng_data/SPARK-55058.
Authored-by: Jerry Zheng <[email protected]>
Signed-off-by: Anish Shrigondekar <[email protected]>
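The shape of the new validation can be sketched in plain Python. This is a simplified illustration, not Spark's actual implementation; `validate_checkpoint` and the directory layout assumptions here are hypothetical stand-ins for Spark's checkpoint handling. The rule is: if the offset and commit logs are non-empty but the metadata file recording the query ID is absent, fail fast instead of silently minting a new query ID.

```python
import os

class MissingMetadataFileError(Exception):
    """Raised when a checkpoint has progress logs but no metadata file."""

def validate_checkpoint(checkpoint_dir):
    """Sketch of the MISSING_METADATA_FILE check (hypothetical helper).

    A checkpoint with non-empty offsets/ and commits/ logs must also have
    the metadata file that records the streaming query ID; otherwise a
    restart would assign a fresh UUID and break exactly-once sink dedup.
    """
    def non_empty(subdir):
        path = os.path.join(checkpoint_dir, subdir)
        return os.path.isdir(path) and len(os.listdir(path)) > 0

    has_metadata = os.path.isfile(os.path.join(checkpoint_dir, "metadata"))
    if non_empty("offsets") and non_empty("commits") and not has_metadata:
        raise MissingMetadataFileError(
            f"Checkpoint {checkpoint_dir} has offset/commit logs but no "
            "metadata file; restarting would assign a new query ID and "
            "break exactly-once sink deduplication.")
```

Calling this before restarting a query surfaces the corruption up front, which is the user-visible behavior the new error condition provides.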
…up, FMGWS and SessionWindow Operators
### What changes were proposed in this pull request?
Add repartition integration tests for the Aggregate, Dedup, FMGWS, and SessionWindow operators, as well as for a query that contains multiple operators. The tests verify that:
- state data is correct after rewrite
- a query resumed after repartition loads the correct state data
- the repartition batch has the correct metadata

The dimensions this test covers:
- increase/decrease partition count
- enable/disable changelog checkpointing
- enable/disable checkpoint id
- state version for both the Aggregate and FMGWS operators

### Why are the changes needed?
We need integration tests to ensure data correctness for repartitioning on a complete set of operators.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
See added tests.
### Was this patch authored or co-authored using generative AI tooling?
Yes
Closes #53827 from zifeif2/repartition-test.
Authored-by: zifeif2 <[email protected]>
Signed-off-by: Anish Shrigondekar <[email protected]>
### What changes were proposed in this pull request?
This PR adds support for Tuple sketches to Spark SQL, based on the Apache DataSketches Tuple library. Tuple sketches extend the functionality of Theta sketches by associating summary values with each unique key, enabling efficient approximate computations for cardinality estimation with aggregated metadata across multiple dimensions. They provide a compact, probabilistic data structure that maintains both distinct key counts and associated summary values with bounded memory usage and strong accuracy guarantees.

It introduces 18 new SQL functions, 6 of which are aggregates and the rest scalar.

Jira: https://issues.apache.org/jira/browse/SPARK-54179

Note: The * could be either of the variants 'double' or 'integer'.

#### Aggregate Functions
`tuple_sketch_agg_*(key, summary, lgNomEntries, mode)` - Creates a tuple sketch from key-summary pairs. Parameters:
- key: the key field for distinct counting; supports `INT`, `LONG`, `FLOAT`, `DOUBLE`, `STRING`, `BINARY`, `ARRAY[INT]`, `ARRAY[LONG]`
- summary: the associated value; types `DOUBLE` or `INT`
- lgNomEntries *(optional, default = 12)*: log-base-2 of nominal entries (`4–26`)
- mode *(optional, default = 'sum')*: aggregation mode (`'sum'`, `'min'`, `'max'`, `'alwaysone'`)

`tuple_union_agg_*(sketch, lgNomEntries, mode)` - Unions multiple tuple sketch binary representations
`tuple_intersection_agg_*(sketch, mode)` - Intersects multiple tuple sketch binary representations

#### Sketch Inspection Functions
`tuple_sketch_estimate_*(sketch)` - Returns the estimated number of unique keys in the sketch
`tuple_sketch_theta_*(sketch)` - Returns the theta value of the sketch
`tuple_sketch_summary_*(sketch, mode)` - Aggregates all summary values from the sketch according to the specified mode

#### Set Operation Functions
`tuple_union_*(sketch1, sketch2, lgNomEntries, mode)` - Unions two tuple sketches
`tuple_intersection_*(sketch1, sketch2, mode)` - Intersects two tuple sketches
`tuple_difference_*(sketch1, sketch2)` - Computes A-NOT-B (elements in sketch1 but not in sketch2)

### Why are the changes needed?
Spark currently lacks support for tuple sketches, which enable approximate computations on key-value data. Tuple sketches provide:
- O(k) space complexity - bounded memory usage based on the sketch size parameter, not the data size
- High accuracy - configurable error bounds with proven theoretical guarantees
- Fast queries - efficient cardinality and summary estimation
- Mergeability - sketches can be combined for distributed aggregation across partitions
- Multi-dimensional analysis - track both distinct counts and associated metadata in one structure

### Does this PR introduce _any_ user-facing change?
Yes, it introduces 18 new SQL functions.

### How was this patch tested?
SQL Golden File Tests: added tuplesketches.sql with test queries covering:
- All summary types (double, integer)
- Multiple key types (INT, LONG, FLOAT, DOUBLE, STRING, BINARY, arrays)
- All aggregation modes (sum, min, max, alwaysone)
- NULL value handling (verified NULLs are ignored)
- Sketch aggregation, union, and intersection operations
- Set operations (union, intersection, difference), including tuple-theta interoperability
- Sketch size configuration (lgNomEntries parameter)
- Approximate result validation using tolerance-based comparisons
- Negative tests for error conditions (invalid parameters, type mismatches, invalid binary data, incompatible summary types)

### Was this patch authored or co-authored using generative AI tooling?
Yes, used claude-sonnet-4.5 for testing and refactoring.
Closes #52883 from cboumalh/cboumalh-tuple-sketches.
Lead-authored-by: Chris Boumalhab <[email protected]>
Co-authored-by: Chris Boumalhab <[email protected]>
Signed-off-by: Daniel Tenedorio <[email protected]>
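The per-key aggregation modes above (`'sum'`, `'min'`, `'max'`, `'alwaysone'`) can be illustrated with a toy, exact Python sketch. This is not the DataSketches library and `tuple_aggregate` is a hypothetical helper: a real tuple sketch retains only a probabilistic sample of keys (governed by `lgNomEntries` and theta), but the rule for combining summary values of a repeated key behaves as shown.

```python
# Toy, exact illustration of tuple-sketch summary aggregation modes
# (hypothetical helper, not the Apache DataSketches library).

MODES = {
    "sum": lambda a, b: a + b,
    "min": min,
    "max": max,
    "alwaysone": lambda a, b: 1,
}

def tuple_aggregate(pairs, mode="sum"):
    """Combine (key, summary) pairs per distinct key using the given mode."""
    combine = MODES[mode]
    out = {}
    for key, summary in pairs:
        if mode == "alwaysone":
            summary = 1  # this mode pins every summary to 1
        out[key] = combine(out[key], summary) if key in out else summary
    return out

pairs = [("a", 3), ("b", 5), ("a", 2)]
assert tuple_aggregate(pairs, "sum") == {"a": 5, "b": 5}
assert tuple_aggregate(pairs, "min") == {"a": 2, "b": 5}
assert tuple_aggregate(pairs, "alwaysone") == {"a": 1, "b": 1}

# The analogue of tuple_sketch_estimate_* here is exact: the distinct-key count.
assert len(tuple_aggregate(pairs)) == 2
```

In the real functions the result is approximate and carried in a binary sketch, which is why the union/intersection/difference operators exist for merging sketches across partitions.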
### What changes were proposed in this pull request?
Fix golden file prefix.
### Why are the changes needed?
The file names are incorrect.
### Does this PR introduce _any_ user-facing change?
no, test-only
### How was this patch tested?
CI
### Was this patch authored or co-authored using generative AI tooling?
no
Closes #53972 from zhengruifeng/py_input_follow_up.
Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
…test images
### What changes were proposed in this pull request?
Remove unused packages from python 3.11 test images
### Why are the changes needed?
to free up disk space for testing
### Does this PR introduce _any_ user-facing change?
no, infra-only
### How was this patch tested?
PR builder with
```
default: '{"PYSPARK_IMAGE_TO_TEST": "python-311", "PYTHON_TO_TEST": "python3.11"}'
```
https://github.com/zhengruifeng/spark/runs/61429429636
### Was this patch authored or co-authored using generative AI tooling?
no
Closes #53962 from zhengruifeng/image_clean_up_311.
Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
…test images
### What changes were proposed in this pull request?
Remove unused packages from python 3.10 test images
### Why are the changes needed?
to free up disk space
### Does this PR introduce _any_ user-facing change?
no, infra-only
### How was this patch tested?
PR builder with
```
default: '{"PYSPARK_IMAGE_TO_TEST": "python-310", "PYTHON_TO_TEST": "python3.10"}'
```
https://github.com/zhengruifeng/spark/runs/61429591520
### Was this patch authored or co-authored using generative AI tooling?
no
Closes #53961 from zhengruifeng/image_clean_up_310.
Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
Created by
pull[bot] (v2.0.0-alpha.4)