[pull] master from apache:master #1235
Merged
### What changes were proposed in this pull request?
Introduce the offline state repartition API and checkpoint manager in PySpark.
### Why are the changes needed?
For offline state repartitioning.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
New test suite added.
### Was this patch authored or co-authored using generative AI tooling?
Claude-4.5-opus
Closes #53931 from micheal-o/micheal-okutubo_data/repart_py_api.
Authored-by: micheal-o <[email protected]>
Signed-off-by: Anish Shrigondekar <[email protected]>
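The idea behind offline state repartitioning can be illustrated with a toy, pure-Python sketch. This is not the actual PySpark API from the PR; the function names here (`partition_state`, `repartition_state`) are hypothetical. The point is that state rows live in hash-partitioned buckets, and an offline rewrite moves every row into a layout with a different partition count without losing or duplicating any row.

```python
# Toy illustration of offline state repartitioning (hypothetical helpers,
# not the real PySpark API): state rows are hash-partitioned by key, and
# an offline rewrite re-buckets them under a new partition count.

def partition_state(rows, num_partitions):
    """Hash-partition (key, value) rows into num_partitions buckets."""
    buckets = [[] for _ in range(num_partitions)]
    for key, value in rows:
        buckets[hash(key) % num_partitions].append((key, value))
    return buckets

def repartition_state(buckets, new_num_partitions):
    """Offline rewrite: flatten existing buckets, re-bucket under the new count."""
    all_rows = [row for bucket in buckets for row in bucket]
    return partition_state(all_rows, new_num_partitions)

state = partition_state([("a", 1), ("b", 2), ("c", 3)], num_partitions=4)
resized = repartition_state(state, new_num_partitions=2)

# Every row survives the rewrite exactly once.
assert sorted(r for b in resized for r in b) == [("a", 1), ("b", 2), ("c", 3)]
```

In the real feature the rewrite operates on checkpointed state store files rather than in-memory lists, but the invariant is the same: the repartitioned state must contain exactly the rows of the original.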
### What changes were proposed in this pull request?
This PR adds validation when accessing the checkpoint to detect this inconsistent state and throw an error before the query can start with a new query ID.
### Why are the changes needed?
When a streaming checkpoint directory has non-empty offset and commit logs but is missing the metadata file (containing the streaming query ID), the query will generate a new UUID on restart. This breaks the deduplication mechanism of exactly-once sinks, which rely on the streaming query ID to skip already-processed batches, leading to data duplication.
### Does this PR introduce _any_ user-facing change?
Yes. There is a new error condition, MISSING_METADATA_FILE, which occurs when a streaming checkpoint directory has non-empty offset and commit logs but is missing the metadata file.
### How was this patch tested?
Unit tests and integration tests are added.
### Was this patch authored or co-authored using generative AI tooling?
Used to assist in writing the test suite.
Generated-by: Claude Sonnet 4.5
Closes #53844 from jerrytq/jerry-zheng_data/SPARK-55058.
Authored-by: Jerry Zheng <[email protected]>
Signed-off-by: Anish Shrigondekar <[email protected]>
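The shape of the new validation can be sketched in plain Python. This is a simplified illustration, not Spark's actual implementation; `validate_checkpoint` and the directory layout assumptions here are hypothetical stand-ins for Spark's checkpoint handling. The rule is: if the offset and commit logs are non-empty but the metadata file recording the query ID is absent, fail fast instead of silently minting a new query ID.

```python
import os

class MissingMetadataFileError(Exception):
    """Raised when a checkpoint has progress logs but no metadata file."""

def validate_checkpoint(checkpoint_dir):
    """Sketch of the MISSING_METADATA_FILE check (hypothetical helper).

    A checkpoint with non-empty offsets/ and commits/ logs must also have
    the metadata file that records the streaming query ID; otherwise a
    restart would assign a fresh UUID and break exactly-once sink dedup.
    """
    def non_empty(subdir):
        path = os.path.join(checkpoint_dir, subdir)
        return os.path.isdir(path) and len(os.listdir(path)) > 0

    has_metadata = os.path.isfile(os.path.join(checkpoint_dir, "metadata"))
    if non_empty("offsets") and non_empty("commits") and not has_metadata:
        raise MissingMetadataFileError(
            f"Checkpoint {checkpoint_dir} has offset/commit logs but no "
            "metadata file; restarting would assign a new query ID and "
            "break exactly-once sink deduplication.")
```

Calling this before restarting a query surfaces the corruption up front, which is the user-visible behavior the new error condition provides.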
…up, FMGWS and SessionWindow Operators
### What changes were proposed in this pull request?
Add repartition integration tests for the Aggregate, Dedup, FMGWS, and SessionWindow operators, as well as for a query that contains multiple operators. The tests verify that:
- state data is correct after rewrite
- a query resumed after repartition loads the correct state data
- the repartition batch has the correct metadata

The dimensions this test covers:
- increase/decrease partition count
- enable/disable changelog checkpointing
- enable/disable checkpoint id
- state version for both the Aggregate and FMGWS operators

### Why are the changes needed?
We need integration tests to ensure data correctness for repartitioning on a complete set of operators.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
See added tests.
### Was this patch authored or co-authored using generative AI tooling?
Yes
Closes #53827 from zifeif2/repartition-test.
Authored-by: zifeif2 <[email protected]>
Signed-off-by: Anish Shrigondekar <[email protected]>
### What changes were proposed in this pull request?
This PR adds support for Tuple sketches to Spark SQL, based on the Apache DataSketches Tuple library. Tuple sketches extend the functionality of Theta sketches by associating summary values with each unique key, enabling efficient approximate computations for cardinality estimation with aggregated metadata across multiple dimensions. They provide a compact, probabilistic data structure that maintains both distinct key counts and associated summary values with bounded memory usage and strong accuracy guarantees.

It introduces 18 new SQL functions, 6 of which are aggregates and the rest scalar.

Jira: https://issues.apache.org/jira/browse/SPARK-54179

Note: The * could be either of the variants 'double' or 'integer'.

#### Aggregate Functions
`tuple_sketch_agg_*(key, summary, lgNomEntries, mode)` - Creates a tuple sketch from key-summary pairs. Parameters:
- key: the key field for distinct counting; supports `INT`, `LONG`, `FLOAT`, `DOUBLE`, `STRING`, `BINARY`, `ARRAY[INT]`, `ARRAY[LONG]`
- summary: the associated value; types `DOUBLE` or `INT`
- lgNomEntries *(optional, default = 12)*: log-base-2 of nominal entries (`4–26`)
- mode *(optional, default = 'sum')*: aggregation mode (`'sum'`, `'min'`, `'max'`, `'alwaysone'`)

`tuple_union_agg_*(sketch, lgNomEntries, mode)` - Unions multiple tuple sketch binary representations
`tuple_intersection_agg_*(sketch, mode)` - Intersects multiple tuple sketch binary representations

#### Sketch Inspection Functions
`tuple_sketch_estimate_*(sketch)` - Returns the estimated number of unique keys in the sketch
`tuple_sketch_theta_*(sketch)` - Returns the theta value of the sketch
`tuple_sketch_summary_*(sketch, mode)` - Aggregates all summary values from the sketch according to the specified mode

#### Set Operation Functions
`tuple_union_*(sketch1, sketch2, lgNomEntries, mode)` - Unions two tuple sketches
`tuple_intersection_*(sketch1, sketch2, mode)` - Intersects two tuple sketches
`tuple_difference_*(sketch1, sketch2)` - Computes A-NOT-B (elements in sketch1 but not in sketch2)

### Why are the changes needed?
Spark currently lacks support for tuple sketches, which enable approximate computations on key-value data. Tuple sketches provide:
- O(k) space complexity - bounded memory usage based on the sketch size parameter, not the data size
- High accuracy - configurable error bounds with proven theoretical guarantees
- Fast queries - efficient cardinality and summary estimation
- Mergeability - sketches can be combined for distributed aggregation across partitions
- Multi-dimensional analysis - track both distinct counts and associated metadata in one structure

### Does this PR introduce _any_ user-facing change?
Yes, it introduces 18 new SQL functions.

### How was this patch tested?
SQL Golden File Tests: added tuplesketches.sql with test queries covering:
- All summary types (double, integer)
- Multiple key types (INT, LONG, FLOAT, DOUBLE, STRING, BINARY, arrays)
- All aggregation modes (sum, min, max, alwaysone)
- NULL value handling (verified NULLs are ignored)
- Sketch aggregation, union, and intersection operations
- Set operations (union, intersection, difference), including tuple-theta interoperability
- Sketch size configuration (lgNomEntries parameter)
- Approximate result validation using tolerance-based comparisons
- Negative tests for error conditions (invalid parameters, type mismatches, invalid binary data, incompatible summary types)

### Was this patch authored or co-authored using generative AI tooling?
Yes, used claude-sonnet-4.5 for testing and refactoring.
Closes #52883 from cboumalh/cboumalh-tuple-sketches.
Lead-authored-by: Chris Boumalhab <[email protected]>
Co-authored-by: Chris Boumalhab <[email protected]>
Signed-off-by: Daniel Tenedorio <[email protected]>
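The per-key aggregation modes above (`'sum'`, `'min'`, `'max'`, `'alwaysone'`) can be illustrated with a toy, exact Python sketch. This is not the DataSketches library and `tuple_aggregate` is a hypothetical helper: a real tuple sketch retains only a probabilistic sample of keys (governed by `lgNomEntries` and theta), but the rule for combining summary values of a repeated key behaves as shown.

```python
# Toy, exact illustration of tuple-sketch summary aggregation modes
# (hypothetical helper, not the Apache DataSketches library).

MODES = {
    "sum": lambda a, b: a + b,
    "min": min,
    "max": max,
    "alwaysone": lambda a, b: 1,
}

def tuple_aggregate(pairs, mode="sum"):
    """Combine (key, summary) pairs per distinct key using the given mode."""
    combine = MODES[mode]
    out = {}
    for key, summary in pairs:
        if mode == "alwaysone":
            summary = 1  # this mode pins every summary to 1
        out[key] = combine(out[key], summary) if key in out else summary
    return out

pairs = [("a", 3), ("b", 5), ("a", 2)]
assert tuple_aggregate(pairs, "sum") == {"a": 5, "b": 5}
assert tuple_aggregate(pairs, "min") == {"a": 2, "b": 5}
assert tuple_aggregate(pairs, "alwaysone") == {"a": 1, "b": 1}

# The analogue of tuple_sketch_estimate_* here is exact: the distinct-key count.
assert len(tuple_aggregate(pairs)) == 2
```

In the real functions the result is approximate and carried in a binary sketch, which is why the union/intersection/difference operators exist for merging sketches across partitions.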
### What changes were proposed in this pull request?
Fix golden file prefix.
### Why are the changes needed?
The file names are incorrect.
### Does this PR introduce _any_ user-facing change?
no, test-only
### How was this patch tested?
CI
### Was this patch authored or co-authored using generative AI tooling?
no
Closes #53972 from zhengruifeng/py_input_follow_up.
Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
…test images
### What changes were proposed in this pull request?
Remove unused packages from python 3.11 test images
### Why are the changes needed?
to free up disk space for testing
### Does this PR introduce _any_ user-facing change?
no, infra-only
### How was this patch tested?
PR builder with
```
default: '{"PYSPARK_IMAGE_TO_TEST": "python-311", "PYTHON_TO_TEST": "python3.11"}'
```
https://github.com/zhengruifeng/spark/runs/61429429636
### Was this patch authored or co-authored using generative AI tooling?
no
Closes #53962 from zhengruifeng/image_clean_up_311.
Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
…test images
### What changes were proposed in this pull request?
Remove unused packages from python 3.10 test images
### Why are the changes needed?
to free up disk space
### Does this PR introduce _any_ user-facing change?
no, infra-only
### How was this patch tested?
PR builder with
```
default: '{"PYSPARK_IMAGE_TO_TEST": "python-310", "PYTHON_TO_TEST": "python3.10"}'
```
https://github.com/zhengruifeng/spark/runs/61429591520
### Was this patch authored or co-authored using generative AI tooling?
no
Closes #53961 from zhengruifeng/image_clean_up_310.
Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
Created by
pull[bot] (v2.0.0-alpha.4)