[SPARK-53401][SQL] Enable Direct Passthrough Partitioning in the DataFrame API #52153

shujingyang-db · 2025-08-28T00:31:06Z

What changes were proposed in this pull request?

Currently, Spark's DataFrame repartition() API only supports hash-based and range-based partitioning strategies. Users who need precise control over which partition each row goes to (similar to RDD's partitionBy with custom partitioners) have no direct way to achieve this at the DataFrame level.

This PR introduces a new DataFrame API, repartitionById(numPartitions, partitionIdExpr), which lets users directly specify the target partition ID of each row when repartitioning a DataFrame:

// Partition rows based on a computed partition ID
val df = spark.range(100).withColumn("partition_id", col("id") % 10)
val repartitioned = df.repartitionById(10, $"partition_id")
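
To illustrate what "pass-through" means here, the following is a minimal sketch in plain Scala, without Spark: the caller-computed ID is used directly as the target partition index instead of being hashed. The object PassThroughSketch and its assign helper are hypothetical names invented for this illustration, and the range check is an assumption modeled on the out-of-bounds tests discussed in this PR, not the actual implementation.

```scala
object PassThroughSketch {
  // Hypothetical helper (not part of Spark): place each row in the bucket
  // named by its caller-computed partition ID, with no hashing involved.
  def assign[T](rows: Seq[T], numPartitions: Int)(partitionId: T => Int): Vector[Vector[T]] = {
    // One independent builder per target partition.
    val buckets = Vector.fill(numPartitions)(Vector.newBuilder[T])
    rows.foreach { row =>
      val id = partitionId(row)
      // Assumption: IDs outside [0, numPartitions) are rejected.
      require(id >= 0 && id < numPartitions,
        s"partition id $id is outside [0, $numPartitions)")
      buckets(id) += row
    }
    buckets.map(_.result())
  }
}
```

For example, PassThroughSketch.assign(0 until 100, 10)(_ % 10) places every value whose last digit is 3 in bucket 3, mirroring the partition_id column in the snippet above.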

Why are the changes needed?

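To provide a better DataFrame API: users who need RDD-style control over partition placement (partitionBy with a custom partitioner) can get it without dropping down to the RDD API.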

Does this PR introduce any user-facing change?

Yes.

How was this patch tested?

New Unit Tests in DataFrameSuite

Was this patch authored or co-authored using generative AI tooling?

No

zhengruifeng · 2025-08-28T03:06:13Z

sql/api/src/main/scala/org/apache/spark/sql/functions.scala

@@ -2045,6 +2045,19 @@ object functions {
   */
  def spark_partition_id(): Column = Column.fn("spark_partition_id")

+  /**
+   * Returns the partition ID specified by the given column expression for direct shuffle
+   * partitioning. The input expression must evaluate to an integral type and must not be null.


Will this partition ID be changed by AQE?

itskals · 2025-08-28T18:24:00Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala

+ *
+ * This partitioning maps directly to the PartitionIdPassthrough RDD partitioner.
+ */
+case class ShufflePartitionIdPassThrough(


Could creating this on a column with high cardinality lead to a sudden increase in partitions? Will subsequent AQE rules try to act and reduce the number of partitions?

No, it will not reuse or remove shuffles. This is mainly meant to replace RDD's Partitioner API so that people can migrate completely to the DataFrame API. In terms of performance and efficiency, it won't be especially useful.

cloud-fan · 2025-08-29T09:23:32Z

sql/core/src/main/scala/org/apache/spark/sql/classic/Dataset.scala

+   * @group typedrel
+   * @since 4.1.0
+   */
+  def repartitionById(partitionIdExpr: Column): Dataset[T] = {


I feel it's risky to provide a default numPartitions. Can we always ask users to specify numPartitions?

cloud-fan · 2025-08-29T09:24:36Z

sql/core/src/main/scala/org/apache/spark/sql/classic/Dataset.scala

+   */
+  def repartitionById(numPartitions: Int, partitionIdExpr: Column): Dataset[T] = {
+    val directShufflePartitionIdCol = Column(DirectShufflePartitionID(partitionIdExpr.expr))
+    repartitionByExpression(Some(numPartitions), Seq(directShufflePartitionIdCol))


We can create RepartitionByExpression directly with a special flag to indicate pass through, then we don't need DirectShufflePartitionID.

cloud-fan · 2025-08-29T09:26:54Z

sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala

+    val e = intercept[SparkException] {
+      repartitioned.collect()
+    }
+    assert(e.getCause.isInstanceOf[IllegalArgumentException])


What's the actual error? If the error message is not clear, we should do an explicit null check, or simply treat null as partition ID 0.
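
The two options raised in this comment can be sketched in plain Scala, modeling a nullable partition ID as an Option[Int]. This is only an illustration of the trade-off; NullPartitionIdSketch, strict, and lenient are hypothetical names, and the error message text is an assumption rather than anything taken from the patch.

```scala
object NullPartitionIdSketch {
  // Option 1: explicit null check that fails with a descriptive message.
  def strict(partitionId: Option[Int]): Int =
    partitionId.getOrElse(
      throw new IllegalArgumentException("The partition ID expression must not be null."))

  // Option 2: silently route rows with a null partition ID to partition 0.
  def lenient(partitionId: Option[Int]): Int =
    partitionId.getOrElse(0)
}
```

The strict variant surfaces bad input early with a clear message, while the lenient variant never fails but can concentrate all null rows in partition 0.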

cloud-fan · 2025-08-29T09:28:09Z

sql/core/src/test/scala/org/apache/spark/sql/execution/PlannerSuite.scala

@@ -1406,6 +1406,87 @@ class PlannerSuite extends SharedSparkSession with AdaptiveSparkPlanHelper {
    assert(planned.exists(_.isInstanceOf[GlobalLimitExec]))
    assert(planned.exists(_.isInstanceOf[LocalLimitExec]))
  }
+
+  test("SPARK-53401: repartitionById should throw an exception for negative partition id") {


Hmm, shall we use pmod then? Then the partition ID is always non-negative; see https://docs.databricks.com/aws/en/sql/language-manual/functions/pmod
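
As a small illustration of the pmod suggestion, here is a sketch in plain Scala of a positive modulus next to the built-in % operator. PmodSketch is a hypothetical name for this example; it is not the Spark expression itself, only the same arithmetic for Int values and a positive divisor.

```scala
object PmodSketch {
  // Positive modulus: for n > 0 the result always lies in [0, n),
  // whereas a % n keeps the sign of a.
  def pmod(a: Int, n: Int): Int = {
    val r = a % n
    if (r < 0) r + n else r
  }
}
```

With 10 partitions, -5 % 10 evaluates to -5, which is not a valid partition index, while PmodSketch.pmod(-5, 10) evaluates to 5. An ID of 15 likewise maps to 5, so after pmod neither a negative ID nor an ID at or above numPartitions can reach the partitioner.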

cloud-fan · 2025-08-29T09:28:34Z

sql/core/src/test/scala/org/apache/spark/sql/execution/PlannerSuite.scala

+    assert(e.getMessage.contains("Index -5 out of bounds"))
+  }
+
+  test("SPARK-53401: repartitionById should throw an exception for partition id >= numPartitions") {


wait, how can this happen if we do mod/pmod?

cloud-fan · 2025-08-29T09:29:54Z

sql/core/src/test/scala/org/apache/spark/sql/execution/PlannerSuite.scala

+    val df = spark.range(100).select($"id" % 10 as "key", $"id" as "value")
+    val grouped =
+      df.repartitionById(10, $"key")
+        .filter($"value" > 50).groupBy($"key").count()


so what this test proves is that Filter can propagate child's output partitioning, which is already proven by other tests and we don't need to verify it again here.

cloud-fan · 2025-08-29T09:30:16Z

sql/core/src/test/scala/org/apache/spark/sql/execution/PlannerSuite.scala

+    checkShuffleCount(grouped, 1)
+  }
+
+  test("SPARK-53401: shuffle reuse after a join that preserves partitioning") {


I think a more interesting test is to prove that a join with id pass-through and hash partitioning will still do a shuffle on the id pass-through side.

shujingyang-db added 2 commits August 27, 2025 15:35

init

d1fe3da

ckp

7146dd8

github-actions bot added the SQL label Aug 28, 2025

HyukjinKwon changed the title ~~[DRAFT][ SPARK-53401] Enable Direct Passthrough Partitioning in the DataFrame API~~ [DRAFT][SPARK-53401] Enable Direct Passthrough Partitioning in the DataFrame API Aug 28, 2025

fix

b3f2a94

zhengruifeng reviewed Aug 28, 2025

shujingyang-db marked this pull request as ready for review August 28, 2025 07:07

shujingyang-db changed the title ~~[DRAFT][SPARK-53401] Enable Direct Passthrough Partitioning in the DataFrame API~~ [SPARK-53401] Enable Direct Passthrough Partitioning in the DataFrame API Aug 28, 2025

itskals reviewed Aug 28, 2025

shujingyang-db and others added 7 commits August 28, 2025 15:09

repartitionById ckp

fad6256

Merge remote-tracking branch 'spark/master' into direct-partitionId-p…

7be523b

…ass-through

add more tests

53ce88a

add todos

84bafd8

Update PlannerSuite.scala

6a13e3b

Update PlannerSuite.scala

228ca21

Update PlannerSuite.scala

599a3d6

HyukjinKwon approved these changes Aug 29, 2025

cloud-fan reviewed Aug 29, 2025

HyukjinKwon changed the title ~~[SPARK-53401] Enable Direct Passthrough Partitioning in the DataFrame API~~ [SPARK-53401][SQL] Enable Direct Passthrough Partitioning in the DataFrame API Aug 30, 2025
