[SPARK-50994][SQL] Perform RDD conversion under tracked execution #49678

BOOTMGR · 2025-01-26T09:45:31Z

What changes were proposed in this pull request?

A new lazy variable materializedRdd is introduced which actualyl holds RDD after it is created (by executing plan).
Dataset#rdd is wrapped within withNewRDDExecutionId, which takes care of important setup tasks, like updating Spark properties in SparkContext's thread-locals, before executing the SparkPlan to fetch data
Dataset#rdd acts like any other RDD operations like reduce or foreachPartition and operates on materializedRdd with new execution id (and initialising it if not done yet)

Why are the changes needed?

When Dataset is converted into RDD, It executes SpakPlan without any execution context. This leads to:

No tracking is available on Spark UI for stages which are necessary to build the RDD.
Spark properties which are local to thread may not be set in the RDD execution context. This leads to these properties not being sent with TaskContext but some operations like reading parquet files depend on these properties (eg, case-sesitivity).

Test scenario:

test("SPARK-50994: RDD conversion is performed with execution context") {
    withSQLConf(SQLConf.CASE_SENSITIVE.key -> "true") {
      withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key -> "false") {
        withTempDir(dir => {
          val dummyDF = Seq((1, 1.0), (2, 2.0), (3, 3.0), (1, 1.0)).toDF("a", "A")
          dummyDF.write.format("parquet").mode("overwrite").save(dir.getCanonicalPath)

          val df = spark.read.parquet(dir.getCanonicalPath)
          val encoder = ExpressionEncoder(df.schema)
          val deduplicated = df.dropDuplicates(Array("a"))
          val df2 = deduplicated.flatMap(row => Seq(row))(encoder).rdd

          val output = spark.createDataFrame(df2, df.schema)
          checkAnswer(output, Seq(Row(1, 1.0), Row(2, 2.0), Row(3, 3.0)))
        })
      }
    }
  }

In the above scenario,

Call to .rdd triggers execution which performs shuffle after reading parquet
However, while reading parquet file spark.sql.caseSensitive is not set (even though it is passed during session creation) which is referred into SQLConf by parquet-mr reader
This leads to unexpected and wrong result of dropDuplicates as it would drop duplicates by either a or 'A'. Expectation is to drop duplicates by column a
This behaviour is not applicable to vectorized parquet reader because it reads case-sensitivity flag from hadoopContext hence is disabled.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing testcases & new test case added for specific scenario

Was this patch authored or co-authored using generative AI tooling?

No

Correct because `checkAnswer` in testcase calls `rdd.count()` which is now a tracked operation and Spark event listener is invoked for the same

BOOTMGR · 2025-01-26T14:45:41Z

Marking WIP, this would require some more work around event listeners and observable due to exposure of RDD stages.

`materializedRdd` is the actual holder which is initialized on-demand by operations like `.rdd`, `foreachPartition` etc.

BOOTMGR · 2025-01-27T09:46:51Z

Ready for view

Perform RDD conversion under tracked execution

88b8b1b

github-actions bot added the SQL label Jan 26, 2025

BOOTMGR changed the title ~~Perform RDD conversion under tracked execution~~ SPARK-50994: Perform RDD conversion under tracked execution Jan 26, 2025

BOOTMGR changed the title ~~SPARK-50994: Perform RDD conversion under tracked execution~~ [SPARK-50994][SQL] Perform RDD conversion under tracked execution Jan 26, 2025

Fix failing test case

0c93694

Correct because `checkAnswer` in testcase calls `rdd.count()` which is now a tracked operation and Spark event listener is invoked for the same

BOOTMGR changed the title ~~[SPARK-50994][SQL] Perform RDD conversion under tracked execution~~ [SPARK-50994][SQL][WIP] Perform RDD conversion under tracked execution Jan 26, 2025

BOOTMGR added 2 commits January 27, 2025 09:17

Behave Dataset#rdd like any other operations on RDD

6388c61

`materializedRdd` is the actual holder which is initialized on-demand by operations like `.rdd`, `foreachPartition` etc.

style fix

97184ff

BOOTMGR changed the title ~~[SPARK-50994][SQL][WIP] Perform RDD conversion under tracked execution~~ [SPARK-50994][SQL] Perform RDD conversion under tracked execution Jan 27, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-50994][SQL] Perform RDD conversion under tracked execution #49678

[SPARK-50994][SQL] Perform RDD conversion under tracked execution #49678

BOOTMGR commented Jan 26, 2025 •

edited

Loading

BOOTMGR commented Jan 26, 2025 •

edited

Loading

BOOTMGR commented Jan 27, 2025

[SPARK-50994][SQL] Perform RDD conversion under tracked execution #49678

Are you sure you want to change the base?

[SPARK-50994][SQL] Perform RDD conversion under tracked execution #49678

Conversation

BOOTMGR commented Jan 26, 2025 • edited Loading

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

BOOTMGR commented Jan 26, 2025 • edited Loading

BOOTMGR commented Jan 27, 2025

BOOTMGR commented Jan 26, 2025 •

edited

Loading

BOOTMGR commented Jan 26, 2025 •

edited

Loading