feat: Add experimental support for native Parquet writes #2812
Conversation
A good test would be to write with this feature enabled and then read it with and without Comet enabled.
Codecov Report

```
@@            Coverage Diff            @@
##              main    #2812     +/-  ##
============================================
+ Coverage    56.12%   59.16%   +3.03%
- Complexity     976     1477     +501
============================================
  Files          119      167      +48
  Lines        11743    15188    +3445
  Branches      2251     2523     +272
============================================
+ Hits          6591     8986    +2395
- Misses        4012     4917     +905
- Partials      1140     1285     +145
```
spark/src/main/scala/org/apache/comet/serde/operator/CometDataWritingCommand.scala
| BatchScanExec | Yes | Supports Parquet files and Apache Iceberg Parquet scans. See the [Comet Compatibility Guide] for more information. |
| BroadcastExchangeExec | Yes | |
| BroadcastHashJoinExec | Yes | |
| DataWritingCommandExec | No | Experimental support for native Parquet writes. Disabled by default. |
Does this also mean Iceberg writes?
Should we change to `InsertIntoHadoopFsRelationCommand` here?
Makes sense. I updated this.
```rust
pub struct ParquetWriterExec {
    /// Input execution plan
    input: Arc<dyn ExecutionPlan>,
    /// Output file path
```
is it file or folder?
It is a folder. Files named `part-*.parquet` will be created within the folder.
```rust
// Strip file:// or file: prefix if present
let local_path = output_path
    .strip_prefix("file://")
```
What about `hdfs://`?
I added a fallback for now so that it falls back to Spark if the path does not start with `file:`.
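The fallback behavior being discussed could be sketched as follows. This is an illustrative helper, not the PR's actual code: it strips a `file://` or `file:` prefix, and returns `None` for any other scheme (such as `hdfs://`) so the caller can fall back to Spark's JVM write path.

```rust
/// Illustrative helper (names are hypothetical): resolve a local
/// filesystem path from the output location, or return None so the
/// caller can fall back to Spark's JVM write path.
fn local_path(output_path: &str) -> Option<&str> {
    output_path
        .strip_prefix("file://")
        .or_else(|| output_path.strip_prefix("file:"))
        .or_else(|| {
            // No scheme at all: treat the string as a plain local path.
            // Any other scheme (hdfs://, s3a://, ...) is not handled here.
            if output_path.contains("://") {
                None
            } else {
                Some(output_path)
            }
        })
}
```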
```rust
})?;

// Generate part file name for this partition
let part_file = format!("{}/part-{:05}.parquet", local_path, self.partition_id);
```
This doesn't seem right; the extension will be different depending on the codec:
- `.snappy.parquet`
- `.gz.parquet`
- etc.
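A codec-aware variant of the naming logic, loosely following what Spark's commit protocol does, might look like this sketch (the function and codec strings here are illustrative, not the PR's code):

```rust
/// Illustrative sketch of codec-aware part-file naming, loosely
/// following Spark's HadoopMapReduceCommitProtocol.getFilename.
fn part_file_name(partition_id: u32, codec: Option<&str>) -> String {
    match codec {
        // e.g. Some("snappy") -> "part-00000.snappy.parquet"
        Some(c) => format!("part-{:05}.{}.parquet", partition_id, c),
        // Uncompressed: "part-00000.parquet"
        None => format!("part-{:05}.parquet", partition_id),
    }
}
```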
This file name is best generated by `FileCommitProtocol` later, so hardcoding it on the native side for now makes sense to me.
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L156-L163
Thanks @wForget, are you proposing to keep hardcoded names for the PR and replicate Spark's `getFilename` later?
Yes, this PR seems to be missing some work related to file commit. My proposed write process might look like this: create a staging dir -> native write files to staging dir -> file commit (move and merge staging files) -> add or update partitions.
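The commit step of the flow described above could be sketched with plain `std::fs` operations (all names here are illustrative; a real implementation would go through Spark's `FileCommitProtocol` on the JVM side):

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Illustrative sketch of the commit step: move every part file from
/// the staging directory into the final output directory, then remove
/// the (now empty) staging directory.
fn commit_task(staging: &Path, output: &Path) -> io::Result<()> {
    fs::create_dir_all(output)?;
    for entry in fs::read_dir(staging)? {
        let entry = entry?;
        // rename is atomic when staging and output share a filesystem
        fs::rename(entry.path(), output.join(entry.file_name()))?;
    }
    fs::remove_dir(staging)
}
```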
I filed #2827 for implementing the file commit protocol.
This PR adds a starting point for development. Once it is merged then other contributors can help add the missing features.
```rust
// Execute the write task and convert to a stream
use datafusion::physical_plan::stream::RecordBatchStreamAdapter;
Ok(Box::pin(RecordBatchStreamAdapter::new(
```
What if the partition failed? What would happen with the folder?
🤷 this is all highly experimental so far
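One common way to keep the output folder clean when a partition fails, sketched here with hypothetical names rather than the PR's code, is to write each part through a hidden temporary file and only rename it into place on success:

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Illustrative sketch: write a part file via a temporary name so a
/// failed partition never leaves a half-written file in the folder.
fn write_part_atomically(dir: &Path, name: &str, data: &[u8]) -> io::Result<()> {
    let tmp = dir.join(format!(".{name}.tmp"));
    let result = File::create(&tmp).and_then(|mut f| f.write_all(data));
    match result {
        // Success: publish by renaming into the final name.
        Ok(()) => fs::rename(&tmp, dir.join(name)),
        // Failure: best-effort cleanup of the temporary file.
        Err(e) => {
            let _ = fs::remove_file(&tmp);
            Err(e)
        }
    }
}
```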
```proto
message ParquetWriter {
  string output_path = 1;
```
What if it is a partitioned writer, e.g. `df.write.partitionBy()`?
With this PR, we fall back to Spark for now for partitioned writes. There are checks in `getSupportLevel`.
comphead
left a comment
Thanks @andygrove, it is a really good start.
spark/src/main/scala/org/apache/comet/serde/operator/CometDataWritingCommand.scala
…WritingCommand.scala Co-authored-by: Zhen Wang <[email protected]>
wForget
left a comment
Thanks @andygrove, LGTM.
Which issue does this PR close?
Part of #1625
Rationale for this change
We would eventually like to support native writes to Parquet. This PR adds a starting point for further development.
This is the result of vibe coding with Claude.
The goal is to add the minimum possible implementation and test. There are plenty of things that are not implemented or tested yet.
Example of new native plan:
What changes are included in this PR?
- `ParquetWriterExec`
- `CometNativeWriteExec`
- `CometExecRule`

How are these changes tested?
New suite added.