Conversation

@andygrove (Member) commented Nov 21, 2025

Which issue does this PR close?

Part of #1625

Rationale for this change

We would eventually like to support native writes to Parquet. This PR adds a starting point for further development.

This is the result of vibe coding with Claude.

The goal is to add the minimum possible implementation and test; plenty of things are not implemented or tested yet.

Example of new native plan:

ParquetWriterExec: path=file:/private/var/folders/vv/fmb1n2hx3yqdmxbrv7shzyvr0000gn/T/spark-79afc322-5315-4f47-85dc-c974eeb44d2c/output.parquet, compression=Snappy
  ScanExec: source=[write_source], schema=[col_0: Int32, col_1: Utf8]

What changes are included in this PR?

  • New native ParquetWriterExec
  • New scala CometNativeWriteExec
  • Updates to CometExecRule
  • One working test

How are these changes tested?

New suite added.

@parthchandra (Contributor):

A good test would be to write with this feature enabled and then read it with and without Comet enabled.

@codecov-commenter commented Nov 21, 2025

Codecov Report

❌ Patch coverage is 67.47967% with 40 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.16%. Comparing base (f09f8af) to head (8d9b41a).
⚠️ Report is 727 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...comet/serde/operator/CometDataWritingCommand.scala | 60.86% | 17 Missing and 10 partials ⚠️ |
| .../apache/spark/sql/comet/CometNativeWriteExec.scala | 66.66% | 10 Missing and 2 partials ⚠️ |
| ...n/scala/org/apache/comet/rules/CometExecRule.scala | 92.30% | 0 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #2812      +/-   ##
============================================
+ Coverage     56.12%   59.16%   +3.03%     
- Complexity      976     1477     +501     
============================================
  Files           119      167      +48     
  Lines         11743    15188    +3445     
  Branches       2251     2523     +272     
============================================
+ Hits           6591     8986    +2395     
- Misses         4012     4917     +905     
- Partials       1140     1285     +145     


@andygrove andygrove changed the title feat: Add experimental support for native Parquet writes [WIP] feat: Add experimental support for native Parquet writes Nov 21, 2025
@andygrove andygrove marked this pull request as ready for review November 21, 2025 22:35
| BatchScanExec | Yes | Supports Parquet files and Apache Iceberg Parquet scans. See the [Comet Compatibility Guide] for more information. |
| BroadcastExchangeExec | Yes | |
| BroadcastHashJoinExec | Yes | |
| DataWritingCommandExec | No | Experimental support for native Parquet writes. Disabled by default. |
Contributor:

Does this also mean Iceberg writes?

@wForget (Member) commented Nov 25, 2025:

Should we change to InsertIntoHadoopFsRelationCommand here?

Member Author (@andygrove):

Makes sense. I updated this.

pub struct ParquetWriterExec {
    /// Input execution plan
    input: Arc<dyn ExecutionPlan>,
    /// Output file path
Contributor:

Is it a file or a folder?

Member Author (@andygrove):

It is a folder. Files named part-*.parquet will be created within the folder.


// Strip file:// or file: prefix if present
let local_path = output_path
    .strip_prefix("file://")
Contributor:

what if hdfs:// ?

Member Author (@andygrove):

I added a check for now so that we fall back to Spark if the path does not start with file:
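A minimal sketch of that prefix handling (a hypothetical helper, not the PR's actual code): strip a `file://` or `file:` prefix, and return `None` for any other scheme so the caller can fall back to Spark.

```rust
/// Hypothetical helper: resolve a URI to a local filesystem path.
/// Returns None for non-local schemes (e.g. hdfs:// or s3://) so the
/// caller can fall back to Spark for those paths.
fn to_local_path(output_path: &str) -> Option<String> {
    if let Some(p) = output_path.strip_prefix("file://") {
        Some(p.to_string())
    } else if let Some(p) = output_path.strip_prefix("file:") {
        Some(p.to_string())
    } else if output_path.contains("://") {
        None // unsupported scheme: let Spark handle it
    } else {
        Some(output_path.to_string()) // bare local path
    }
}

fn main() {
    assert_eq!(to_local_path("file:///tmp/out").as_deref(), Some("/tmp/out"));
    assert_eq!(to_local_path("file:/tmp/out").as_deref(), Some("/tmp/out"));
    assert_eq!(to_local_path("hdfs://nn/data"), None);
    println!("{:?}", to_local_path("file:///tmp/out"));
}
```

Returning `Option` keeps the fallback decision with the caller, which matches the PR's approach of bailing out to Spark rather than erroring on unsupported schemes.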

})?;

// Generate part file name for this partition
let part_file = format!("{}/part-{:05}.parquet", local_path, self.partition_id);
Contributor:

This doesn't seem right; the extension will differ depending on the codec:

.snappy.parquet
.gz.parquet
etc.
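A sketch of the codec-to-extension convention being described (hypothetical helper; the actual mapping comes from parquet-mr's CompressionCodecName, so treat these strings as an assumption):

```rust
/// Hypothetical mapping from a compression codec name to the
/// Spark/Parquet file-name extension convention
/// (e.g. part-00000.snappy.parquet).
fn codec_extension(codec: &str) -> &'static str {
    match codec.to_ascii_lowercase().as_str() {
        "snappy" => ".snappy.parquet",
        "gzip" => ".gz.parquet",
        "zstd" => ".zstd.parquet",
        "lz4" => ".lz4.parquet",
        // uncompressed and unknown codecs get the plain extension
        _ => ".parquet",
    }
}

fn main() {
    let partition_id: u32 = 0;
    let part_file = format!("part-{:05}{}", partition_id, codec_extension("snappy"));
    assert_eq!(part_file, "part-00000.snappy.parquet");
    println!("{}", part_file);
}
```

This would replace the hardcoded `.parquet` suffix in the `format!` call above, though as noted below the real fix is to delegate naming to the file commit protocol.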

Member (@wForget):

This file name is best generated by FileCommitProtocol later, so hardcoding it on the native side for now makes sense to me.
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L156-L163

Contributor:

Thanks @wForget, are you proposing to keep the hardcoded names for this PR and replicate Spark's getFilename later?

@wForget (Member) commented Nov 24, 2025:

Yes, this PR seems to be missing some work related to the file commit. My proposed write process might look like this: create a staging dir -> native writes files to the staging dir -> file commit (move and merge staging files) -> add or update partitions
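The commit step of that proposal can be sketched with plain filesystem operations (a simplified, hypothetical illustration; the real implementation would go through Spark's FileCommitProtocol and handle failures and concurrent attempts):

```rust
use std::fs;
use std::path::Path;

/// Hypothetical sketch of the commit flow described above: tasks write
/// into a staging directory, and a final commit step moves the files
/// into the output directory.
fn commit_staged_files(staging: &Path, output: &Path) -> std::io::Result<()> {
    fs::create_dir_all(output)?;
    for entry in fs::read_dir(staging)? {
        let entry = entry?;
        let dest = output.join(entry.file_name());
        fs::rename(entry.path(), &dest)?; // atomic on the same filesystem
    }
    fs::remove_dir(staging)?; // clean up the now-empty staging dir
    Ok(())
}

fn main() -> std::io::Result<()> {
    let base = std::env::temp_dir().join("comet_commit_demo");
    let staging = base.join("_staging");
    let output = base.join("output");
    let _ = fs::remove_dir_all(&base);
    fs::create_dir_all(&staging)?;
    fs::write(staging.join("part-00000.parquet"), b"demo")?;
    commit_staged_files(&staging, &output)?;
    assert!(output.join("part-00000.parquet").exists());
    println!("committed");
    Ok(())
}
```

Writing to a staging directory first means a failed task leaves no partial files in the visible output path, which also answers the earlier question about what happens to the folder when a partition fails.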

Member Author (@andygrove):

I filed #2827 for implementing the file commit protocol.

This PR adds a starting point for development. Once it is merged then other contributors can help add the missing features.


// Execute the write task and convert to a stream
use datafusion::physical_plan::stream::RecordBatchStreamAdapter;
Ok(Box::pin(RecordBatchStreamAdapter::new(
Contributor:

what if the partition failed? what would happen with the folder?

Member Author (@andygrove):

🤷 this is all highly experimental so far

}

message ParquetWriter {
  string output_path = 1;
Contributor:

What if it is a partitioned write? df.write.partitionBy()

Member Author (@andygrove):

With this PR, we fall back to Spark for now for partitioned writes. There are checks in getSupportLevel.

@comphead (Contributor) left a comment:

Thanks @andygrove, it is a really good start.

@wForget (Member) left a comment:

Thanks @andygrove , lgtm

@andygrove (Member Author):
Thanks for the reviews @comphead and @wForget. I'm going to go ahead and merge this and will have a draft PR up today for file commit protocol.

@andygrove andygrove merged commit 1ec3563 into apache:main Nov 26, 2025
115 checks passed
@andygrove andygrove deleted the parquet-write-poc branch November 26, 2025 13:47