
Add implementation status for cuDF #99

Merged: 8 commits into apache:production on Feb 5, 2025

Conversation

@mhaseeb123 (Contributor) commented on Jan 29, 2025:

This PR adds the implementation status for cuDF to the Parquet site.

@mhaseeb123 changed the title from "Add implementation status for cuDF" to "🚧 Add implementation status for cuDF" on Jan 29, 2025
| External column data (1) | | | | | (W) |
| Row group "Sorting column" metadata (2) | | | | | (W) |
| Row group pruning using statistics | | | | | ✅ |
| Row group pruning using bloom filter | | | | | ✅ |

@mhaseeb123 (Contributor, author) commented:

Please correct me if I am wrong, but I believe the bloom filters are used to prune row groups rather than pages.
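
(For context, a minimal Python sketch of what row-group-level pruning looks like from the cuDF user side. The file name is made up, and the `row_group_size_rows` and `filters` arguments are assumptions about cuDF's Python API, not something taken from this PR.)

```python
import numpy as np
import cudf

# Write a table split into several row groups; the writer records per-row-group
# column statistics (min/max) in the file footer.
df = cudf.DataFrame({"id": np.arange(1_000_000)})
df.to_parquet("data.parquet", row_group_size_rows=100_000)

# With a filter predicate, row groups whose statistics (or bloom filter, when
# one is present for an equality predicate) cannot match are skipped before any
# of their pages are decoded -- pruning happens at row-group granularity.
subset = cudf.read_parquet("data.parquet", filters=[("id", "==", 123_456)])
print(len(subset))  # only data from row groups that could not be pruned
```
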

@mhaseeb123 requested a review from vuule on January 29, 2025 at 20:58
@mhaseeb123 marked this pull request as ready for review on January 30, 2025 at 23:26
@mhaseeb123 changed the title from "🚧 Add implementation status for cuDF" to "Add implementation status for cuDF" on Jan 30, 2025
@@ -13,94 +13,96 @@ implementations.
The value in each box means:
* ✅: supported
* ❌: not supported
* (R/W): partial reader/writer only support

@mhaseeb123 (Contributor, author) commented:

Added an extra item to the legend to allow partial reader-only or writer-only support. Happy to remove it and leave the corresponding boxes blank if needed.

@wgtmac (Member) left a comment:

Thanks for the update!

* (blank) no data

Implementations:
* `C++`: [parquet-cpp](https://github.com/apache/arrow/tree/main/cpp/src/parquet)
* `Java`: [parquet-java](https://github.com/apache/parquet-java)
* `Go`: [parquet-go](https://github.com/apache/arrow-go/tree/main/parquet)
* `Rust`: [parquet-rs](https://github.com/apache/arrow-rs/blob/main/parquet/README.md)
* `CUDA C++`: [cudf](https://github.com/rapidsai/cudf)

A Member commented:

Should this be cuDF? Or is CUDA C++ the more official name for it?

@mhaseeb123 (Contributor, author) commented on Feb 3, 2025:

cuDF is the name of the implementing dataframe library, and CUDA C++ is the language used for the implementation. Isn't the convention here like:

* `language`: [impl name](link)

A Member commented:

I would prefer cuDF here. I think the original intention was to include implementations governed by the Parquet community or the Apache Software Foundation. It would be better to use the library name to encourage other Parquet implementations to appear here. WDYT? @alamb

A Collaborator commented:

I also recall that the idea here was to list library names (so this would be better as cuDF), not languages.

It just so happens that we only had one example library for each language, so there was (before this PR) a 1:1 correspondence.

Does that make sense, @mhaseeb123?

@mhaseeb123 (Contributor, author) replied:

Sounds good. I will update this.

@wgtmac (Member) commented on Feb 3, 2025:

cc @etseidl @alamb

@etseidl (Contributor) left a comment:

Thanks for getting the party started @mhaseeb123!

@mhaseeb123 requested a review from etseidl on February 3, 2025 at 23:09

@etseidl (Contributor) left a comment:

Looks good now. Thanks!

@alamb (Collaborator) commented on Feb 4, 2025:

> This PR adds the implementation status for cuDF to the Parquet site.

AMAZING! Thank you @mhaseeb123

I wonder if you had any program / script / definition of what "support" means (mostly so I can crib / copy that and file a ticket in the arrow-rs repository to get this column filled out)

@mhaseeb123 (Contributor, author) replied:

> I wonder if you had any program / script / definition of what "support" means (mostly so I can crib / copy that and file a ticket in the arrow-rs repository to get this column filled out)

Certainly. The (R) label I used means the cuDF Parquet reader supports decompressing a codec, decoding an encoding type, or reading (and using) bloom filters, but the writer cannot compress, encode, or write the corresponding codec, encoding, or bloom filter, depending on the sub-section it appears in. Similarly, a (W) label means the opposite: the writer can write a certain field or feature, but the reader is unable to read or use it.

Does that make sense?
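
(As a hedged illustration of that definition, not part of the PR: a reader-only (R) entry could be spot-checked by producing a file with another engine and reading it back with cuDF. The file name and the ZSTD codec below are placeholders for whatever feature a given table row covers, not claims about cuDF's actual support matrix.)

```python
import pandas as pd
import cudf

# Another writer produces a file exercising the feature under test
# (ZSTD compression is used here purely as a stand-in example).
pd.DataFrame({"x": [1, 2, 3]}).to_parquet("feature.parquet", compression="zstd")

# (R): the cuDF reader can decompress/decode/use the feature when reading...
gdf = cudf.read_parquet("feature.parquet")
assert len(gdf) == 3

# ...while an (R)-only entry means the equivalent cuDF write path cannot
# produce that feature; (W) is the mirror image (cuDF can write it, but the
# reader cannot consume it).
```
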




### Physical types

| Data type | C++ | Java | Go | Rust |

@mhaseeb123 (Contributor, author) commented:

Simply removed one space in the Java column so all columns have a consistent width, for aesthetic purposes.

@alamb (Collaborator) commented on Feb 4, 2025:

> Certainly. The (R) label I used means the cuDF Parquet reader supports decompressing a codec, decoding an encoding type, or reading (and using) bloom filters, but the writer cannot compress, encode, or write the corresponding codec, encoding, or bloom filter, depending on the sub-section it appears in. Similarly, a (W) label means the opposite: the writer can write a certain field or feature, but the reader is unable to read or use it.
>
> Does that make sense?

Yes for sure -- I guess I was hoping for some sort of script / example data that I could use when filling this out for arrow-rs. Not required, I was just asking.

@mhaseeb123 (Contributor, author) replied:

> I guess I was hoping for some sort of script / example data that I could use when filling this out for arrow-rs. Not required, I was just asking.

We have relevant gtests and pytests in cudf for most, if not all, of the features, but collecting them along with input/output files wouldn't be feasible. Sorry!

@wgtmac (Member) left a comment:

+1

Thanks @mhaseeb123, and @bdice, @vuule, @etseidl, @alamb for the review!

@wgtmac merged commit 4557062 into apache:production on Feb 5, 2025
1 check passed

@alamb (Collaborator) commented on Feb 7, 2025:

🚀
