
Add implementation status for cuDF #99

Merged: 8 commits into apache:production on Feb 5, 2025

Conversation

@mhaseeb123 (Contributor) commented on Jan 29, 2025:

This PR adds the implementation status for cuDF to the Parquet site.

@mhaseeb123 changed the title from "Add implementation status for cuDF" to "🚧 Add implementation status for cuDF" on Jan 29, 2025
| External column data (1) | | | | | (W) |
| Row group "Sorting column" metadata (2) | | | | | (W) |
| Row group pruning using statistics | | | | | ✅ |
| Row group pruning using bloom filter | | | | | ✅ |

@mhaseeb123 (Contributor, author) commented:

Please correct me if I am wrong, but I believe the bloom filters are used to prune row groups rather than pages.
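
(For context, a minimal Python sketch of what row-group-level pruning looks like from the cuDF user side. The file name is made up, and the `row_group_size_rows` and `filters` arguments are assumptions about cuDF's Python API, not something taken from this PR.)

```python
import numpy as np
import cudf

# Write a table split into several row groups; the writer records per-row-group
# column statistics (min/max) in the file footer.
df = cudf.DataFrame({"id": np.arange(1_000_000)})
df.to_parquet("data.parquet", row_group_size_rows=100_000)

# With a filter predicate, row groups whose statistics (or bloom filter, when
# one is present for an equality predicate) cannot match are skipped before any
# of their pages are decoded -- pruning happens at row-group granularity.
subset = cudf.read_parquet("data.parquet", filters=[("id", "==", 123_456)])
print(len(subset))  # only data from row groups that could not be pruned
```
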

@mhaseeb123 requested a review from vuule on January 29, 2025 at 20:58
@mhaseeb123 marked this pull request as ready for review on January 30, 2025 at 23:26
@mhaseeb123 changed the title from "🚧 Add implementation status for cuDF" to "Add implementation status for cuDF" on Jan 30, 2025
@@ -13,94 +13,96 @@ implementations.
The value in each box means:
* ✅: supported
* ❌: not supported
* (R/W): partial reader/writer only support

@mhaseeb123 (Contributor, author) commented:

Added an extra item to the legend to allow partial reader-only or writer-only support. Happy to remove it and leave the corresponding boxes blank if needed.

@wgtmac (Member) left a comment:

Thanks for the update!

* (blank) no data

Implementations:
* `C++`: [parquet-cpp](https://github.com/apache/arrow/tree/main/cpp/src/parquet)
* `Java`: [parquet-java](https://github.com/apache/parquet-java)
* `Go`: [parquet-go](https://github.com/apache/arrow-go/tree/main/parquet)
* `Rust`: [parquet-rs](https://github.com/apache/arrow-rs/blob/main/parquet/README.md)
* `CUDA C++`: [cudf](https://github.com/rapidsai/cudf)

A Member commented:

Should this be cuDF? Or is CUDA C++ the more official name for it?

@mhaseeb123 (Contributor, author) commented on Feb 3, 2025:

cuDF is the name of the implementing dataframe library, and CUDA C++ is the language used for the implementation. Isn't the convention here like:

* `language`: [impl name](link)

A Member commented:

I would prefer cuDF here. I think the original intention was to include implementations governed by the Parquet community or the Apache Software Foundation. It would be better to use the library name to encourage other Parquet implementations to appear here. WDYT? @alamb

A Collaborator commented:

I also recall that the idea here was to list library names (so this would be better as cuDF), not languages.

It just so happens that we only had one example library for each language, so there was (before this PR) a 1:1 correspondence.

Does that make sense, @mhaseeb123?

@mhaseeb123 (Contributor, author) replied:

Sounds good. I will update this.

@wgtmac (Member) commented on Feb 3, 2025:

cc @etseidl @alamb

@etseidl (Contributor) left a comment:

Thanks for getting the party started @mhaseeb123!

@mhaseeb123 requested a review from etseidl on February 3, 2025 at 23:09

@etseidl (Contributor) left a comment:

Looks good now. Thanks!

@alamb (Collaborator) commented on Feb 4, 2025:

> This PR adds the implementation status for cuDF to the Parquet site.

AMAZING! Thank you @mhaseeb123

I wonder if you had any program / script / definition of what "support" means (mostly so I can crib / copy that and file a ticket in the arrow-rs repository to get this column filled out)

@mhaseeb123 (Contributor, author) replied:

> I wonder if you had any program / script / definition of what "support" means (mostly so I can crib / copy that and file a ticket in the arrow-rs repository to get this column filled out)

Certainly. The (R) label I used means the cuDF Parquet reader supports decompressing a codec, decoding an encoding type, or reading (and using) bloom filters, but the writer cannot compress, encode, or write the corresponding codec, encoding, or bloom filter, depending on the sub-section it appears in. Similarly, a (W) label means the opposite: the writer can write a certain field or feature, but the reader is unable to read or use it.

Does that make sense?
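
(As a hedged illustration of that definition, not part of the PR: a reader-only (R) entry could be spot-checked by producing a file with another engine and reading it back with cuDF. The file name and the ZSTD codec below are placeholders for whatever feature a given table row covers, not claims about cuDF's actual support matrix.)

```python
import pandas as pd
import cudf

# Another writer produces a file exercising the feature under test
# (ZSTD compression is used here purely as a stand-in example).
pd.DataFrame({"x": [1, 2, 3]}).to_parquet("feature.parquet", compression="zstd")

# (R): the cuDF reader can decompress/decode/use the feature when reading...
gdf = cudf.read_parquet("feature.parquet")
assert len(gdf) == 3

# ...while an (R)-only entry means the equivalent cuDF write path cannot
# produce that feature; (W) is the mirror image (cuDF can write it, but the
# reader cannot consume it).
```
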




### Physical types

| Data type | C++ | Java | Go | Rust |

@mhaseeb123 (Contributor, author) commented:

Simply removed one space in the Java column so all columns have a consistent width, for aesthetic purposes.

@alamb (Collaborator) commented on Feb 4, 2025:

> Certainly. The (R) label I used means the cuDF Parquet reader supports decompressing a codec, decoding an encoding type, or reading (and using) bloom filters, but the writer cannot compress, encode, or write the corresponding codec, encoding, or bloom filter, depending on the sub-section it appears in. Similarly, a (W) label means the opposite: the writer can write a certain field or feature, but the reader is unable to read or use it.
>
> Does that make sense?

Yes for sure -- I guess I was hoping for some sort of script / example data that I could use when filling this out for arrow-rs. Not required, I was just asking.

@mhaseeb123 (Contributor, author) replied:

> I guess I was hoping for some sort of script / example data that I could use when filling this out for arrow-rs. Not required, I was just asking.

We have relevant gtests and pytests in cudf for most, if not all, of the features, but collecting them along with input/output files wouldn't be feasible. Sorry!

@wgtmac (Member) left a comment:

+1

Thanks @mhaseeb123, and @bdice, @vuule, @etseidl, @alamb for the review!

@wgtmac merged commit 4557062 into apache:production on Feb 5, 2025
1 check passed

@alamb (Collaborator) commented on Feb 7, 2025:

🚀
