Add implementation status for cuDF #99
| External column data (1) | | | | | (W) |
| Row group "Sorting column" metadata (2) | | | | | (W) |
| Row group pruning using statistics | | | | | ✅ |
| Row group pruning using bloom filter | | | | | ✅ |
Please correct me if I am wrong, but I believe the bloom filters are used to prune row groups rather than pages.
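To make the granularity point concrete, here is a minimal sketch, assuming one filter per column chunk (that is, one per row group). `ToyBloomFilter` and `prune_row_groups` are hypothetical names, and the filter is a simplified stand-in rather than Parquet's split-block bloom filter format or cuDF's reader. It only shows that a filter built over a whole row group can rule that row group in or out, not individual pages inside it.

```python
import hashlib


class ToyBloomFilter:
    """Simplified Bloom filter (assumption: NOT Parquet's split-block format)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit set

    def _positions(self, value):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


def prune_row_groups(row_group_filters, probe):
    """Return indices of row groups that may contain `probe`."""
    return [i for i, f in enumerate(row_group_filters) if f.might_contain(probe)]


# Hypothetical data: the values of one column, split into three row groups.
row_groups = [["a", "b"], ["c", "d"], ["e", "f"]]

filters = []
for values in row_groups:
    bf = ToyBloomFilter()
    for v in values:
        bf.add(v)
    filters.append(bf)

# Row groups whose filter rejects the probe can be skipped entirely.
survivors = prune_row_groups(filters, "c")
```

Row group 1 always survives because a Bloom filter has no false negatives. Other row groups may occasionally survive as false positives, and those would still have to be read and filtered normally.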
Co-authored-by: Bradley Dice <[email protected]>
@@ -13,94 +13,96 @@ implementations.
The value in each box means:
* ✅: supported
* ❌: not supported
* (R/W): partial reader/writer only support
Added an extra entry to the legend to allow partial reader-only or writer-only support. Happy to remove it and leave the corresponding boxes blank if needed.
Thanks for the update!
* (blank) no data

Implementations:
* `C++`: [parquet-cpp](https://github.com/apache/arrow/tree/main/cpp/src/parquet)
* `Java`: [parquet-java](https://github.com/apache/parquet-java)
* `Go`: [parquet-go](https://github.com/apache/arrow-go/tree/main/parquet)
* `Rust`: [parquet-rs](https://github.com/apache/arrow-rs/blob/main/parquet/README.md)
* `CUDA C++`: [cudf](https://github.com/rapidsai/cudf)
Should this be `cuDF`? Or is `CUDA C++` a more official name for it?
`cuDF` is the name of the implementing dataframes library and `CUDA C++` is the language being used for implementation. Isn't the convention here like:
* `language`: [impl name](link)
I would prefer `cuDF` here. I think the original intention was to include implementations governed by the Parquet community or the Apache Software Foundation. It would be better to use the library name to encourage other Parquet implementations to appear here. WDYT? @alamb
I also recall that the idea here was to list library names (so this would be better as `cuDF`), not languages.
It just so happens that we only had one example library for each language, so there was (before this PR) a 1-1 correspondence.
Does that make sense @mhaseeb123?
Sounds good. I will update this.
Thanks for getting the party started @mhaseeb123!
Looks good now. Thanks!
AMAZING! Thank you @mhaseeb123. I wonder if you had any program / script / definition of what "support" means (mostly so I can crib / copy that and file a ticket in the arrow-rs repository to get this column filled out).
Certainly, the Does that make sense? |
### Physical types

| Data type | C++ | Java | Go | Rust |
Simply removed one space in the `Java` column so all cols have a consistent width for aesthetic purposes.
Yes, for sure -- I guess I was hoping for some sort of script / example data that I could use when filling this out for arrow-rs. Not required, I was just asking.
We have relevant gtests and pytests in cudf for most, if not all, of the features, but collecting them along with input/output files wouldn't be feasible. Sorry!
+1
Thanks @mhaseeb123 and @bdice @vuule @etseidl @alamb for the review!
🚀
This PR adds the implementation status for cuDF to the Parquet site.