Add implementation status for cuDF #99

Merged
merged 8 commits into from
Feb 5, 2025
140 changes: 71 additions & 69 deletions content/en/docs/File Format/implementationstatus.md
@@ -13,94 +13,96 @@ implementations.
The value in each box means:
* ✅: supported
* ❌: not supported
* (R/W): partial reader/writer support
* (blank) no data

Implementations:
* `C++`: [parquet-cpp](https://github.com/apache/arrow/tree/main/cpp/src/parquet)
* `Java`: [parquet-java](https://github.com/apache/parquet-java)
* `Go`: [parquet-go](https://github.com/apache/arrow-go/tree/main/parquet)
* `Rust`: [parquet-rs](https://github.com/apache/arrow-rs/blob/main/parquet/README.md)
* `CUDA`: [cudf](https://github.com/rapidsai/cudf)



### Physical types

| Data type | C++ | Java | Go | Rust |
Contributor Author: I simply removed one space in the Java column so that all columns have a consistent width, purely for aesthetics.

| ----------------------------------------- | ----- | ------ | ----- | ----- |
| BOOLEAN | | | | |
| INT32 | | | | |
| INT64 | | | | |
| INT96 (1) | | | | |
| FLOAT | | | | |
| DOUBLE | | | | |
| BYTE_ARRAY | | | | |
| FIXED_LEN_BYTE_ARRAY | | | | |
| Data type | C++ | Java | Go | Rust | CUDA |
| ----------------------------------------- | ----- | ------ | ----- | ----- | ----- |
| BOOLEAN | | | | | ✅ |
| INT32 | | | | | ✅ |
| INT64 | | | | | ✅ |
| INT96 (1) | | | | | ✅ |
| FLOAT | | | | | ✅ |
| DOUBLE | | | | | ✅ |
| BYTE_ARRAY | | | | | ✅ |
| FIXED_LEN_BYTE_ARRAY | | | | | ✅ |

* \(1) This type is deprecated, but as of 2024 it's common in currently produced parquet files


### Logical types

| Data type | C++ | Java | Go | Rust |
| ----------------------------------------- | ----- | ------ | ----- | ----- |
| STRING | | | | |
| ENUM | | | | |
| UUID | | | | |
| 8, 16, 32, 64 bit signed and unsigned INT | | | | |
| DECIMAL (INT32) | | | | |
| DECIMAL (INT64) | | | | |
| DECIMAL (BYTE_ARRAY) | | | | |
| DECIMAL (FIXED_LEN_BYTE_ARRAY) | | | | |
| DATE | | | | |
| TIME (INT32) | | | | |
| TIME (INT64) | | | | |
| TIMESTAMP (INT64) | | | | |
| INTERVAL | | | | |
| JSON | | | | |
| BSON | | | | |
| LIST | | | | |
| MAP | | | | |
| UNKNOWN (always null) | | | | |
| FLOAT16 | | | | |
| Data type | C++ | Java | Go | Rust | CUDA |
| ----------------------------------------- | ----- | ------ | ----- | ----- | ----- |
| STRING | | | | | ✅ |
| ENUM | | | | | ❌ |
| UUID | | | | | ❌ |
| 8, 16, 32, 64 bit signed and unsigned INT | | | | | ✅ |
| DECIMAL (INT32) | | | | | ✅ |
| DECIMAL (INT64) | | | | | ✅ |
| DECIMAL (BYTE_ARRAY) | | | | | ✅ |
| DECIMAL (FIXED_LEN_BYTE_ARRAY) | | | | | ✅ |
| DATE | | | | | ✅ |
| TIME (INT32) | | | | | ✅ |
| TIME (INT64) | | | | | ✅ |
| TIMESTAMP (INT64) | | | | | ✅ |
| INTERVAL | | | | | ❌ |
| JSON | | | | | ❌ |
| BSON | | | | | ❌ |
| LIST | | | | | ✅ |
| MAP | | | | | ✅ |
| UNKNOWN (always null) | | | | | ✅ |
| FLOAT16 | | | | | ✅ |

### Encodings

| Encoding | C++ | Java | Go | Rust |
| ----------------------------------------- | ----- | ------ | ----- | ----- |
| PLAIN | | | | |
| PLAIN_DICTIONARY | | | | |
| RLE_DICTIONARY | | | | |
| RLE | | | | |
| BIT_PACKED (deprecated) | | | | |
| DELTA_BINARY_PACKED | | | | |
| DELTA_LENGTH_BYTE_ARRAY | | | | |
| DELTA_BYTE_ARRAY | | | | |
| BYTE_STREAM_SPLIT | | | | |
| Encoding | C++ | Java | Go | Rust | CUDA |
| ----------------------------------------- | ----- | ------ | ----- | ----- | ----- |
| PLAIN | | | | | ✅ |
| PLAIN_DICTIONARY | | | | | ✅ |
| RLE_DICTIONARY | | | | | ✅ |
| RLE | | | | | ❌ |
| BIT_PACKED (deprecated) | | | | | ❌ |
| DELTA_BINARY_PACKED | | | | | ✅ |
| DELTA_LENGTH_BYTE_ARRAY | | | | | ✅ |
| DELTA_BYTE_ARRAY | | | | | ✅ |
| BYTE_STREAM_SPLIT | | | | | ✅ |

### Compressions

| Compression | C++ | Java | Go | Rust |
| ----------------------------------------- | ----- | ------ | ----- | ----- |
| UNCOMPRESSED | | | | |
| BROTLI | | | | |
| GZIP | | | | |
| LZ4 (deprecated) | | | | |
| LZ4_RAW | | | | |
| LZO | | | | |
| SNAPPY | | | | |
| ZSTD | | | | |
| Compression | C++ | Java | Go | Rust | CUDA |
| ----------------------------------------- | ----- | ------ | ----- | ----- | ----- |
| UNCOMPRESSED | | | | | ✅ |
| BROTLI | | | | | ❌ |
| GZIP | | | | | ❌ |
| LZ4 (deprecated) | | | | | ✅ |
| LZ4_RAW | | | | | ❌ |
| LZO | | | | | ❌ |
| SNAPPY | | | | | ✅ |
| ZSTD | | | | | ✅ |

### Other format level features

| | C++ | Java | Go | Rust |
| ----------------------------------------- | ----- | ------ | ----- | ----- |
| xxxHash-based bloom filters | | | | |
| Bloom filter length (1) | | | | |
| Statistics min_value, max_value | | | | |
| Page index | | | | |
| Page CRC32 checksum | | | | |
| Modular encryption | | | | |
| Size statistics (2) | | | | |
| | C++ | Java | Go | Rust | CUDA |
| ----------------------------------------- | ----- | ------ | ----- | ----- | ----- |
| xxxHash-based bloom filters | | | | | ✅ |
| Bloom filter length (1) | | | | | (R) |
| Statistics min_value, max_value | | | | | ✅ |
| Page index | | | | | ❌ |
| Page CRC32 checksum | | | | | ❌ |
| Modular encryption | | | | | ❌ |
| Size statistics (2) | | | | | ❌ |


* \(1) In parquet.thrift: ColumnMetaData->bloom_filter_length
@@ -109,14 +111,14 @@ Implementations:

### High level data APIs for Parquet feature usage

| Format | C++ | Java | Go | Rust |
| -------------------------------------------- | ----- | ------ | ----- | ----- |
| External column data (1) | | | | |
| Row group "Sorting column" metadata (2) | | | | |
| Row group pruning using statistics | | | | |
| Reading select columns only | | | | |
| Page pruning using statistics | | | | |
| Page pruning using bloom filter | | | | |
| Format | C++ | Java | Go | Rust | CUDA |
| -------------------------------------------- | ----- | ------ | ----- | ----- | ----- |
| External column data (1) | | | | | (W) |
| Row group "Sorting column" metadata (2) | | | | | (W) |
| Row group pruning using statistics | | | | | ✅ |
| Row group pruning using bloom filter | | | | | |
Contributor Author: Please correct me if I am wrong, but I believe bloom filters are used to prune row groups rather than pages.

| Reading select columns only | | | | | ✅ |
| Page pruning using statistics | | | | | ❌ |


* \(1) In parquet.thrift: ColumnChunk->file_path
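For readers unfamiliar with the "Row group pruning using statistics" row, the idea can be sketched roughly as follows. This is a minimal illustration in plain Python, not the API of cuDF or any other implementation listed here: `RowGroupStats` and `prune_row_groups` are hypothetical names, and the statistics are hard-coded rather than read from a file footer. A reader compares each row group's `min_value`/`max_value` against the query range and skips groups that cannot contain a match.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    """Hypothetical stand-in for one column chunk's min_value/max_value statistics."""
    row_group: int
    min_value: int
    max_value: int


def prune_row_groups(stats, lo, hi):
    """Return indices of row groups whose [min, max] range overlaps [lo, hi].

    Groups that do not overlap cannot contain matching rows and can be skipped
    without decoding any pages. Groups that do overlap may still contain no
    match, so they must be read and filtered.
    """
    return [s.row_group for s in stats if s.max_value >= lo and s.min_value <= hi]


# Illustrative statistics for three row groups of a single integer column.
stats = [
    RowGroupStats(row_group=0, min_value=1, max_value=100),
    RowGroupStats(row_group=1, min_value=101, max_value=200),
    RowGroupStats(row_group=2, min_value=201, max_value=300),
]

# Query: 150 <= value <= 220. Row group 0 is pruned.
print(prune_row_groups(stats, 150, 220))  # [1, 2]
```

Bloom-filter-based pruning follows the same shape, except that the per-row-group check is a probabilistic membership test for an equality predicate rather than a range comparison.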