Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add implementation status of javascript hyparquet #102

Merged
merged 2 commits into from
Feb 26, 2025

Conversation

platypii
Copy link
Contributor

Adds the JavaScript parquet implementation hyparquet to the support matrix. This library is read-only but has support for almost every parquet encoding and compression format.

I left a couple cells blank because I was unclear what the row was indicating exactly. For example for bloom filters, does this mean that the bloom filters are used for optimized querying? or just that the metadata is available to clients to use? Same with Size statistics.

Also happy to update if there's a better way to indicate a read-only implementation.

Disclosure: I am the primary author of hyparquet

Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your update! Really appreciate it.

I left a couple cells blank because I was unclear what the row was indicating exactly. For example for bloom filters, does this mean that the bloom filters are used for optimized querying? or just that the metadata is available to clients to use? Same with Size statistics.

I think it means that the implementation has read and/or write support to the metadata.

Also happy to update if there's a better way to indicate a read-only implementation.

What about using 🇼 and 🇷 respectively? I searched them from https://emojipedia.org.

@@ -22,91 +22,91 @@ Implementations:
* `Go`: [parquet-go](https://github.com/apache/arrow-go/tree/main/parquet)
* `Rust`: [parquet-rs](https://github.com/apache/arrow-rs/blob/main/parquet/README.md)
* `cuDF`: [cudf](https://github.com/rapidsai/cudf)

* `JavaScript`: [hyparquet](https://github.com/hyparam/hyparquet)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* `JavaScript`: [hyparquet](https://github.com/hyparam/hyparquet)
* `hyparquet`: [hyparquet](https://github.com/hyparam/hyparquet)

There was a similar discussion about the name: #99 (comment)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the list items besides cuDF put the language before the colon, not the library name. I'm happy to make changes but I think it would be nice if it mentioned the language somewhere. So that users who are working in a particular language might know what implementations to look at? Open to suggestions.

I would personally advocate for having the language listed in the top "Implementations" list, and then in the tables put the specific implementation name (since there could be multiple libraries in the same language).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option would be to remove the pre-colon and put language in parens after?

Also one could argue that the existing link names are wrong, since the implementation names are actually arrow, arrow-go, and arrow-rs. Should those names also be in table headers below instead of the languages?

This format would better support the case of multiple implementations per language:

Suggested change
* `JavaScript`: [hyparquet](https://github.com/hyparam/hyparquet)
* [arrow](https://github.com/apache/arrow/tree/main/cpp/src/parquet) (C++)
* [parquet-java](https://github.com/apache/parquet-java) (Java)
* [arrow-go](https://github.com/apache/arrow-go/tree/main/parquet) (Go)
* [arrow-rs](https://github.com/apache/arrow-rs/blob/main/parquet/README.md) (Rust)
* [cudf](https://github.com/rapidsai/cudf) (CUDA)
* [hyparquet](https://github.com/hyparam/hyparquet) (JavaScript)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer the 2nd option

@platypii
Copy link
Contributor Author

I think it means that the implementation has read and/or write support to the metadata.

For hyparquet there is a function parquetMetadata which returns the entire parquet metadata as json. So a user of the library could technically read the bloom filter metadata. However, the library does not yet use bloom filters for querying, which seems like the operation that people would actually care about. So I am unsure whether to check that cell or not.

@platypii
Copy link
Contributor Author

platypii commented Feb 21, 2025

What about using 🇼 and 🇷 respectively? I searched them from https://emojipedia.org.

I was following the (R) convention used in existing rows. I am happy to make this change but should I do it just for hyparquet, or should I also update the existing rows? Maybe a better as a separate PR? Here's what it would look like:
readonly

@wgtmac
Copy link
Member

wgtmac commented Feb 21, 2025

However, the library does not yet use bloom filters for querying, which seems like the operation that people would actually care about. So I am unsure whether to check that cell or not.

IMHO, if hyparquet is able to return a deserialized bloom filter from the file then it can be marked as (R).

Maybe a better as a separate PR?

Agreed

@wgtmac wgtmac merged commit c6af28b into apache:production Feb 26, 2025
1 check passed
@wgtmac
Copy link
Member

wgtmac commented Feb 26, 2025

I've merged this. Thanks @platypii and look forward to future update!

@platypii platypii deleted the javascript branch February 26, 2025 03:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants