[Datasets] Add PyArrow docs #1839
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

nice!
@@ -18,6 +18,7 @@ The table below summarizes the supported libraries and their level of integration
| [Polars](./datasets-polars) | A DataFrame library on top of an OLAP query engine. | ✅ | ✅ |
| [Spark](./datasets-spark) | Real-time, large-scale data processing tool in a distributed environment. | ✅ | ✅ |
| [WebDataset](./datasets-webdataset) | Library to write I/O pipelines for large datasets. | ✅ | ❌ |
| [PyArrow](./datasets-pyarrow) | Apache Arrow is a columnar format and a toolbox for fast data interchange and in-memory analytics. | ✅ | ✅ |
i think they were alphabetically ordered
```python
pq.write_table(table_test, "hf://datasets/username/my_dataset/test.parquet", use_content_defined_chunking=True)
```

We use `use_content_defined_chunking=True` to enable faster uploads and downloads from Hugging Face thanks to Xet deduplication (it requires `pyarrow>=21.0`).
do you think we could ask them to turn it on by default?
hard to say, wdyt @kszucs ?
Maybe in case of a `hf://` URI, but that could be fragile to implement since filesystem objects can be passed as well. I would rather consider the `flavor` argument: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html

```python
pq.write_table(table_test, "hf://datasets/username/my_dataset/test.parquet", flavor="huggingface")
```
```python
...
```

Since the dataset is in a [supported structure](https://huggingface.co/docs/hub/en/datasets-image#additional-columns) (a `metadata.parquet` file with a `file_name` field), you can save this dataset to Hugging Face and the Dataset Viewer shows both the metadata and images on Hugging Face.
```diff
-Since the dataset is in a [supported structure](https://huggingface.co/docs/hub/en/datasets-image#additional-columns) (a `metadata.parquet` file with a `file_name` field), you can save this dataset to Hugging Face and the Dataset Viewer shows both the metadata and images on Hugging Face.
+Since the dataset is in a [supported structure](https://huggingface.co/docs/hub/en/datasets-image#additional-columns) (a `metadata.parquet` file with a `file_name` field), you can save this dataset to Hugging Face and the Dataset Viewer shows both the metadata and images.
```
redundant? (not sure)
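For context, a minimal sketch of what such a `metadata.parquet` could contain (the file names and the `caption` column here are hypothetical, not from the PR):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical metadata: file_name holds paths relative to the repo root
table = pa.table({
    "file_name": ["images/0001.png", "images/0002.png"],
    "caption": ["a photo of a cat", "a photo of a dog"],
})
pq.write_table(table, "metadata.parquet")
```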
Very cool! Think this could also be helpful for people already heavily using Arrow for science datasets to understand how to get them on the Hub.
```
huggingface-cli login
```

Then you can [Create a dataset repository](/docs/huggingface_hub/quick-start#create-a-repository), for example using:
```diff
-Then you can [Create a dataset repository](/docs/huggingface_hub/quick-start#create-a-repository), for example using:
+Then you can [create a dataset repository](/docs/huggingface_hub/quick-start#create-a-repository), for example using:
```
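For reference, the repository creation step being discussed can look like this, a sketch using `huggingface_hub` (the repo id is a placeholder):

```python
from huggingface_hub import create_repo

# Placeholder repo id; requires being logged in (see `huggingface-cli login` above)
create_repo("username/my_dataset", repo_type="dataset")
```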
```
)
```

### Embed Audios inside Parquet
```diff
-### Embed Audios inside Parquet
+### Embed Audio inside Parquet
```
### Embed Audios inside Parquet

PyArrow has a binary type which allows to have the audios bytes in Arrow tables. Therefore it enables saving the dataset as one single Parquet file containing both the audios (bytes and path) and the samples metadata:
```diff
-PyArrow has a binary type which allows to have the audios bytes in Arrow tables. Therefore it enables saving the dataset as one single Parquet file containing both the audios (bytes and path) and the samples metadata:
+PyArrow has a binary type which allows for having audio bytes in Arrow tables. Therefore, it enables saving the dataset as one single Parquet file containing both the audio (bytes and path) and the samples metadata:
```
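As a sketch of the approach described here (the file paths and the `transcript` column are hypothetical):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical audio files: store the raw bytes together with the original path
paths = ["audio/0001.wav", "audio/0002.wav"]
table = pa.table({
    "audio": [{"bytes": open(p, "rb").read(), "path": p} for p in paths],
    "transcript": ["hello world", "good morning"],  # hypothetical metadata column
})
# The "audio" column is inferred as struct<bytes: binary, path: string>
pq.write_table(table, "data.parquet")
```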
```python
pq.write_table(table, "data.parquet", use_content_defined_chunking=True, row_group_size=100)
```

Setting the Audio type in the Arrow schema metadata allows other libraries and the Hugging Face Dataset Viewer to know that "audio" contains audios and not just binary data.
```diff
-Setting the Audio type in the Arrow schema metadata allows other libraries and the Hugging Face Dataset Viewer to know that "audio" contains audios and not just binary data.
+Setting the Audio type in the Arrow schema metadata enables other libraries and the Hugging Face Dataset Viewer to recognise that "audio" contains audio data, not just binary data.
```
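One way to attach that type information is through the `datasets` library's `Features` API, which serializes to Arrow schema metadata; a sketch (whether this matches the PR's exact snippet is an assumption):

```python
import pyarrow.parquet as pq
from datasets import Audio, Features, Value

# Assumes `table` is the pyarrow Table with "audio" and "transcript" columns from above
features = Features({"audio": Audio(), "transcript": Value("string")})
table = table.replace_schema_metadata(features.arrow_schema.metadata)
pq.write_table(table, "data.parquet", use_content_defined_chunking=True)
```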
To load a file from Hugging Face, the path needs to start with `hf://`. For example, the path to the [stanfordnlp/imdb](https://huggingface.co/datasets/stanfordnlp/imdb) dataset repository is `hf://datasets/stanfordnlp/imdb`. The dataset on Hugging Face contains multiple Parquet files. The Parquet file format is designed to make reading and writing data frames efficient, and to make sharing data across data analysis languages easy. Here is how to load the file `plain_text/train-00000-of-00001.parquet` as a pyarrow Table (it requires `pyarrow>=21.0`):

`` ```python ``
maybe

````diff
-```python
+```pycon
````

(see https://pygments.org/docs/lexers/#pygments.lexers.python.PythonConsoleLexer for example)
wait what Oo
I might keep it that way for now for consistency and update the whole docs later if it's really a thing :o
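For reference, the call this code block is about is along these lines (as documented in the PR; requires `pyarrow>=21.0`):

```python
import pyarrow.parquet as pq

# Read a single Parquet file straight from the Hub via the hf:// path
table = pq.read_table("hf://datasets/stanfordnlp/imdb/plain_text/train-00000-of-00001.parquet")
```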
```python
# Save to Parquet
# (Optional) with use_content_defined_chunking for faster uploads and downloads
# (Optional) with row_group_size to allow loading 100 images at a time
pq.write_table(table, "data.parquet", use_content_defined_chunking=True, row_group_size=100)
```
We shouldn't recommend small `row_group_size` values because such a small value effectively disables the benefits of parquet CDC.
`pyarrow` 21 will be out soon and has an official HF integration :) It also includes Parquet CDC for efficient Xet deduplication for datasets.
cc @kszucs