[Datasets] Add PyArrow docs #1839

Open · wants to merge 1 commit into main

Conversation

lhoestq (Member) commented on Jul 15, 2025:

pyarrow 21 will be out soon and has an official HF integration :)

It also includes Parquet CDC for efficient Xet deduplication for datasets

cc @kszucs
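
In practice the integration means Parquet files in Hub dataset repos can be read and written directly through `hf://` paths; a minimal sketch (placeholder repo name; content-defined chunking needs `pyarrow>=21.0`):

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"text": ["hello", "world"]})

# Content-defined chunking lets Xet deduplicate unchanged chunks
# across uploads (requires pyarrow>=21.0)
pq.write_table(table, "hf://datasets/username/my_dataset/train.parquet",
               use_content_defined_chunking=True)
```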

lhoestq requested a review from davanstrien on July 15, 2025 at 16:20
@HuggingFaceDocBuilderDev commented:

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

julien-c (Member) commented:

nice!

@@ -18,6 +18,7 @@ The table below summarizes the supported libraries and their level of integration
| [Polars](./datasets-polars) | A DataFrame library on top of an OLAP query engine. | ✅ | ✅ |
| [Spark](./datasets-spark) | Real-time, large-scale data processing tool in a distributed environment. | ✅ | ✅ |
| [WebDataset](./datasets-webdataset) | Library to write I/O pipelines for large datasets. | ✅ | ❌ |
| [PyArrow](./datasets-pyarrow) | Apache Arrow is a columnar format and a toolbox for fast data interchange and in-memory analytics. | ✅ | ✅ |
Member commented:

i think they were alphabetically ordered

```python
pq.write_table(table_test, "hf://datasets/username/my_dataset/test.parquet", use_content_defined_chunking=True)
```

We use `use_content_defined_chunking=True` to enable faster uploads and downloads from Hugging Face thanks to Xet deduplication (it requires `pyarrow>=21.0`).
Member commented:

do you think we could ask them to turn it on by default?

Member Author commented:

hard to say, wdyt @kszucs?

Member commented:

Maybe in the case of an `hf://` URI, but that could be fragile to implement since filesystem objects can be passed as well.
I would rather consider the `flavor` argument: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html

```python
pq.write_table(table_test, "hf://datasets/username/my_dataset/test.parquet", flavor="huggingface")
```

```python
...
```

Since the dataset is in a [supported structure](https://huggingface.co/docs/hub/en/datasets-image#additional-columns) (a `metadata.parquet` file with a `file_name` field), you can save this dataset to Hugging Face and the Dataset Viewer shows both the metadata and images on Hugging Face.
Member commented:

Suggested change
Since the dataset is in a [supported structure](https://huggingface.co/docs/hub/en/datasets-image#additional-columns) (a `metadata.parquet` file with a `file_name` field), you can save this dataset to Hugging Face and the Dataset Viewer shows both the metadata and images on Hugging Face.
Since the dataset is in a [supported structure](https://huggingface.co/docs/hub/en/datasets-image#additional-columns) (a `metadata.parquet` file with a `file_name` field), you can save this dataset to Hugging Face and the Dataset Viewer shows both the metadata and images.

redundant? (not sure)

davanstrien (Member) left a comment:

Very cool! Think this could also be helpful for people already heavily using Arrow for science datasets to understand how to get them on the Hub.

```bash
huggingface-cli login
```

Then you can [Create a dataset repository](/docs/huggingface_hub/quick-start#create-a-repository), for example using:
Member commented:

Suggested change
Then you can [Create a dataset repository](/docs/huggingface_hub/quick-start#create-a-repository), for example using:
Then you can [create a dataset repository](/docs/huggingface_hub/quick-start#create-a-repository), for example using:
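
The example code is cut off in this excerpt; a minimal sketch of what creating the repo with `huggingface_hub` can look like (the repo name is a placeholder):

```python
from huggingface_hub import create_repo

# Create an empty dataset repo on the Hub
# ("username/my_dataset" is a placeholder)
create_repo("username/my_dataset", repo_type="dataset")
```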

```python
)
```

### Embed Audios inside Parquet
Member commented:

Suggested change
### Embed Audios inside Parquet
### Embed Audio inside Parquet


### Embed Audios inside Parquet

PyArrow has a binary type which allows to have the audios bytes in Arrow tables. Therefore it enables saving the dataset as one single Parquet file containing both the audios (bytes and path) and the samples metadata:
Member commented:

Suggested change
PyArrow has a binary type which allows to have the audios bytes in Arrow tables. Therefore it enables saving the dataset as one single Parquet file containing both the audios (bytes and path) and the samples metadata:
PyArrow has a binary type which allows for having audio bytes in Arrow tables. Therefore, it enables saving the dataset as one single Parquet file containing both the audio (bytes and path) and the samples metadata:
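
For reference, a minimal sketch of the pattern being documented (file paths and the metadata column are illustrative; the bytes/path struct mirrors the "audios (bytes and path)" layout described above):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Read raw audio bytes from local files (paths are illustrative)
paths = ["audio/sample1.wav", "audio/sample2.wav"]
audio = [{"bytes": open(p, "rb").read(), "path": p} for p in paths]

# One row per sample: audio struct (bytes + path) next to its metadata
table = pa.table({
    "audio": pa.array(audio, type=pa.struct({"bytes": pa.binary(), "path": pa.string()})),
    "label": ["dog", "cat"],  # illustrative metadata column
})
pq.write_table(table, "data.parquet", use_content_defined_chunking=True)
```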

```python
pq.write_table(table, "data.parquet", use_content_defined_chunking=True, row_group_size=100)
```

Setting the Audio type in the Arrow schema metadata allows other libraries and the Hugging Face Dataset Viewer to know that "audio" contains audios and not just binary data.
Member commented:

Suggested change
Setting the Audio type in the Arrow schema metadata allows other libraries and the Hugging Face Dataset Viewer to know that "audio" contains audios and not just binary data.
Setting the Audio type in the Arrow schema metadata enables other libraries and the Hugging Face Dataset Viewer to recognise that "audio" contains audio data, not just binary data.
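
A sketch of what setting that type can look like, reusing `table` from the sketch above and assuming the `huggingface` schema-metadata key that HF libraries use to serialize features (the exact JSON layout is an assumption here):

```python
import json

# Mark the "audio" column as an Audio feature via Arrow schema metadata
# (assumed layout, mirroring how the datasets library stores features)
features = {"audio": {"_type": "Audio"}}
table = table.replace_schema_metadata(
    {"huggingface": json.dumps({"info": {"features": features}})}
)
```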


To load a file from Hugging Face, the path needs to start with `hf://`. For example, the path to the [stanfordnlp/imdb](https://huggingface.co/datasets/stanfordnlp/imdb) dataset repository is `hf://datasets/stanfordnlp/imdb`. The dataset on Hugging Face contains multiple Parquet files. The Parquet file format is designed to make reading and writing data frames efficient, and to make sharing data across data analysis languages easy. Here is how to load the file `plain_text/train-00000-of-00001.parquet` as a pyarrow Table (it requires `pyarrow>=21.0`):

`` ```python ``
Collaborator commented:

maybe

Suggested change: `` ```python `` → `` ```pycon ``

(see https://pygments.org/docs/lexers/#pygments.lexers.python.PythonConsoleLexer for example)

Member Author commented:

wait what Oo

I might keep it that way for now for consistency and update the whole docs later if it's really a thing :o
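
For context, the loading step described above looks roughly like this (path taken from the surrounding docs; needs `pyarrow>=21.0` for `hf://` support):

```python
import pyarrow.parquet as pq

# Load a single Parquet file from the Hub as a pyarrow Table (pyarrow>=21.0)
table = pq.read_table("hf://datasets/stanfordnlp/imdb/plain_text/train-00000-of-00001.parquet")
print(table.num_rows)
```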

```python
# Save to Parquet
# (Optional) with use_content_defined_chunking for faster uploads and downloads
# (Optional) with row_group_size to allow loading 100 images at a time
pq.write_table(table, "data.parquet", use_content_defined_chunking=True, row_group_size=100)
```
Member commented:

We shouldn't recommend small row_group_size values because such a small value effectively disables the benefits of parquet CDC.
