
Commit 5dc1a17

klamike and lhoestq authored

Document HDF5 support (#7740)

* init docs
* update
* Update loading_methods.mdx

Co-authored-by: Quentin Lhoest <[email protected]>

1 parent b5b1ba0 commit 5dc1a17

File tree

4 files changed: +30 −1 lines changed

- README.md
- docs/source/loading.mdx
- docs/source/package_reference/loading_methods.mdx
- docs/source/tabular_load.mdx

README.md

Lines changed: 1 addition & 1 deletion

@@ -21,7 +21,7 @@
  🤗 Datasets is a lightweight library providing **two** main features:

  - **one-line dataloaders for many public datasets**: one-liners to download and pre-process any of the ![number of datasets](https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/datasets&color=brightgreen) major public datasets (image datasets, audio datasets, text datasets in 467 languages and dialects, etc.) provided on the [HuggingFace Datasets Hub](https://huggingface.co/datasets). With a simple command like `squad_dataset = load_dataset("rajpurkar/squad")`, get any of these datasets ready to use in a dataloader for training/evaluating a ML model (Numpy/Pandas/PyTorch/TensorFlow/JAX),
- - **efficient data pre-processing**: simple, fast and reproducible data pre-processing for the public datasets as well as your own local datasets in CSV, JSON, text, PNG, JPEG, WAV, MP3, Parquet, etc. With simple commands like `processed_dataset = dataset.map(process_example)`, efficiently prepare the dataset for inspection and ML model evaluation and training.
+ - **efficient data pre-processing**: simple, fast and reproducible data pre-processing for the public datasets as well as your own local datasets in CSV, JSON, text, PNG, JPEG, WAV, MP3, Parquet, HDF5, etc. With simple commands like `processed_dataset = dataset.map(process_example)`, efficiently prepare the dataset for inspection and ML model evaluation and training.

  [🎓 **Documentation**](https://huggingface.co/docs/datasets/) [🔎 **Find a dataset in the Hub**](https://huggingface.co/datasets) [🌟 **Share a dataset on the Hub**](https://huggingface.co/docs/datasets/share)
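For readers skimming the diff, the `dataset.map` pattern mentioned in the changed bullet looks roughly like this. A minimal sketch, not part of the commit; the file name `data.csv`, the `text` column, and the body of `process_example` are placeholders for the example:

```py
from datasets import load_dataset

# Load a local file with one of the generic loaders the bullet lists.
dataset = load_dataset("csv", data_files="data.csv", split="train")

# A toy pre-processing function; it assumes the file has a "text" column.
def process_example(example):
    example["text"] = example["text"].lower()
    return example

# map() applies the function to every row and caches the result.
processed_dataset = dataset.map(process_example)
```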

docs/source/loading.mdx

Lines changed: 11 additions & 0 deletions

@@ -178,6 +178,17 @@ The cache directory to store intermediate processing results will be the Arrow f

  For now only the Arrow streaming format is supported. The Arrow IPC file format (also known as Feather V2) is not supported.

+ ## HDF5 files
+
+ [HDF5](https://www.hdfgroup.org/solutions/hdf5/) files are commonly used for storing large amounts of numerical data in scientific computing and machine learning. Loading HDF5 files with 🤗 Datasets is similar to loading CSV files:
+
+ ```py
+ >>> from datasets import load_dataset
+ >>> dataset = load_dataset("hdf5", data_files="data.h5")
+ ```
+
+ Note that the HDF5 loader assumes the file has a "tabular" structure, i.e. that all datasets in the file have the same number of rows along their first dimension.
+
  ### SQL

  Read database contents with [`~datasets.Dataset.from_sql`] by specifying the URI to connect to your database. You can read both table names and queries:
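As context for the new section, a file matching the loader's "tabular" assumption can be produced with `h5py`. A minimal sketch, not part of this commit; the file name `data.h5` and the dataset names `images` and `labels` are placeholders:

```py
import h5py
import numpy as np

# Every dataset in the file shares the same first dimension (100 rows here),
# which is what the loader's "tabular" assumption requires.
with h5py.File("data.h5", "w") as f:
    f.create_dataset("images", data=np.random.rand(100, 32, 32))
    f.create_dataset("labels", data=np.random.randint(0, 10, size=100))
```

Loading this file with `load_dataset("hdf5", data_files="data.h5")` should then yield a dataset with 100 rows.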

docs/source/package_reference/loading_methods.mdx

Lines changed: 6 additions & 0 deletions

@@ -91,6 +91,12 @@ load_dataset("csv", data_dir="path/to/data/dir", sep="\t")

  [[autodoc]] datasets.packaged_modules.videofolder.VideoFolder

+ ### HDF5
+
+ [[autodoc]] datasets.packaged_modules.hdf5.HDF5Config
+
+ [[autodoc]] datasets.packaged_modules.hdf5.HDF5
+
  ### Pdf

  [[autodoc]] datasets.packaged_modules.pdffolder.PdfFolderConfig

docs/source/tabular_load.mdx

Lines changed: 12 additions & 0 deletions

@@ -4,6 +4,7 @@ A tabular dataset is a generic dataset used to describe any data stored in rows

  - CSV files
  - Pandas DataFrames
+ - HDF5 files
  - Databases

  ## CSV files
@@ -63,6 +64,17 @@ Use the `splits` parameter to specify the name of the dataset split:

  If the dataset doesn't look as expected, you should explicitly [specify your dataset features](loading#specify-features). A [pandas.Series](https://pandas.pydata.org/docs/reference/api/pandas.Series.html) may not always carry enough information for Arrow to automatically infer a data type. For example, if a DataFrame is of length `0` or if the Series only contains `None/NaN` objects, the type is set to `null`.

+ ## HDF5 files
+
+ [HDF5](https://www.hdfgroup.org/solutions/hdf5/) files are commonly used for storing large amounts of numerical data in scientific computing and machine learning. Loading HDF5 files with 🤗 Datasets is similar to loading CSV files:
+
+ ```py
+ >>> from datasets import load_dataset
+ >>> dataset = load_dataset("hdf5", data_files="data.h5")
+ ```
+
+ Note that the HDF5 loader assumes the file has a "tabular" structure, i.e. that all datasets in the file have the same number of rows along their first dimension.
+
  ## Databases

  Datasets stored in databases are typically accessed with SQL queries. With 🤗 Datasets, you can connect to a database, query for the data you need, and create a dataset out of it. Then you can use all the processing features of 🤗 Datasets to prepare your dataset for training.
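This file adds the same snippet as `loading.mdx`, so one further note for this page: like the other tabular loaders, `data_files` also accepts a mapping from split names to files. A minimal sketch with placeholder file names:

```py
from datasets import load_dataset

# "train.h5" and "test.h5" are placeholder names for this illustration;
# each split is read from its own HDF5 file.
dataset = load_dataset("hdf5", data_files={"train": "train.h5", "test": "test.h5"})
```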

0 commit comments