
Commit 5dc1a17

klamike and lhoestq authored

Document HDF5 support (#7740)

* init docs
* update
* Update loading_methods.mdx

Co-authored-by: Quentin Lhoest <[email protected]>

1 parent b5b1ba0 commit 5dc1a17

File tree

4 files changed: +30 −1 lines changed

- README.md
- docs/source/loading.mdx
- docs/source/package_reference/loading_methods.mdx
- docs/source/tabular_load.mdx

README.md

Lines changed: 1 addition & 1 deletion

@@ -21,7 +21,7 @@
  🤗 Datasets is a lightweight library providing **two** main features:

  - **one-line dataloaders for many public datasets**: one-liners to download and pre-process any of the ![number of datasets](https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/datasets&color=brightgreen) major public datasets (image datasets, audio datasets, text datasets in 467 languages and dialects, etc.) provided on the [HuggingFace Datasets Hub](https://huggingface.co/datasets). With a simple command like `squad_dataset = load_dataset("rajpurkar/squad")`, get any of these datasets ready to use in a dataloader for training/evaluating a ML model (Numpy/Pandas/PyTorch/TensorFlow/JAX),
- - **efficient data pre-processing**: simple, fast and reproducible data pre-processing for the public datasets as well as your own local datasets in CSV, JSON, text, PNG, JPEG, WAV, MP3, Parquet, etc. With simple commands like `processed_dataset = dataset.map(process_example)`, efficiently prepare the dataset for inspection and ML model evaluation and training.
+ - **efficient data pre-processing**: simple, fast and reproducible data pre-processing for the public datasets as well as your own local datasets in CSV, JSON, text, PNG, JPEG, WAV, MP3, Parquet, HDF5, etc. With simple commands like `processed_dataset = dataset.map(process_example)`, efficiently prepare the dataset for inspection and ML model evaluation and training.

  [🎓 **Documentation**](https://huggingface.co/docs/datasets/) [🔎 **Find a dataset in the Hub**](https://huggingface.co/datasets) [🌟 **Share a dataset on the Hub**](https://huggingface.co/docs/datasets/share)
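For readers skimming the diff, the `dataset.map` pattern mentioned in the changed bullet looks roughly like this. A minimal sketch, not part of the commit; the file name `data.csv`, the `text` column, and the body of `process_example` are placeholders for the example:

```py
from datasets import load_dataset

# Load a local file with one of the generic loaders the bullet lists.
dataset = load_dataset("csv", data_files="data.csv", split="train")

# A toy pre-processing function; it assumes the file has a "text" column.
def process_example(example):
    example["text"] = example["text"].lower()
    return example

# map() applies the function to every row and caches the result.
processed_dataset = dataset.map(process_example)
```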

docs/source/loading.mdx

Lines changed: 11 additions & 0 deletions

@@ -178,6 +178,17 @@ The cache directory to store intermediate processing results will be the Arrow f

  For now only the Arrow streaming format is supported. The Arrow IPC file format (also known as Feather V2) is not supported.

+ ## HDF5 files
+
+ [HDF5](https://www.hdfgroup.org/solutions/hdf5/) files are commonly used for storing large amounts of numerical data in scientific computing and machine learning. Loading HDF5 files with 🤗 Datasets is similar to loading CSV files:
+
+ ```py
+ >>> from datasets import load_dataset
+ >>> dataset = load_dataset("hdf5", data_files="data.h5")
+ ```
+
+ Note that the HDF5 loader assumes the file has a "tabular" structure, i.e. that all datasets in the file have the same number of rows along their first dimension.
+
  ### SQL

  Read database contents with [`~datasets.Dataset.from_sql`] by specifying the URI to connect to your database. You can read both table names and queries:
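As context for the new section, a file matching the loader's "tabular" assumption can be produced with `h5py`. A minimal sketch, not part of this commit; the file name `data.h5` and the dataset names `images` and `labels` are placeholders:

```py
import h5py
import numpy as np

# Every dataset in the file shares the same first dimension (100 rows here),
# which is what the loader's "tabular" assumption requires.
with h5py.File("data.h5", "w") as f:
    f.create_dataset("images", data=np.random.rand(100, 32, 32))
    f.create_dataset("labels", data=np.random.randint(0, 10, size=100))
```

Loading this file with `load_dataset("hdf5", data_files="data.h5")` should then yield a dataset with 100 rows.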

docs/source/package_reference/loading_methods.mdx

Lines changed: 6 additions & 0 deletions

@@ -91,6 +91,12 @@ load_dataset("csv", data_dir="path/to/data/dir", sep="\t")

  [[autodoc]] datasets.packaged_modules.videofolder.VideoFolder

+ ### HDF5
+
+ [[autodoc]] datasets.packaged_modules.hdf5.HDF5Config
+
+ [[autodoc]] datasets.packaged_modules.hdf5.HDF5
+
  ### Pdf

  [[autodoc]] datasets.packaged_modules.pdffolder.PdfFolderConfig

docs/source/tabular_load.mdx

Lines changed: 12 additions & 0 deletions

@@ -4,6 +4,7 @@ A tabular dataset is a generic dataset used to describe any data stored in rows

  - CSV files
  - Pandas DataFrames
+ - HDF5 files
  - Databases

  ## CSV files
@@ -63,6 +64,17 @@ Use the `splits` parameter to specify the name of the dataset split:

  If the dataset doesn't look as expected, you should explicitly [specify your dataset features](loading#specify-features). A [pandas.Series](https://pandas.pydata.org/docs/reference/api/pandas.Series.html) may not always carry enough information for Arrow to automatically infer a data type. For example, if a DataFrame is of length `0` or if the Series only contains `None/NaN` objects, the type is set to `null`.

+ ## HDF5 files
+
+ [HDF5](https://www.hdfgroup.org/solutions/hdf5/) files are commonly used for storing large amounts of numerical data in scientific computing and machine learning. Loading HDF5 files with 🤗 Datasets is similar to loading CSV files:
+
+ ```py
+ >>> from datasets import load_dataset
+ >>> dataset = load_dataset("hdf5", data_files="data.h5")
+ ```
+
+ Note that the HDF5 loader assumes the file has a "tabular" structure, i.e. that all datasets in the file have the same number of rows along their first dimension.
+
  ## Databases

  Datasets stored in databases are typically accessed with SQL queries. With 🤗 Datasets, you can connect to a database, query for the data you need, and create a dataset out of it. Then you can use all the processing features of 🤗 Datasets to prepare your dataset for training.
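This file adds the same snippet as `loading.mdx`, so one further note for this page: like the other tabular loaders, `data_files` also accepts a mapping from split names to files. A minimal sketch with placeholder file names:

```py
from datasets import load_dataset

# "train.h5" and "test.h5" are placeholder names for this illustration;
# each split is read from its own HDF5 file.
dataset = load_dataset("hdf5", data_files={"train": "train.h5", "test": "test.h5"})
```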

0 commit comments