Integration with Huggingface datasets #4962

aleSuglia · 2021-02-05T17:34:45Z

Huggingface Datasets has steadily gained popularity over the last few months, and it has a very simple API for accessing the most common NLP datasets. In addition, it has the potential to support multi-modal datasets as well (see related issue). At the moment, AllenNLP users integrate datasets by downloading them manually and specifying the path to the dataset in the configuration file. This works most of the time, but it doesn't guarantee complete transparency in the training process.

Based on this issue, I was wondering whether AllenNLP could support this library, so that it can also take advantage of its caching functionality. I'm aware that AllenNLP has its own way of handling tokenization and indexing, but I still believe a common entry point for dataset creation would be very handy, as well as elegant from a reproducibility point of view.

Any thoughts about this idea?

Thanks,
Alessandro

epwalsh · 2021-02-05T18:03:01Z

It would be great to add support for Datasets. I was thinking about this a while ago and then it kind of fell off the map. I'm not sure yet how we'd integrate it, but I'm thinking it would either be through a new DatasetReader or DataLoader class that wraps it.
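To make the "DatasetReader that wraps it" option a bit more concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: `WrappingReader`, `load_fn`, `fake_load`, and the dict-based "instances" are hypothetical stand-ins, not AllenNLP or Datasets APIs. In a real integration, `load_fn` would presumably be a thin wrapper around `datasets.load_dataset`, and `text_to_instance` would build an AllenNLP `Instance`.

```python
from typing import Any, Callable, Dict, Iterable, Iterator

Row = Dict[str, Any]


class WrappingReader:
    """Hypothetical reader that delegates loading to an injected function."""

    def __init__(self, load_fn: Callable[[str, str], Iterable[Row]]) -> None:
        # Assumed interface: load_fn(dataset_name, split) -> iterable of row dicts.
        self._load_fn = load_fn

    def read(self, dataset_name: str, split: str) -> Iterator[Row]:
        # Yield one "instance" per row lazily, so the dataset is never
        # materialized by the reader itself.
        for row in self._load_fn(dataset_name, split):
            yield self.text_to_instance(row)

    def text_to_instance(self, row: Row) -> Row:
        # Stand-in for building a real instance: here we only copy the row.
        return dict(row)


def fake_load(dataset_name: str, split: str) -> Iterable[Row]:
    # In-memory stand-in for a real dataset download (hypothetical data).
    data = {
        "train": [
            {"text": "good movie", "label": 1},
            {"text": "bad movie", "label": 0},
        ]
    }
    return data[split]


reader = WrappingReader(fake_load)
instances = list(reader.read("toy", "train"))
```

Injecting the loader keeps the reader independent of any one dataset library, which would also leave room for other backends.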

dirkgr · 2021-02-12T00:42:47Z

Same for TensorFlow Datasets. TFDS datasets have a schema, so we could automatically read it into TextField, LabelField, and so on.
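A rough sketch of what schema-driven conversion could look like, in plain Python. The schema format, `KIND_TO_FIELD`, and `row_to_fields` are all hypothetical; the real TFDS and Huggingface schema objects are richer than a dict of strings, and the field names here are only labels standing in for the actual AllenNLP `TextField` and `LabelField` classes.

```python
from typing import Any, Dict, Tuple

# Hypothetical, simplified schema: feature name -> feature kind.
SCHEMA = {"sentence": "text", "label": "class_label"}

# Assumed mapping from feature kind to the name of the field type it would become.
KIND_TO_FIELD = {"text": "TextField", "class_label": "LabelField"}


def row_to_fields(row: Dict[str, Any], schema: Dict[str, str]) -> Dict[str, Tuple[str, Any]]:
    """Pick a field type for every feature in the row based on the schema."""
    fields: Dict[str, Tuple[str, Any]] = {}
    for name, kind in schema.items():
        if kind not in KIND_TO_FIELD:
            # Unknown feature kinds need explicit handling rather than a silent guess.
            raise ValueError(f"no field mapping for feature kind {kind!r}")
        value = row[name]
        if kind == "text":
            # Whitespace splitting is only a placeholder for a real tokenizer.
            value = value.split()
        fields[name] = (KIND_TO_FIELD[kind], value)
    return fields


fields = row_to_fields({"sentence": "a fine film", "label": "positive"}, SCHEMA)
```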

divijbajaj · 2021-03-31T11:46:43Z

I'm trying to add a DatasetReader that can generically create instances from the Huggingface datasets interface.
It will have limitations and may not work for all datasets; in those cases, a subclass with selective overrides can be added to fill the gaps.
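The generic-reader-plus-overrides idea could be sketched roughly as follows. This is not the code from any PR; `GenericReader`, `SpanDatasetReader`, `convert_feature`, and the tuple-based fields are hypothetical names made up for illustration, written in plain Python without AllenNLP.

```python
from typing import Any, Dict


class GenericReader:
    """Hypothetical generic reader: handles only string and int features."""

    def convert_feature(self, name: str, value: Any) -> Any:
        if isinstance(value, str):
            return ("text", value.split())  # placeholder tokenization
        if isinstance(value, int):
            return ("label", value)
        # The known limitation: anything else is not handled generically.
        raise NotImplementedError(f"no generic handling for feature {name!r}")

    def text_to_instance(self, row: Dict[str, Any]) -> Dict[str, Any]:
        return {name: self.convert_feature(name, value) for name, value in row.items()}


class SpanDatasetReader(GenericReader):
    """Subclass that overrides only the one feature the parent cannot handle."""

    def convert_feature(self, name: str, value: Any) -> Any:
        if name == "span":
            start, end = value
            return ("span", (start, end))
        # Everything else falls back to the generic behavior.
        return super().convert_feature(name, value)


row = {"text": "Paris is nice", "label": 1, "span": [0, 1]}

generic_failed = False
try:
    GenericReader().text_to_instance(row)
except NotImplementedError:
    generic_failed = True  # the generic reader cannot handle the span feature

instance = SpanDatasetReader().text_to_instance(row)
```

The subclass stays small because it only intercepts the feature that needs special treatment and delegates the rest.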

epwalsh · 2021-03-31T16:21:55Z

@divijbajaj great! Looking forward to seeing what you come up with.

ghost · 2021-04-04T17:18:04Z

@epwalsh I've opened a draft PR with slightly half-baked but functional code that @divijbajaj and I wrote last week. It should give a rough idea of the direction. I'd appreciate a high-level review if time permits.

aleSuglia added the Feature request label Feb 5, 2021

dirkgr modified the milestones: 1.4, 2.1 Feb 12, 2021

dirkgr modified the milestones: 2.1, 2.2 Feb 22, 2021

ghost linked a pull request Apr 4, 2021 that will close this issue

Add HuggingfaceDatasetReader for using Huggingface datasets #5095

Open

dirkgr mentioned this issue May 10, 2021

Adds Huggingface Dataset Reader #5194

Closed
