This repository has been archived by the owner on Dec 16, 2022. It is now read-only.
Hugging Face Datasets has gained a lot of popularity over the last few months, and it has a very simple API for accessing the most common NLP datasets. In addition, it has the potential to support multi-modal datasets as well (see related issue). At the moment, AllenNLP integrates datasets by having the user download them manually and specify the path to the dataset in the configuration file. This works most of the time but doesn't guarantee complete transparency in the training process.
Based on this issue, I was wondering whether it would be possible to support this library so that AllenNLP can also take advantage of its caching functionality. I'm aware that AllenNLP has its own way of handling tokenization and indexing, but I still believe a common entry point for dataset creation would be very handy, as well as elegant from a reproducibility point of view.
Any thoughts about this idea?
Thanks,
Alessandro
It would be great to add support for Datasets. I was thinking about this a while ago and then it kind of fell off the map. I'm not sure yet how we'd integrate it, but I'm thinking it would either be through a new DatasetReader or DataLoader class that wraps it.
I'm trying to add a DatasetReader that can generically create instances from the Hugging Face Datasets interface.
It will have limitations and may not work for all datasets; in that case, a subclass with selective overrides can be added to fill the gaps.
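To make the idea concrete, here is a minimal sketch of the pattern, not the code from the draft PR. The names `HuggingfaceDatasetReader`, `row_to_instance`, `WhitespaceTokenizingReader`, and the injectable `loader` argument are hypothetical, and `Instance` is a plain stand-in for AllenNLP's instance type so the snippet runs without AllenNLP installed. The only real library call is `datasets.load_dataset`, which is imported lazily and used when no loader is supplied.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable, Iterator, Mapping, Optional


@dataclass
class Instance:
    """Hypothetical stand-in for an AllenNLP instance: a bag of named fields."""
    fields: Dict[str, Any]


# A loader takes (dataset name, config name, split) and returns an iterable of rows.
Loader = Callable[[str, Optional[str], str], Iterable[Mapping[str, Any]]]


class HuggingfaceDatasetReader:
    """Hypothetical generic reader that yields one Instance per dataset row."""

    def __init__(self, dataset_name: str, config_name: Optional[str] = None,
                 loader: Optional[Loader] = None) -> None:
        self.dataset_name = dataset_name
        self.config_name = config_name
        self.loader = loader  # injectable so the reader can be tested offline

    def _load(self, split: str) -> Iterable[Mapping[str, Any]]:
        if self.loader is not None:
            return self.loader(self.dataset_name, self.config_name, split)
        # Imported lazily so the sketch does not require `datasets` up front.
        from datasets import load_dataset
        return load_dataset(self.dataset_name, self.config_name, split=split)

    def row_to_instance(self, row: Mapping[str, Any]) -> Instance:
        # Generic default: copy every column into a field unchanged.
        return Instance(fields=dict(row))

    def read(self, split: str) -> Iterator[Instance]:
        for row in self._load(split):
            yield self.row_to_instance(row)


class WhitespaceTokenizingReader(HuggingfaceDatasetReader):
    """Hypothetical subclass overriding only the row conversion step."""

    def row_to_instance(self, row: Mapping[str, Any]) -> Instance:
        fields = dict(row)
        # Assumes the dataset has a "text" column; other columns pass through.
        fields["text"] = str(fields["text"]).split()
        return Instance(fields=fields)
```

With this shape, a dataset that the generic conversion cannot handle only needs a subclass that overrides `row_to_instance`, while loading and caching stay delegated to `load_dataset`.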
@epwalsh I raised a draft PR with slightly unbaked but functional code that @divijbajaj and I wrote last week. It should give a rough direction. I would appreciate a high-level review if time permits.