This repository has been archived by the owner on Dec 16, 2022. It is now read-only.
Add HuggingfaceDatasetReader for using Huggingface datasets #5095
Open

ghost wants to merge 67 commits into allenai:main from Abhishek-P:datasets_feature
Changes from all commits (67 commits):
af9661e  Add `HuggingfaceDatasetReader` for using Huggingface `datasets` (Abhishek-P)
8370803  Move mapping to funcs, remove preload support (Abhishek-P)
d5b8f3f  Support for Sequence Nesting (Abhishek-P)
49fa0bc  Misc Fixes (Abhishek-P)
17cd4ac  Misc check (Abhishek-P)
5159f69  Comments (Abhishek-P)
7155a32  map funcs _ prefix (Abhishek-P)
eb4b573  Parameters rename and cleanup (Abhishek-P)
a9ef475  Apply suggestions from code review by Dirk - comment text
92d95f5  Merge branch 'main' into datasets_feature (dirkgr)
0e441da  Merge branch 'main' into datasets_feature (dirkgr)
2610df8  Formatting (dirkgr)
a0d1408  Comments addressed (Abhishek-P)
57b6f9e  Formatting (Abhishek-P)
e841b6e  removed invalid conll test (Abhishek-P)
2497b24  Regression Fix (Abhishek-P)
a6718f4  Merge branch 'allenai:main' into datasets_feature
74931dc  Add float mapping to TensorField (Abhishek-P)
10dd3e6  Verification tests (Abhishek-P)
f3e54dd  Attempt to Support Dict (Abhishek-P)
d0f31c1  Quick changes (Abhishek-P)
b277534  Dictionary works with SQUAD (Abhishek-P)
a1d9bca  Bias Mitigation and Direction Methods (#5130) (ArjunSubramonian)
5dce9f5  Bias Metrics (#5139) (ArjunSubramonian)
dfed580  Update transformers requirement from <4.6,>=4.1 to >=4.1,<4.7 (#5199) (dependabot[bot])
f1a1adc  Rename sanity_checks to confidence_checks (#5201) (AkshitaB)
047ae34  Changes and improvements to how we initialize transformer modules fro… (epwalsh)
0ea9225  Add a `min_steps` parameter to `BeamSearch` (#5207) (danieldeutsch)
9de5b4e  Implementing abstraction to score final sequences in `BeamSearch` (#5… (danieldeutsch)
5660670  added shuffle disable option in BucketBatchSampler (#5212) (ArjunSubramonian)
73e570b  save meta data with model archives (#5209) (epwalsh)
f3aeeeb  Formatting (dirkgr)
d6c7769  Comments addressed (Abhishek-P)
79f58a8  Formatting (Abhishek-P)
a55a7ba  removed invalid conll test (Abhishek-P)
81d0409  Regression Fix (Abhishek-P)
5b9e0c2  Bump black from 20.8b1 to 21.5b1 (#5195) (dependabot[bot])
66f226b  Update nr-interface requirement from <0.0.4 to <0.0.6 (#5213) (dependabot[bot])
3295bd5  Fix W&B callback for distributed training (#5223) (epwalsh)
19d2a87  cancel redundant GH Actions workflows (#5226) (epwalsh)
51a01fe  fix race condition when extracting files with cached_path (#5227) (epwalsh)
7727af5  Bump checklist from 0.0.10 to 0.0.11 (#5222) (dependabot[bot])
0d5b88f  Added `DataCollator` for dynamic operations for each batch. (#5221) (wlhgtc)
b75c60c  Roll backbone (#5229) (jacob-morrison)
fd0981c  Fixes Checkpointing (#5220) (dirkgr)
804fd59  Emergency fix. I forgot to take this out. (dirkgr)
deeec84  Add constraints to beam search (#5216) (danieldeutsch)
0bdee9d  Make BeamSearch Registrable (#5231) (JohnGiorgi)
8e10f69  tick version for nightly release (epwalsh)
7b8e9e9  Generalize T5 modules (#5166) (AkshitaB)
3916cf3  Fix tqdm logging into multiple files with allennlp-optuna (#5235) (MagiaSN)
4753906  Checklist fixes (#5239) (AkshitaB)
b7a62fa  Contextualized bias mitigation (#5176) (ArjunSubramonian)
1159432  Prepare for release v2.5.0 (epwalsh)
5f76b59  tick version for nightly release (epwalsh)
044e0ff  Bump black from 21.5b1 to 21.5b2 (#5236) (dependabot[bot])
b7fd842  [Docs] Fixes broken link in Fairness_Metrics (#5245) (bhadreshpsavani)
38c930b  Ensure all relevant allennlp submodules are imported with `import_plu… (epwalsh)
0e3a225  added `on_backward` trainer callback (#5249) (ArjunSubramonian)
69d05ff  Add float mapping to TensorField (Abhishek-P)
356b383  Verification tests (Abhishek-P)
3192d70  Attempt to Support Dict (Abhishek-P)
e32c5b0  Quick changes (Abhishek-P)
5f702ef  Dictionary works with SQUAD (Abhishek-P)
fd95128  Merge branch 'datasets_feature' of github.com:Abhishek-P/allennlp int… (Abhishek-P)
af029b3  Fix typing issues (Abhishek-P)
41b7034  Works for Mocha, although may need to add specific handling for SQUAD… (Abhishek-P)
@@ -45,6 +45,7 @@ __pycache__
.coverage
.pytest_cache/
.benchmarks
htmlcov/

# documentation build artifacts
allennlp/data/dataset_readers/huggingface_datasets_reader.py (337 additions, 0 deletions)

@@ -0,0 +1,337 @@
from allennlp.data import DatasetReader, Token, Field, Tokenizer
from allennlp.data.fields import TextField, LabelField, ListField, TensorField
from allennlp.data.instance import Instance
from datasets import load_dataset, DatasetDict, list_datasets
from datasets.features import (
    ClassLabel,
    Sequence,
    Translation,
    TranslationVariableLanguages,
    Value,
    FeatureType,
)

import torch
from typing import Iterable, Optional, Dict, List, Union

@DatasetReader.register("huggingface-datasets")
class HuggingfaceDatasetReader(DatasetReader):
    """
    Reads instances from the given huggingface supported dataset.

    This reader implementation wraps the huggingface datasets package.

    Registered as a `DatasetReader` with name `huggingface-datasets`.

    # Parameters

    dataset_name : `str`
        Name of the dataset from huggingface datasets the reader will be used for.
    config_name : `str`, optional (default=`None`)
        Configuration of the dataset (mandatory for some datasets).
    tokenizer : `Tokenizer`, optional (default=`None`)
        If specified, it is used to tokenize string and text fields from the dataset.
        This is useful since text in allennlp is dealt with as a series of tokens.
    """

    def __init__(
        self,
        dataset_name: str = None,
        config_name: Optional[str] = None,
        tokenizer: Optional[Tokenizer] = None,
        **kwargs,
    ) -> None:
        super().__init__(
            manual_distributed_sharding=True,
            manual_multiprocess_sharding=True,
            **kwargs,
        )

        # It would be cleaner to create a separate reader object for each different dataset
        if dataset_name not in list_datasets():
            raise ValueError(f"Dataset {dataset_name} not available in huggingface datasets")
        self.dataset: DatasetDict = DatasetDict()
        self.dataset_name = dataset_name
        self.config_name = config_name
        self.tokenizer = tokenizer

        self.features = None

    def load_dataset_split(self, split: str):
        if self.config_name is not None:
            self.dataset[split] = load_dataset(self.dataset_name, self.config_name, split=split)
        else:
            self.dataset[split] = load_dataset(self.dataset_name, split=split)

    def _read(self, file_path: str) -> Iterable[Instance]:
        """
        Reads the dataset and converts each entry into an AllenNLP-friendly instance.
        """
        if file_path is None:
            raise ValueError("parameter split cannot be None")

        # If the split is not loaded yet, load that specific split
        if file_path not in self.dataset:
            self.load_dataset_split(file_path)
            if self.features is None:
                self.features = self.dataset[file_path].features

        # TODO see if use of Dataset.select() is better
        dataset_split = self.dataset[file_path]
        for index in self.shard_iterable(range(len(dataset_split))):
            yield self.text_to_instance(file_path, dataset_split[index])

    def raise_feature_not_supported_value_error(feature_name, feature_type):
        raise ValueError(
            f"Datasets feature {feature_name} type {feature_type} is not supported yet."
        )

    def text_to_instance(self, split: str, entry) -> Instance:  # type: ignore
        """
        Takes care of converting a dataset entry into an AllenNLP-friendly instance.

        Currently, this is how `datasets.features` types are mapped to AllenNLP Fields:

        dataset.feature type            allennlp.data.fields
        `ClassLabel`                    `LabelField` in feature name namespace
        `Value.string`                  `TextField` with value as Token
        `Value.*`                       `LabelField` with value being label in feature name namespace
        `Translation`                   `ListField` of 2 ListField (ClassLabel and TextField)
        `TranslationVariableLanguages`  `ListField` of 2 ListField (ClassLabel and TextField)
        `Sequence`                      `ListField` of sub-types
        """

[Review comment on lines +95 to +101: There is a proper Markdown table syntax: https://www.markdownguide.org/extended-syntax/ @epwalsh, do we support that syntax when the docs are built?]

        # features indicate the different information available in each entry from the dataset
        # feature types decide what type of information they are
        # e.g. in a sentiment dataset, an entry could have one feature (of type text/string) indicating the text
        # and another indicating the sentiment (of type int32/ClassLabel)
        features: Dict[str, FeatureType] = self.dataset[split].features

        fields: Dict[str, Field] = dict()

        # TODO we need to support all the different datasets features described
        # in https://huggingface.co/docs/datasets/features.html
        for feature_name in features:
            item_field: Field
            field_list: list

[Review comment on lines +115 to +116: These are unused?]

            feature_type = features[feature_name]

            fields_to_be_added = _map_Feature(
                feature_name, entry[feature_name], feature_type, self.tokenizer
            )
            for field_key in fields_to_be_added:
                fields[field_key] = fields_to_be_added[field_key]

        return Instance(fields)

# Feature Mappers - These functions map a FeatureType into Fields
def _map_Feature(
    feature_name: str, value, feature_type, tokenizer: Optional[Tokenizer]
) -> Dict[str, Field]:
    fields_to_be_added: Dict[str, Field] = dict()
    if isinstance(feature_type, ClassLabel):
        fields_to_be_added[feature_name] = _map_ClassLabel(feature_name, value)
    # datasets Value can be of different types
    elif isinstance(feature_type, Value):
        fields_to_be_added[feature_name] = _map_Value(feature_name, value, feature_type, tokenizer)

    elif isinstance(feature_type, Sequence):
        if type(feature_type.feature) == dict:
            fields_to_be_added = _map_Dict(feature_type.feature, value, tokenizer, feature_name)
        else:
            fields_to_be_added[feature_name] = _map_Sequence(
                feature_name, value, feature_type.feature, tokenizer
            )

    elif isinstance(feature_type, Translation):
        fields_to_be_added = _map_Translation(feature_name, value, feature_type, tokenizer)

    elif isinstance(feature_type, TranslationVariableLanguages):
        fields_to_be_added = _map_TranslationVariableLanguages(
            feature_name, value, feature_type, tokenizer
        )

    elif isinstance(feature_type, dict):
        fields_to_be_added = _map_Dict(feature_type, value, tokenizer)
    else:
        raise ValueError(f"Datasets feature type {type(feature_type)} is not supported yet.")
    return fields_to_be_added


def _map_ClassLabel(feature_name: str, value: ClassLabel) -> Field:
    field: Field = _map_to_Label(feature_name, value, skip_indexing=True)
    return field


def _map_Value(
    feature_name: str, value: Value, feature_type, tokenizer: Optional[Tokenizer]
) -> Union[TextField, LabelField, TensorField]:
    field: Union[TextField, LabelField, TensorField]
    if feature_type.dtype == "string":
        # datasets.Value[string] maps to TextField
        # If a tokenizer is provided, we will use it to split the text into tokens
        # Else put the whole text as a single token
        field = _map_String(value, tokenizer)

    elif feature_type.dtype == "float32" or feature_type.dtype == "float64":
        field = _map_Float(value)

    else:
        field = LabelField(value, label_namespace=feature_name, skip_indexing=True)
    return field

def _map_Sequence(
    feature_name, value: Sequence, item_feature_type, tokenizer: Optional[Tokenizer]
) -> ListField:
    field_list: List[Field] = list()
    field: ListField = list()
    item_field: Field
    # In HF, Sequence and list are considered interchangeable, but there are some distinctions
    if isinstance(item_feature_type, Value):
        for item in value:
            # If a tokenizer is provided, we will use it to split the text into tokens
            # Else put the whole text as a single token
            item_field = _map_Value(feature_name, item, item_feature_type, tokenizer)
            field_list.append(item_field)
        if len(field_list) > 0:
            field = ListField(field_list)

    # datasets Sequence of strings to ListField of LabelField
    elif isinstance(item_feature_type, str):
        for item in value:
            # If a tokenizer is provided, we will use it to split the text into tokens
            # Else put the whole text as a single token
            item_field = _map_Value(feature_name, item, item_feature_type, tokenizer)
            field_list.append(item_field)
        if len(field_list) > 0:
            field = ListField(field_list)

    elif isinstance(item_feature_type, ClassLabel):
        for item in value:
            item_field = _map_to_Label(feature_name, item, skip_indexing=True)
            field_list.append(item_field)

        if len(field_list) > 0:
            field = ListField(field_list)

    elif isinstance(item_feature_type, Sequence):
        for item in value:
            item_field = _map_Sequence(value.feature, item, item_feature_type.feature, tokenizer)
            field_list.append(item_field)

        if len(field_list) > 0:
            field = ListField(field_list)

    # WIP for drop
    # elif isinstance(item_feature_type, dict):
    #     for item in value:
    #         item_field = _map_Dict(item_feature_type, value[item], tokenizer)
    #         field_list.append(item_field)
    #     if len(field_list) > 0:
    #         field = ListField(field_list)

    else:
        HuggingfaceDatasetReader.raise_feature_not_supported_value_error(
            feature_name, item_feature_type
        )

    return field

def _map_Translation(
    feature_name: str, value: Translation, feature_type, tokenizer: Optional[Tokenizer]
) -> Dict[str, Field]:
    fields: Dict[str, Field] = dict()
    if feature_type.dtype == "dict":
        input_dict = value
        langs = list(input_dict.keys())
        texts = list()
        for lang in langs:
            if tokenizer is not None:
                tokens = tokenizer.tokenize(input_dict[lang])
            else:
                tokens = [Token(input_dict[lang])]
            texts.append(TextField(tokens))

        fields[feature_name + "-languages"] = ListField(
            [
                _map_to_Label(feature_name + "-languages", lang, skip_indexing=False)
                for lang in langs
            ]
        )
        fields[feature_name + "-texts"] = ListField(texts)

    else:
        raise ValueError(f"Datasets feature type {type(feature_type)} is not supported yet.")

    return fields


def _map_TranslationVariableLanguages(
    feature_name: str,
    value: TranslationVariableLanguages,
    feature_type,
    tokenizer: Optional[Tokenizer],
) -> Dict[str, Field]:
    fields: Dict[str, Field] = dict()
    if feature_type.dtype == "dict":
        input_dict = value
        fields[feature_name + "-language"] = ListField(
            [
                _map_to_Label(feature_name + "-languages", lang, skip_indexing=False)
                for lang in input_dict["language"]
            ]
        )

        if tokenizer is not None:
            fields[feature_name + "-translation"] = ListField(
                [TextField(tokenizer.tokenize(text)) for text in input_dict["translation"]]
            )
        else:
            fields[feature_name + "-translation"] = ListField(
                [TextField([Token(text)]) for text in input_dict["translation"]]
            )

    else:
        raise ValueError(f"Datasets feature type {type(value)} is not supported yet.")

    return fields

# value mapper - Maps a single text value to TextField
def _map_String(text: str, tokenizer: Optional[Tokenizer]) -> TextField:
    field: TextField
    if tokenizer is not None:
        field = TextField(tokenizer.tokenize(text))
    else:
        field = TextField([Token(text)])
    return field


def _map_Float(value: float) -> TensorField:
    return TensorField(torch.tensor(value))


# value mapper - Maps a single value to a LabelField
def _map_to_Label(namespace, item, skip_indexing=True) -> LabelField:
    return LabelField(label=item, label_namespace=namespace, skip_indexing=skip_indexing)


def _map_Dict(
    feature_definition: dict,
    values: dict,
    tokenizer: Optional[Tokenizer] = None,
    feature_name: Optional[str] = None,
) -> Dict[str, Field]:
    # TODO abhishek-p expand this to be more generic based on metadata checks
    # Map it as a Dictionary of List
    fields: Dict[str, Field] = dict()
    for key in values:
        key_name: str = key
        if feature_name is not None:
            key_name = feature_name + "-" + key
        fields[key_name] = _map_Sequence(key, values[key], feature_definition[key], tokenizer)
    return fields
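
For context (not part of the diff): a minimal sketch of how the reader added in this PR could be exercised once merged. The choice of the "glue" dataset with the "cola" config and AllenNLP's WhitespaceTokenizer are illustrative assumptions, not something the PR prescribes; the point is that the usual file-path argument is interpreted as the split name.

```python
# Minimal usage sketch (assumes the "glue"/"cola" dataset and WhitespaceTokenizer;
# any dataset supported by huggingface datasets and any AllenNLP Tokenizer would do).
from allennlp.data.tokenizers import WhitespaceTokenizer

reader = HuggingfaceDatasetReader(
    dataset_name="glue",
    config_name="cola",
    tokenizer=WhitespaceTokenizer(),
)

# The reader treats the file-path argument as the split name, so this ends up calling
# datasets.load_dataset("glue", "cola", split="train") under the hood.
for instance in reader.read("train"):
    # For cola this should yield a "sentence" TextField, a "label" LabelField (ClassLabel),
    # and an "idx" LabelField (int32 Value), per the mapping table in the docstring above.
    print(instance)
    break
```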
Review discussion on the reader's internal DatasetDict:

This is basically a cache, so if you load the same dataset twice it doesn't load it twice?

I am a bit confused by this, because Huggingface already has their own cache. So we're caching it twice. Calling datasets.load_dataset("squad", split="train") takes about 200ms on my machine once all the files are downloaded. That's not a lot of time to save with a cache.

From a programming perspective having this dict is cleaner. Even the datasets lib gives you a DatasetDict. And this is not a cache, since the reference is still to the same dataset object given by the datasets lib. When a split is loaded it is a dataset; for the reader to maintain the organization of splits, I am using this DatasetDict.
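
To make the caching point concrete, here is a rough sketch (assuming the datasets package is installed and the SQuAD files have already been downloaded) of the behaviour described above: repeated load_dataset calls are served from Huggingface's own on-disk cache, so the reader-level DatasetDict mainly keeps loaded splits organized rather than saving time.

```python
import time

from datasets import load_dataset

# Once the files are downloaded, both calls below are served from Huggingface's
# on-disk cache; the reviewer reports roughly 200ms per call on their machine.
for attempt in range(2):
    start = time.perf_counter()
    squad_train = load_dataset("squad", split="train")
    print(f"load_dataset call {attempt + 1}: {time.perf_counter() - start:.2f}s")
```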