generated from ApolloResearch/sample
EDIT: See comments in thread for updates
Currently the script hardcodes loading the gpt2 tokenizer and loads the dataset from file. We'll want to allow loading different tokenizers and datasets from huggingface.
In general I think we'll need to support:
- The dataset and tokenizer will be hosted on huggingface.
- The pre-tokenized dataset will be hosted on huggingface (so we don't have to tokenize it on the fly every time we train).
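For the on-the-fly case, the core loop is just tokenizing streamed text examples and packing the ids into fixed-length chunks. A minimal sketch (the packing helper, toy tokenizer, and context length here are illustrative assumptions, not the repo's actual code):

```python
def tokenize_and_pack(examples, tokenize, context_length):
    """Tokenize streamed {"text": ...} examples on the fly and pack the
    resulting token ids into fixed-length chunks (illustrative sketch;
    leftover ids shorter than context_length are dropped)."""
    buffer = []
    for example in examples:
        buffer.extend(tokenize(example["text"]))
        while len(buffer) >= context_length:
            yield buffer[:context_length]
            buffer = buffer[context_length:]

# With huggingface this would be driven by something like (assumed usage):
#   from datasets import load_dataset
#   ds = load_dataset("some/dataset", split="train", streaming=True)
#   chunks = tokenize_and_pack(iter(ds), tokenizer.encode, 1024)

# Stand-in tokenizer so the sketch runs without any downloads.
toy_tokenize = lambda text: [ord(c) for c in text]
chunks = list(tokenize_and_pack([{"text": "hello world"}], toy_tokenize, 4))
```

For the pre-tokenized case, the `tokenize` step would be skipped and the stored token ids consumed directly.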
I think we can just get away with using huggingface's load_dataset with streaming=True. An example is here, which supports loading tokenized or untokenized datasets. Then we would just need to set it up to work for DDP. Not sure of the easiest way; there are probably standard setups here, maybe using a distributed sampler.
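For DDP with a streaming dataset, each rank needs a disjoint slice of the stream. The datasets library provides split_dataset_by_node for this; the underlying idea is rank-strided iteration, which can be sketched with stdlib only (the helper below is illustrative, not the library's implementation):

```python
from itertools import islice

def shard_for_rank(examples, rank, world_size):
    """Give each DDP rank every world_size-th example from the stream,
    starting at its own rank offset -- the same effect a distributed
    sampler has for an iterable dataset."""
    return islice(examples, rank, None, world_size)

# In practice the library helper would be used instead (assumed usage):
#   from datasets.distributed import split_dataset_by_node
#   ds = split_dataset_by_node(streaming_ds, rank=rank, world_size=world_size)

# Two ranks over ten examples: disjoint shards that together cover the stream.
shards = [list(shard_for_rank(range(10), rank, 2)) for rank in range(2)]
```

Strided sharding keeps every rank's iterator cheap (no coordination between processes), at the cost of each rank reading and skipping the examples it doesn't keep.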