Allow for training on custom and tiktoken tokenizer #8

@danbraunai

Description

EDIT: See comments in thread for updates

Currently the script hardcodes loading the gpt2 tokenizer and loads the dataset from a local file. We want to support loading different tokenizers (including tiktoken ones) and datasets from huggingface.

In general I think we'll need to support:

  1. The raw dataset and tokenizer are hosted on huggingface, and we tokenize on the fly.
  2. A pre-tokenized dataset is hosted on huggingface (so we don't have to tokenize it on the fly every time we train).
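A minimal sketch of handling both cases with one code path (the names `prepare_stream` and `tokenize_fn` are hypothetical, and the `"text"`/`"input_ids"` column names are assumptions about the dataset schema):

```python
from typing import Callable, Dict, Iterable, Iterator, List


def prepare_stream(
    examples: Iterable[Dict],
    tokenize_fn: Callable[[str], List[int]],
) -> Iterator[List[int]]:
    """Yield token-id lists, tokenizing on the fly only when needed.

    Case 1: raw text examples ({"text": ...}) are tokenized here.
    Case 2: pre-tokenized examples ({"input_ids": ...}) pass through.
    """
    for example in examples:
        if "input_ids" in example:
            yield example["input_ids"]
        elif "text" in example:
            yield tokenize_fn(example["text"])
        else:
            raise KeyError("example has neither 'input_ids' nor 'text'")
```

In practice `examples` would be the iterable returned by `load_dataset(..., streaming=True)`, and `tokenize_fn` could wrap either a huggingface tokenizer's `encode` or a tiktoken encoding's `encode`, which keeps the tokenizer choice behind a single callable.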

I think we can just get away with using huggingface's load_dataset with streaming=True. An example is here, which supports loading tokenized or untokenized datasets. Then we would just need to set it up to work for DDP. Not sure of the easiest way; there are probably standard setups for this, maybe using a distributed sampler.
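For DDP over a streamed dataset, one common setup is round-robin sharding, where each rank keeps every world_size-th example (this is roughly what `datasets.distributed.split_dataset_by_node` falls back to for an iterable dataset; `shard_for_rank` below is a hypothetical helper, not part of any library):

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def shard_for_rank(stream: Iterable[T], rank: int, world_size: int) -> Iterator[T]:
    """Round-robin shard a stream: rank r keeps examples r, r + world_size, ...

    Every rank iterates the same underlying stream, so each example is seen
    by exactly one rank and no rank needs random access to the dataset.
    """
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} out of range for world_size {world_size}")
    return islice(stream, rank, None, world_size)
```

A distributed sampler would achieve the same partitioning for a map-style dataset, but samplers need `len()` and indexing, which a streamed dataset doesn't provide; that's why the iterable-side sharding above (or `split_dataset_by_node`) is the usual fit for streaming=True.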
