Allow for training on custom and tiktoken tokenizer #8

@danbraunai

Description

EDIT: See comments in thread for updates

Currently the script hardcodes loading the gpt2 tokenizer and loads the dataset from a local file. We want to support loading different tokenizers (including tiktoken ones) and datasets from huggingface.

In general I think we'll need to support:

  1. The raw dataset and tokenizer are hosted on huggingface, and we tokenize on the fly.
  2. A pre-tokenized dataset is hosted on huggingface (so we don't have to tokenize it on the fly every time we train).
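A minimal sketch of handling both cases with one code path (the names `prepare_stream` and `tokenize_fn` are hypothetical, and the `"text"`/`"input_ids"` column names are assumptions about the dataset schema):

```python
from typing import Callable, Dict, Iterable, Iterator, List


def prepare_stream(
    examples: Iterable[Dict],
    tokenize_fn: Callable[[str], List[int]],
) -> Iterator[List[int]]:
    """Yield token-id lists, tokenizing on the fly only when needed.

    Case 1: raw text examples ({"text": ...}) are tokenized here.
    Case 2: pre-tokenized examples ({"input_ids": ...}) pass through.
    """
    for example in examples:
        if "input_ids" in example:
            yield example["input_ids"]
        elif "text" in example:
            yield tokenize_fn(example["text"])
        else:
            raise KeyError("example has neither 'input_ids' nor 'text'")
```

In practice `examples` would be the iterable returned by `load_dataset(..., streaming=True)`, and `tokenize_fn` could wrap either a huggingface tokenizer's `encode` or a tiktoken encoding's `encode`, which keeps the tokenizer choice behind a single callable.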

I think we can just get away with using huggingface's load_dataset with streaming=True. An example is here, which supports loading tokenized or untokenized datasets. Then we would just need to set it up to work for DDP. Not sure of the easiest way; there are probably standard setups for this, maybe using a distributed sampler.
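For DDP over a streamed dataset, one common setup is round-robin sharding, where each rank keeps every world_size-th example (this is roughly what `datasets.distributed.split_dataset_by_node` falls back to for an iterable dataset; `shard_for_rank` below is a hypothetical helper, not part of any library):

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def shard_for_rank(stream: Iterable[T], rank: int, world_size: int) -> Iterator[T]:
    """Round-robin shard a stream: rank r keeps examples r, r + world_size, ...

    Every rank iterates the same underlying stream, so each example is seen
    by exactly one rank and no rank needs random access to the dataset.
    """
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} out of range for world_size {world_size}")
    return islice(stream, rank, None, world_size)
```

A distributed sampler would achieve the same partitioning for a map-style dataset, but samplers need `len()` and indexing, which a streamed dataset doesn't provide; that's why the iterable-side sharding above (or `split_dataset_by_node`) is the usual fit for streaming=True.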
