- I did this project to learn, but I hope others can also benefit from seeing a simple implementation.
- This repository contains a script for multi-GPU training of GPT2 on a subset of FineWeb-Edu, plus a notebook for loading the resulting model and streaming generated text from it.
- How to run this code (Linux and macOS are supported):
  - Install pixi if you haven't already:
    - On Linux:
      curl -fsSL https://pixi.sh/install.sh | sh
    - On macOS: TODO
  - Then restart your terminal, clone this repo, navigate into it, and activate the pixi environment:
      git clone https://github.com/JaHeRoth/reimplementing.git
      cd reimplementing
      pixi install
      pixi shell
  - Finally, run run.py like you would any other Python script:
      python run.py
- Learnings:
- Exact network architectures of GPT and GPT2 (down to the level of every individual nn.Parameter)
- Inner workings of the AdamW optimizer (a hand-written update step is sketched after this list)
- LLM sampling tricks, including implementing temperature and nucleus sampling (sketched after this list)
- Sequence packing (sketched after this list)
- Using HuggingFace tokenizers and datasets (sketched after this list)
- The PyTorch stack
- GPU tricks (kernel fusion through torch.compile, optimizing tensor sizes)
- Mixed-precision training (see the combined sketch after this list)
- Distributed training with DistributedDataParallel (see the combined sketch after this list)
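
The sketch below makes the AdamW learning concrete: a single update step for one parameter tensor, written out by hand to show how the decoupled weight decay acts directly on the weights rather than being folded into the gradient. The function name and default hyperparameters are illustrative and not necessarily the ones used in run.py.

```python
import torch

def adamw_step(param, grad, exp_avg, exp_avg_sq, step,
               lr=3e-4, betas=(0.9, 0.95), eps=1e-8, weight_decay=0.1):
    """One hand-written AdamW update for a single parameter tensor (illustrative)."""
    beta1, beta2 = betas
    # Decoupled weight decay: shrink the weights directly, independent of the gradient.
    param.mul_(1 - lr * weight_decay)
    # Exponential moving averages of the gradient and its square.
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
    # Bias-corrected Adam-style step: param -= lr * m_hat / (sqrt(v_hat) + eps).
    bias_c1 = 1 - beta1 ** step
    bias_c2 = 1 - beta2 ** step
    denom = (exp_avg_sq / bias_c2).sqrt().add_(eps)
    param.addcdiv_(exp_avg, denom, value=-lr / bias_c1)
```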
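Next, a minimal sketch of temperature and nucleus (top-p) sampling, assuming `logits` is the model's output for the last position with shape `(vocab_size,)`; the default values are arbitrary.

```python
import torch
import torch.nn.functional as F

def sample_next_token(logits: torch.Tensor, temperature: float = 0.8, top_p: float = 0.95) -> int:
    # Temperature scaling: T < 1 sharpens the distribution, T > 1 flattens it.
    probs = F.softmax(logits / temperature, dim=-1)
    # Nucleus sampling: keep only the smallest set of tokens whose cumulative
    # probability exceeds top_p, then renormalize and sample within that set.
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    exclusive_cumsum = torch.cumsum(sorted_probs, dim=-1) - sorted_probs
    sorted_probs[exclusive_cumsum > top_p] = 0.0  # the top token is always kept
    sorted_probs /= sorted_probs.sum()
    choice = torch.multinomial(sorted_probs, num_samples=1)
    return int(sorted_idx[choice])
```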
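A sketch of sequence packing: tokenized documents are concatenated, separated by the EOS token, and the resulting stream is sliced into fixed-length blocks so that no compute is spent on padding. `docs`, `eos_id`, and `block_size` are placeholder names.

```python
import torch

def pack_sequences(docs: list[list[int]], eos_id: int, block_size: int) -> torch.Tensor:
    stream: list[int] = []
    for doc in docs:
        stream.extend(doc)
        stream.append(eos_id)  # marks the document boundary
    # Drop the incomplete tail, then reshape into (n_blocks, block_size).
    n_blocks = len(stream) // block_size
    stream = stream[: n_blocks * block_size]
    return torch.tensor(stream, dtype=torch.long).view(n_blocks, block_size)
```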
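A sketch of loading a FineWeb-Edu subset and tokenizing it with the GPT2 tokenizer via HuggingFace datasets and transformers. The `sample-10BT` config name is an assumption about which subset is used; the actual choice lives in run.py.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Stream a FineWeb-Edu sample (the exact subset used by run.py may differ).
dataset = load_dataset("HuggingFaceFW/fineweb-edu", name="sample-10BT",
                       split="train", streaming=True)
tokenizer = AutoTokenizer.from_pretrained("gpt2")

for example in dataset.take(2):
    token_ids = tokenizer(example["text"])["input_ids"]
    print(len(token_ids), token_ids[:10])
```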
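Finally, a combined sketch of mixed-precision training under DistributedDataParallel: one process per GPU, with the forward pass run under a bfloat16 autocast while the master weights stay in float32. It assumes a torchrun-style launch that sets LOCAL_RANK; `model`, `optimizer`, and `loader` stand in for the real objects in run.py.

```python
import os
import torch
import torch.distributed as dist
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP

def train(model, optimizer, loader):
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    model = DDP(model.to(local_rank), device_ids=[local_rank])
    for inputs, targets in loader:
        inputs, targets = inputs.to(local_rank), targets.to(local_rank)
        # Mixed precision: run the forward pass in bfloat16 where it is safe to do so.
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
            logits = model(inputs)
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        optimizer.zero_grad(set_to_none=True)
        loss.backward()  # DDP overlaps the gradient all-reduce with backprop here
        optimizer.step()
    dist.destroy_process_group()
```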
- Sources: "Attention Is All You Need", the GPT paper, the GPT2 paper, the GELU paper, the AdamW paper, the weight-tying paper (arXiv:1608.05859), the nucleus-sampling paper (arXiv:1904.09751), Karpathy's lectures and tutorials, among others.