All models are trained or fine-tuned with embedding dimension (emb) = 1024 and context length (ctx) = 1024.
Model | Layers | Heads | Params | Size | Loss |
---|---|---|---|---|---|
GPT2-medium | 24 | 16 | 354M | 1.3GB | 3.026 |
Model | Layers | Heads | Params | Size | Loss (scratch) | Loss (distillation) |
---|---|---|---|---|---|---|
GPT-student | 8 | 8 | 152M | 584MB | 4.9918 | 4.1676 |
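Both models share the emb = 1024 and ctx = 1024 settings above and differ mainly in depth and head count. A minimal sketch of the two configurations using a nanoGPT-style dataclass (the `GPTConfig` name and its fields are assumptions, not necessarily the repo's actual config):

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    # hypothetical config container; field names follow nanoGPT conventions
    block_size: int = 1024    # context length (ctx)
    n_embd: int = 1024        # embedding dimension (emb)
    n_layer: int = 12
    n_head: int = 12
    vocab_size: int = 50257   # GPT-2 BPE vocabulary size

# teacher: GPT2-medium (24 layers, 16 heads, ~354M params)
teacher_cfg = GPTConfig(n_layer=24, n_head=16)

# student: GPT-student (8 layers, 8 heads, ~152M params)
student_cfg = GPTConfig(n_layer=8, n_head=8)
```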
- Causal language model loss (clm loss): Cross Entropy Loss
- Logits matching (ce loss): Kullback-Leibler Divergence Loss
- Hidden state matching (cosine loss): Cosine Embedding Loss (a sketch combining the three terms is shown below)
- GPT2-medium: pre-trained teacher, fine-tuned on the dataset
- GPT-student: student trained either from scratch or with distillation from the teacher
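A minimal PyTorch sketch of how these three terms can be combined into a single distillation objective (the weights, temperature, and tensor names are illustrative assumptions, not the repo's exact implementation):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits,
                      student_hidden, teacher_hidden,
                      targets, temperature=2.0,
                      w_clm=0.5, w_ce=0.5, w_cos=0.1):
    """Combine clm, ce, and cosine losses; weights and temperature are illustrative."""
    vocab = student_logits.size(-1)
    hidden = student_hidden.size(-1)

    # clm loss: cross entropy of student logits against the gold next tokens
    clm = F.cross_entropy(student_logits.view(-1, vocab), targets.view(-1))

    # ce loss: KL divergence between temperature-softened teacher and student distributions
    ce = F.kl_div(F.log_softmax(student_logits.view(-1, vocab) / temperature, dim=-1),
                  F.softmax(teacher_logits.view(-1, vocab) / temperature, dim=-1),
                  reduction="batchmean") * temperature ** 2

    # cosine loss: pull student hidden states toward the teacher's
    s_h = student_hidden.view(-1, hidden)
    t_h = teacher_hidden.view(-1, hidden)
    cos = F.cosine_embedding_loss(s_h, t_h, torch.ones(s_h.size(0), device=s_h.device))

    return w_clm * clm + w_ce * ce + w_cos * cos
```

The cosine term can compare hidden states directly only because teacher and student share the same embedding dimension (1024); with mismatched widths, a projection layer would be needed.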
pip install -r requirements.txt
cd data/shakespeare
python prepare.py
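For context, a hypothetical sketch of what a nanoGPT-style `prepare.py` typically does, assuming GPT-2 BPE tokenization, a 90/10 split, and the tiny-Shakespeare download URL (the repo's actual script may differ):

```python
# Hypothetical sketch of a nanoGPT-style prepare.py -- not necessarily the repo's actual script.
import os
import numpy as np
import requests
import tiktoken

# download the tiny Shakespeare text if it is not already present (URL is an assumption)
input_path = "input.txt"
if not os.path.exists(input_path):
    url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
    with open(input_path, "w") as f:
        f.write(requests.get(url).text)

with open(input_path) as f:
    data = f.read()

# 90/10 train/val split, encoded with the GPT-2 BPE tokenizer
n = len(data)
enc = tiktoken.get_encoding("gpt2")
train_ids = enc.encode_ordinary(data[: int(n * 0.9)])
val_ids = enc.encode_ordinary(data[int(n * 0.9):])

# store token ids as uint16 binaries that the training scripts can memory-map
np.array(train_ids, dtype=np.uint16).tofile("train.bin")
np.array(val_ids, dtype=np.uint16).tofile("val.bin")
```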
bash run_adamw/finetune_gpt2m.sh
bash run_adamw/train_student.sh
bash run_adamw/train_student_distill.sh