- GPT2-Nepali is a GPT-2 model pretrained on a 12.5 GB Nepali dataset from the NepBERTa project [1].
This directory contains scripts for preprocessing the NepBERTa dataset:
- cleaning
- pre-tokenizing
- data preparation with `context_length = stride = 512` (see the sketch below)
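
A minimal sketch of what the data-preparation step can look like, assuming the corpus has already been tokenized into a flat list of token IDs; the class and variable names are illustrative, not taken from this repository. With `stride` equal to `context_length`, the windows do not overlap.

```python
# Illustrative sketch: chunk a tokenized corpus into fixed-length samples.
import torch
from torch.utils.data import Dataset


class NepaliGPTDataset(Dataset):
    def __init__(self, token_ids, context_length=512, stride=512):
        self.inputs, self.targets = [], []
        # Each sample is an (input, target) pair where the target is the
        # input shifted by one token, as required for next-token prediction.
        for i in range(0, len(token_ids) - context_length, stride):
            chunk = token_ids[i : i + context_length + 1]
            self.inputs.append(torch.tensor(chunk[:-1]))
            self.targets.append(torch.tensor(chunk[1:]))

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return self.inputs[idx], self.targets[idx]
```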
This directory includes tools and scripts for:
- Training a custom tokenizer for the Nepali dataset (see the sketch after this list).
- Visualizing and analyzing token distributions.
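
As an illustration of the tokenizer-training step, the sketch below trains a byte-level BPE tokenizer with the Hugging Face `tokenizers` library; the corpus path, vocabulary size, and output directory are assumptions for demonstration, not the repository's actual settings.

```python
# Illustrative sketch: train a byte-level BPE tokenizer on a Nepali corpus.
import os

from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["nepali_corpus.txt"],          # hypothetical path to the cleaned text
    vocab_size=50_000,                    # assumed vocabulary size
    min_frequency=2,
    special_tokens=["<|endoftext|>"],     # GPT-2 style end-of-text token
)

os.makedirs("tokenizer_output", exist_ok=True)
tokenizer.save_model("tokenizer_output")  # writes vocab.json and merges.txt
```

The saved `vocab.json` and `merges.txt` can then be loaded back for pre-tokenizing the corpus and for inference.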
This directory contains the core code for:
- Training the GPT2 model on the Nepali dataset.
- Running inference with the trained model (see the sketch below).
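
A hedged sketch of what inference with the trained model can look like. It assumes the model returns next-token logits of shape `(batch, seq, vocab)` and that the tokenizer exposes `encode`/`decode`; these are assumptions about the surrounding code, not guarantees about this repository's interfaces.

```python
# Illustrative sketch: autoregressive text generation with a trained GPT-2 style model.
import torch


@torch.no_grad()
def generate(model, tokenizer, prompt, max_new_tokens=50,
             context_length=512, temperature=1.0, device="cpu"):
    model.eval()
    ids = torch.tensor([tokenizer.encode(prompt)], device=device)
    for _ in range(max_new_tokens):
        # Crop the context to the model's maximum supported length.
        logits = model(ids[:, -context_length:])
        next_logits = logits[:, -1, :] / temperature
        probs = torch.softmax(next_logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        ids = torch.cat([ids, next_id], dim=1)
    return tokenizer.decode(ids[0].tolist())
```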
Note: Most of the code in this section is adapted from the book *Build a Large Language Model (From Scratch)* by Sebastian Raschka and the corresponding GitHub repository, LLMs-from-scratch.
- Multi-GPU training (PyTorch DDP); a minimal sketch follows this list.
- Training on a larger dataset with a larger model size.
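
For reference, a minimal sketch of what multi-GPU training with PyTorch DistributedDataParallel could look like, launched with `torchrun --nproc_per_node=<num_gpus> train.py`; `build_model` and `build_dataset` are hypothetical placeholders, not this repository's code.

```python
# Illustrative sketch: single-node multi-GPU training with PyTorch DDP.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler


def main():
    dist.init_process_group(backend="nccl")      # one process per GPU (via torchrun)
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = build_model().to(local_rank)         # hypothetical model factory
    model = DDP(model, device_ids=[local_rank])

    dataset = build_dataset()                    # hypothetical dataset factory
    sampler = DistributedSampler(dataset)        # shards the data across ranks
    loader = DataLoader(dataset, batch_size=8, sampler=sampler)

    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
    for epoch in range(1):
        sampler.set_epoch(epoch)                 # reshuffle shards each epoch
        for inputs, targets in loader:
            inputs, targets = inputs.to(local_rank), targets.to(local_rank)
            logits = model(inputs)
            loss = torch.nn.functional.cross_entropy(
                logits.flatten(0, 1), targets.flatten()
            )
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```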
- [1] NepBERTa
- GitHub: rasbt/LLMs-from-scratch
- GitHub: karpathy/nanoGPT
- Other Nepali Language Models: