
GPT2-Nepali

  • GPT2-Nepali is a GPT2 model pretrained on a 12.5 GB Nepali dataset from the NepBERTa project [1].

1_preprocessing

This directory contains scripts for preprocessing the NepBERTa dataset:

  • Cleaning
  • Pre-tokenizing
  • Data preparation: context_length = stride = 512 (a minimal sketch follows this list)
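
The sketch below illustrates the windowing step, assuming the pre-tokenized corpus is available as one long list of token IDs. The class name and variable names are illustrative, not the repository's actual code.

```python
import torch
from torch.utils.data import Dataset


class NepaliWindowDataset(Dataset):
    """Splits a token-ID stream into (input, target) windows for next-token prediction."""

    def __init__(self, token_ids, context_length=512, stride=512):
        self.inputs = []
        self.targets = []
        # With stride == context_length the windows do not overlap.
        for start in range(0, len(token_ids) - context_length, stride):
            chunk = token_ids[start : start + context_length + 1]
            self.inputs.append(torch.tensor(chunk[:-1]))   # tokens 0..511
            self.targets.append(torch.tensor(chunk[1:]))   # tokens 1..512 (shifted by one)

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return self.inputs[idx], self.targets[idx]
```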

2_tokenizer

This directory includes tools and scripts for:

  • Training a custom tokenizer for the Nepali dataset (a minimal training sketch follows this list).
  • Visualizing and analyzing token distributions.
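
As a rough illustration of the tokenizer-training step, the sketch below trains a byte-level BPE tokenizer with the Hugging Face `tokenizers` library. The corpus filename, vocabulary size, and special tokens are assumptions, not values taken from this repository.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Byte-level BPE handles Devanagari text without a language-specific alphabet.
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(
    vocab_size=50000,                        # assumed target vocabulary size
    special_tokens=["<unk>", "<endoftext>"],
)
tokenizer.train(files=["nepali_corpus_cleaned.txt"], trainer=trainer)
tokenizer.save("nepali_tokenizer.json")
```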

3_GPT2-Nepali

This directory contains the core code for:

  • Training the GPT2 model on the Nepali dataset.

  • Running inference with the trained model (a minimal generation sketch follows the note below).

    Note: Most of the code in this section is adapted from the book: Build a Large Language Model (From Scratch) by Sebastian Raschka and the corresponding GitHub repository: LLMs-from-scratch.
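
As a rough illustration of the inference step, the sketch below shows a generic autoregressive sampling loop in the style of the LLMs-from-scratch codebase. The `generate` function and its parameters are assumptions for illustration, not the repository's actual API; `model` is assumed to map (batch, seq_len) token IDs to logits of shape (batch, seq_len, vocab_size).

```python
import torch


def generate(model, idx, max_new_tokens, context_length=512, temperature=1.0):
    model.eval()
    for _ in range(max_new_tokens):
        # Keep only the most recent tokens that fit in the context window.
        idx_cond = idx[:, -context_length:]
        with torch.no_grad():
            logits = model(idx_cond)
        # Sample the next token from the distribution at the last position.
        logits = logits[:, -1, :] / temperature
        probs = torch.softmax(probs_input := logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_token], dim=1)
    return idx
```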

Todo

  • Multi-GPU training with PyTorch DDP (a minimal setup sketch follows this list).
  • Train on a larger dataset with a larger model.
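
The sketch below shows what the planned DDP setup could look like, assuming a `torchrun` launch; none of this is implemented in the repository yet.

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def setup_ddp():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for every process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    return local_rank


def wrap_model(model):
    local_rank = setup_ddp()
    model = model.to(local_rank)
    # DDP synchronizes gradients across processes during backward().
    return DDP(model, device_ids=[local_rank])

# Launch with: torchrun --nproc_per_node=<num_gpus> train.py
```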

References

  1. NepBERTa
  2. Book: Build a Large Language Model (From Scratch)
  3. GitHub: rasbt/LLMs-from-scratch
  4. GitHub: karpathy/nanoGPT
  5. Other Nepali Language Models: