
GPT2-Nepali

  • GPT2-Nepali is a GPT2 model pretrained on a 12.5 GB Nepali dataset from the NepBERTa project [1].

1_preprocessing

This directory contains scripts for preprocessing the NepBERTa dataset:

  • Cleaning
  • Pre-tokenizing
  • Data preparation: context_length = stride = 512 (a minimal sketch follows this list)
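
The sketch below illustrates the windowing step, assuming the pre-tokenized corpus is available as one long list of token IDs. The class name and variable names are illustrative, not the repository's actual code.

```python
import torch
from torch.utils.data import Dataset


class NepaliWindowDataset(Dataset):
    """Splits a token-ID stream into (input, target) windows for next-token prediction."""

    def __init__(self, token_ids, context_length=512, stride=512):
        self.inputs = []
        self.targets = []
        # With stride == context_length the windows do not overlap.
        for start in range(0, len(token_ids) - context_length, stride):
            chunk = token_ids[start : start + context_length + 1]
            self.inputs.append(torch.tensor(chunk[:-1]))   # tokens 0..511
            self.targets.append(torch.tensor(chunk[1:]))   # tokens 1..512 (shifted by one)

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return self.inputs[idx], self.targets[idx]
```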

2_tokenizer

This directory includes tools and scripts for:

  • Training a custom tokenizer for the Nepali dataset (a minimal training sketch follows this list).
  • Visualizing and analyzing token distributions.
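
As a rough illustration of the tokenizer-training step, the sketch below trains a byte-level BPE tokenizer with the Hugging Face `tokenizers` library. The corpus filename, vocabulary size, and special tokens are assumptions, not values taken from this repository.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Byte-level BPE handles Devanagari text without a language-specific alphabet.
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(
    vocab_size=50000,                        # assumed target vocabulary size
    special_tokens=["<unk>", "<endoftext>"],
)
tokenizer.train(files=["nepali_corpus_cleaned.txt"], trainer=trainer)
tokenizer.save("nepali_tokenizer.json")
```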

3_GPT2-Nepali

This directory contains the core code for:

  • Training the GPT2 model on the Nepali dataset.

  • Running inference with the trained model (a minimal generation sketch follows the note below).

    Note: Most of the code in this section is adapted from the book: Build a Large Language Model (From Scratch) by Sebastian Raschka and the corresponding GitHub repository: LLMs-from-scratch.
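
As a rough illustration of the inference step, the sketch below shows a generic autoregressive sampling loop in the style of the LLMs-from-scratch codebase. The `generate` function and its parameters are assumptions for illustration, not the repository's actual API; `model` is assumed to map (batch, seq_len) token IDs to logits of shape (batch, seq_len, vocab_size).

```python
import torch


def generate(model, idx, max_new_tokens, context_length=512, temperature=1.0):
    model.eval()
    for _ in range(max_new_tokens):
        # Keep only the most recent tokens that fit in the context window.
        idx_cond = idx[:, -context_length:]
        with torch.no_grad():
            logits = model(idx_cond)
        # Sample the next token from the distribution at the last position.
        logits = logits[:, -1, :] / temperature
        probs = torch.softmax(probs_input := logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_token], dim=1)
    return idx
```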

Todo

  • Multi-GPU training with PyTorch DDP (a minimal setup sketch follows this list).
  • Train on a larger dataset with a larger model.
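
The sketch below shows what the planned DDP setup could look like, assuming a `torchrun` launch; none of this is implemented in the repository yet.

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def setup_ddp():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for every process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    return local_rank


def wrap_model(model):
    local_rank = setup_ddp()
    model = model.to(local_rank)
    # DDP synchronizes gradients across processes during backward().
    return DDP(model, device_ids=[local_rank])

# Launch with: torchrun --nproc_per_node=<num_gpus> train.py
```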

References

  1. NepBERTa
  2. Book: Build a Large Language Model (From Scratch)
  3. GitHub: rasbt/LLMs-from-scratch
  4. GitHub: karpathy/nanoGPT
  5. Other Nepali Language Models: