This repository contains a Jupyter Notebook (makGPT.ipynb) implementing the makGPT project. The notebook includes code, explanations, and examples for training, evaluating, and interacting with a GPT-based model. The project implements a Transformer-style (decoder-only) language model for the Macedonian language.
This notebook walks through preprocessing scraped text data, building a simple character-level, decoder-only Transformer language model in PyTorch, training it, and generating new text. Each section is broken into smaller chunks with explanations.
- Load standard Python libraries for regex and counting
- Import PyTorch and functional API
- Read in the raw scraped text file
- Remove URLs to shrink the vocabulary
- Strip out any disallowed characters (keeping Cyrillic letters, punctuation, digits, and newlines)
- Collapse multiple spaces (but preserve blank lines)
This standardizes the input before tokenization.
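The cleaning steps above can be sketched as follows. This is a hedged reconstruction, not the notebook's literal code: the exact regex patterns and the set of allowed punctuation characters are assumptions.

```python
import re

def clean_text(raw: str) -> str:
    """Sketch of the cleaning pipeline (regex patterns are assumptions)."""
    # Remove URLs to shrink the vocabulary
    text = re.sub(r"https?://\S+", "", raw)
    # Keep only Cyrillic letters, digits, basic punctuation, spaces, and newlines
    text = re.sub(r"[^\u0400-\u04FF0-9 \n.,!?;:()'\"-]", "", text)
    # Collapse runs of spaces, but leave newlines (and thus blank lines) intact
    text = re.sub(r"[ ]{2,}", " ", text)
    return text
```

For example, `clean_text("Види https://ex.mk/страница сега")` drops the URL and collapses the resulting double space.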
- Extract the sorted list of unique characters (our character-level vocabulary)
- Compute vocabulary size (# unique characters)
- Also do a quick word‐level frequency check for the top/bottom 10 words
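A minimal sketch of the vocabulary extraction, using a toy stand-in corpus (the real notebook runs this over the cleaned scraped text):

```python
from collections import Counter

text = "мачка куче мачка"  # tiny stand-in for the cleaned corpus

# Character-level vocabulary: sorted list of unique characters
chars = sorted(set(text))
vocab_size = len(chars)

# Quick word-level frequency check (most / least common words)
word_counts = Counter(text.split())
top = word_counts.most_common(10)
```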
- Set random seed for reproducibility
- Define batch size, context window (`block_size`), learning parameters, and model dimensions
- Prepare the `stoi`/`itos` maps and split the data into train/validation tensors
- `get_batch(split)` samples random context windows for training/validation
- `estimate_loss()` runs multiple eval batches without gradient updates
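A sketch of the encoding maps and the batch sampler; the hyperparameter values and the stand-in corpus are illustrative, and `estimate_loss()` is omitted here because it needs a model:

```python
import torch

torch.manual_seed(1337)        # reproducibility (seed value is an assumption)
block_size, batch_size = 8, 4  # context window and batch size (illustrative)

text = "пример текст за обука"  # stand-in for the cleaned corpus
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}  # character -> index
itos = {i: ch for ch, i in stoi.items()}      # index -> character

data = torch.tensor([stoi[c] for c in text], dtype=torch.long)
n = int(0.9 * len(data))
train_data, val_data = data[:n], data[n:]

def get_batch(split):
    """Sample random context windows and their next-character targets."""
    d = train_data if split == "train" else val_data
    ix = torch.randint(len(d) - block_size, (batch_size,))
    x = torch.stack([d[i:i + block_size] for i in ix])
    y = torch.stack([d[i + 1:i + 1 + block_size] for i in ix])
    return x, y
```

Note that the targets are the inputs shifted one position to the right, so every position in the window provides a training example.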
Define the attention head, multi-head mechanism, feed-forward block, and the residual Transformer block.
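A single causal self-attention head along these lines might look like the sketch below; the sizes are illustrative and the notebook's exact class may differ:

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

class Head(nn.Module):
    """One head of causal (masked) self-attention."""

    def __init__(self, n_embd, head_size, block_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # Causal mask: each position may attend only to earlier positions
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5  # scaled attention scores
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        wei = F.softmax(wei, dim=-1)
        return wei @ v  # (B, T, head_size)
```

Multiple such heads are concatenated into the multi-head mechanism, followed by the feed-forward block, with residual connections around both.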
- Token + position embeddings
- Stack of Transformer blocks
- Final linear head for next‐token logits
- Loss calculation when targets provided
- Sampling method for text generation
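The overall model can be sketched compactly as below. Here `nn.TransformerEncoderLayer` with a causal mask stands in for the notebook's hand-written blocks, all sizes are illustrative, and the class name is hypothetical:

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

class MiniGPT(nn.Module):
    def __init__(self, vocab_size, n_embd=32, n_head=4, n_layer=2, block_size=8):
        super().__init__()
        self.block_size = block_size
        self.tok_emb = nn.Embedding(vocab_size, n_embd)    # token embeddings
        self.pos_emb = nn.Embedding(block_size, n_embd)    # position embeddings
        layer = nn.TransformerEncoderLayer(
            n_embd, n_head, dim_feedforward=4 * n_embd,
            batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layer)
        self.lm_head = nn.Linear(n_embd, vocab_size)       # next-token logits

    def forward(self, idx, targets=None):
        B, T = idx.shape
        x = self.tok_emb(idx) + self.pos_emb(torch.arange(T, device=idx.device))
        mask = nn.Transformer.generate_square_subsequent_mask(T)  # causal mask (CPU here)
        x = self.blocks(x, mask=mask)
        logits = self.lm_head(x)
        loss = None
        if targets is not None:  # loss only when targets are provided
            loss = F.cross_entropy(logits.view(B * T, -1), targets.view(B * T))
        return logits, loss

    @torch.no_grad()
    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -self.block_size:]      # crop to the context window
            logits, _ = self(idx_cond)
            probs = F.softmax(logits[:, -1, :], dim=-1)
            idx = torch.cat([idx, torch.multinomial(probs, 1)], dim=1)
        return idx
```

Sampling works token by token: each step feeds the last `block_size` indices through the model and draws the next index from the softmax distribution over the final position's logits.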
- Create the model, move it to device
- Count parameters
- Set up AdamW optimizer
- For each iteration, optionally evaluate loss on train/val
- Sample a batch, compute loss, backpropagate, and step the optimizer
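The training loop can be sketched as follows; the lookup-table model and the random data are toy stand-ins (assumptions) so that the loop runs on its own:

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"

# Toy stand-ins for the real model and corpus
vocab_size, block_size, batch_size = 12, 8, 4
model = nn.Embedding(vocab_size, vocab_size).to(device)  # logits from a lookup table
data = torch.randint(vocab_size, (200,))

def get_batch():
    ix = torch.randint(len(data) - block_size - 1, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix]).to(device)
    y = torch.stack([data[i + 1:i + 1 + block_size] for i in ix]).to(device)
    return x, y

@torch.no_grad()
def estimate_loss(eval_iters=10):
    """Average the loss over several eval batches, with no gradient updates."""
    model.eval()
    losses = torch.zeros(eval_iters)
    for k in range(eval_iters):
        xb, yb = get_batch()
        logits = model(xb)
        losses[k] = F.cross_entropy(logits.view(-1, vocab_size), yb.view(-1))
    model.train()
    return losses.mean().item()

# Count parameters and set up the AdamW optimizer
n_params = sum(p.numel() for p in model.parameters())
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2)

for step in range(50):
    xb, yb = get_batch()                 # sample a batch
    logits = model(xb)
    loss = F.cross_entropy(logits.view(-1, vocab_size), yb.view(-1))
    optimizer.zero_grad(set_to_none=True)
    loss.backward()                      # backpropagate
    optimizer.step()                     # update the parameters
```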
- Seed with an empty context and generate 2,000 new tokens
- Decode indices back to characters
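The decode step maps sampled indices back to characters via the `itos` map; a minimal sketch with a toy alphabet, where the sampled indices stand in for the output of the model's generation method:

```python
chars = sorted(set("абв "))                    # toy character vocabulary
itos = {i: ch for i, ch in enumerate(chars)}   # index -> character
decode = lambda ids: "".join(itos[i] for i in ids)

sampled = [1, 2, 0, 3]  # stand-in for generated token indices
print(decode(sampled))
```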
- Clone the repository.
- Install dependencies.
- Open makGPT.ipynb in Jupyter Notebook or JupyterLab.
- Run the cells sequentially to train and test the model.
