Smilkoski/makGPT


makGPT

Overview

This repository contains a Jupyter Notebook (makGPT.ipynb) implementing the makGPT project. The notebook includes code, explanations, and examples for training, evaluating, and interacting with a GPT-based model. The project implements a decoder-only, Transformer-style language model for the Macedonian language.

Description

Bigram Language Model from Scratch

This notebook walks through preprocessing scraped text data, building a simple bigram Transformer–style language model in PyTorch, training it, and generating new text. Each section is broken into smaller chunks with explanations.

1. Imports and Initial Setup

  • Load standard Python libraries for regex and counting
  • Import PyTorch and functional API
  • Read in the raw scraped text file
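A minimal sketch of this setup (the corpus file name is an assumption; substitute the notebook's actual path):

```python
import re
from collections import Counter

import torch
import torch.nn as nn
from torch.nn import functional as F

# The corpus path below is an assumption; use the scraped file's real name.
corpus_path = "scraped_text.txt"
try:
    with open(corpus_path, encoding="utf-8") as f:
        text = f.read()
except FileNotFoundError:
    text = ""  # fall back to an empty corpus so the sketch still runs
```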

2. Text Cleaning and Preprocessing

  • Remove URLs to shrink the vocabulary
  • Strip out any disallowed characters (keeping Cyrillic letters, punctuation, digits, and newlines)
  • Collapse multiple spaces (but preserve blank lines)

This standardizes the input before tokenization.
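These three cleaning passes can be sketched with `re` alone (the sample string and the exact character whitelist are illustrative assumptions, not the notebook's actual patterns):

```python
import re

raw = "Прочитај на https://example.mk веднаш!   Втора    реченица.\n\nНов пасус."

# 1. Remove URLs so they do not inflate the character-level vocabulary.
cleaned = re.sub(r"https?://\S+", "", raw)

# 2. Drop anything outside Cyrillic letters, digits, common punctuation,
#    spaces and newlines (this whitelist is an assumption).
cleaned = re.sub(r"[^\u0400-\u04FF0-9 \n.,!?;:\-'\"()]", "", cleaned)

# 3. Collapse runs of spaces/tabs while preserving blank lines.
cleaned = re.sub(r"[ \t]+", " ", cleaned)
```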

3. Vocabulary Analysis

  • Extract the sorted list of unique characters (our character-level vocabulary)
  • Compute vocabulary size (# unique characters)
  • Also do a quick word‐level frequency check for the top/bottom 10 words
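On a toy string, the vocabulary and word-frequency checks look like this:

```python
from collections import Counter

text = "мачка мачка куче"  # stand-in for the cleaned corpus

chars = sorted(set(text))          # character-level vocabulary
vocab_size = len(chars)            # number of unique characters

word_counts = Counter(text.split())
top_words = word_counts.most_common(10)       # most frequent words
rare_words = word_counts.most_common()[-10:]  # least frequent words
```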

4. Hyperparameters and Data Splits

  • Set random seed for reproducibility
  • Define batch size, context window (block_size), learning parameters, model dimensions
  • Prepare stoi/itos maps and split data into train/validation tensors
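A compact sketch of this step (the seed, all hyperparameter values, and the 90/10 split ratio are illustrative assumptions):

```python
import torch

torch.manual_seed(1337)            # seed value assumed

batch_size = 32                    # sequences per training batch
block_size = 128                   # context window length
learning_rate = 3e-4
n_embd, n_head, n_layer = 128, 4, 4

text = "абвгд " * 100              # stand-in for the cleaned corpus
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}   # char -> integer id
itos = {i: ch for ch, i in stoi.items()}       # integer id -> char
encode = lambda s: [stoi[c] for c in s]
decode = lambda ids: "".join(itos[i] for i in ids)

data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9 * len(data))           # 90/10 train/validation split
train_data, val_data = data[:n], data[n:]
```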

5. Batch Generation and Loss Estimation

  • get_batch(split) samples random context windows for training/validation
  • estimate_loss() runs multiple eval batches without gradient updates
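Sketches of the two helpers, on a toy token stream (batch and block sizes are illustrative):

```python
import torch

torch.manual_seed(0)
batch_size, block_size = 4, 8
data = torch.arange(1000)                          # toy token stream
splits = {"train": data[:900], "val": data[900:]}

def get_batch(split):
    """Sample random context windows; targets are inputs shifted by one."""
    d = splits[split]
    ix = torch.randint(len(d) - block_size, (batch_size,))
    x = torch.stack([d[i:i + block_size] for i in ix])
    y = torch.stack([d[i + 1:i + block_size + 1] for i in ix])
    return x, y

@torch.no_grad()
def estimate_loss(model, eval_iters=10):
    """Average the loss over several batches per split, without gradient updates."""
    out = {}
    model.eval()
    for split in ("train", "val"):
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            xb, yb = get_batch(split)
            _, loss = model(xb, yb)
            losses[k] = loss
        out[split] = losses.mean().item()
    model.train()
    return out
```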

6. Model Components

Define the attention head, multi-head mechanism, feed-forward block, and the residual Transformer block.
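Those components typically look like the following (dropout rate and the 4x feed-forward expansion are common defaults, assumed here rather than confirmed details of the notebook):

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

class Head(nn.Module):
    """One head of causal self-attention."""
    def __init__(self, n_embd, head_size, block_size, dropout=0.1):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5       # scaled scores
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))  # causal mask
        wei = self.dropout(F.softmax(wei, dim=-1))
        return wei @ v

class MultiHeadAttention(nn.Module):
    """Several attention heads in parallel, concatenated then projected."""
    def __init__(self, n_embd, n_head, block_size):
        super().__init__()
        head_size = n_embd // n_head
        self.heads = nn.ModuleList(Head(n_embd, head_size, block_size) for _ in range(n_head))
        self.proj = nn.Linear(n_embd, n_embd)

    def forward(self, x):
        return self.proj(torch.cat([h(x) for h in self.heads], dim=-1))

class FeedForward(nn.Module):
    """Position-wise MLP with the usual 4x hidden expansion."""
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd), nn.ReLU(), nn.Linear(4 * n_embd, n_embd))

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """Transformer block: attention then MLP, each with a residual connection."""
    def __init__(self, n_embd, n_head, block_size):
        super().__init__()
        self.sa = MultiHeadAttention(n_embd, n_head, block_size)
        self.ffwd = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x
```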

7. Bigram Language Model Definition

  • Token + position embeddings
  • Stack of Transformer blocks
  • Final linear head for next‐token logits
  • Loss calculation when targets provided
  • Sampling method for text generation
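A trimmed, self-contained skeleton of such a model: to keep the sketch short, the stack of Transformer blocks is replaced by `nn.Identity()` as a placeholder, and the class name `LMSkeleton` is hypothetical.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

class LMSkeleton(nn.Module):
    def __init__(self, vocab_size, n_embd, block_size):
        super().__init__()
        self.block_size = block_size
        self.token_emb = nn.Embedding(vocab_size, n_embd)     # token embeddings
        self.pos_emb = nn.Embedding(block_size, n_embd)       # position embeddings
        self.blocks = nn.Identity()        # placeholder for the Transformer blocks
        self.ln_f = nn.LayerNorm(n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size)          # next-token logits

    def forward(self, idx, targets=None):
        B, T = idx.shape
        tok = self.token_emb(idx)
        pos = self.pos_emb(torch.arange(T, device=idx.device))
        x = self.ln_f(self.blocks(tok + pos))
        logits = self.lm_head(x)
        if targets is None:
            return logits, None
        # Loss is computed only when targets are provided.
        loss = F.cross_entropy(logits.view(B * T, -1), targets.view(-1))
        return logits, loss

    @torch.no_grad()
    def generate(self, idx, max_new_tokens):
        """Sample new tokens one at a time, cropping to the context window."""
        for _ in range(max_new_tokens):
            logits, _ = self(idx[:, -self.block_size:])
            probs = F.softmax(logits[:, -1, :], dim=-1)
            idx = torch.cat([idx, torch.multinomial(probs, 1)], dim=1)
        return idx
```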

8. Model Instantiation and Optimizer

  • Create the model, move it to device
  • Count parameters
  • Set up AdamW optimizer
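In sketch form (a tiny linear layer stands in for the full model, and the learning rate is an assumption):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(8, 8).to(device)   # stand-in for the full model

# Count trainable parameters.
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.6f}M parameters")

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
```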

9. Training Loop

  • For each iteration, optionally evaluate loss on train/val
  • Sample a batch, compute loss, backpropagate, and step the optimizer
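The shape of such a loop, demonstrated on a toy regression problem (the notebook itself samples batches with `get_batch` and uses cross-entropy; the fixed batch and MSE loss here are stand-ins):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
model = torch.nn.Linear(4, 4)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2)

x, y = torch.randn(64, 4), torch.randn(64, 4)  # toy batch
max_iters, eval_interval = 200, 50
first_loss = None

for step in range(max_iters):
    if step % eval_interval == 0:              # periodic evaluation point
        with torch.no_grad():
            eval_loss = F.mse_loss(model(x), y).item()
    loss = F.mse_loss(model(x), y)             # forward pass + loss
    if first_loss is None:
        first_loss = loss.item()
    optimizer.zero_grad(set_to_none=True)      # clear old gradients
    loss.backward()                            # backpropagate
    optimizer.step()                           # update parameters
```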

10. Text Generation

  • Seed with an empty context and generate 2,000 new tokens
  • Decode indices back to characters
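The decode step can be illustrated without the trained model (the token ids here are random stand-ins; the real call would be something like `model.generate(context, max_new_tokens=2000)`):

```python
import torch

torch.manual_seed(0)
itos = {0: "а", 1: "б", 2: "в"}                # toy id -> character map

context = torch.zeros((1, 1), dtype=torch.long)  # empty seed context
generated = torch.randint(0, 3, (1, 20))         # pretend these came from the model
decoded = "".join(itos[int(i)] for i in generated[0])
```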

11. Saving and Loading the Model
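A typical save/load round trip with `torch.save` and `load_state_dict` (the file name is an assumption, and a small linear layer stands in for the trained model):

```python
import os
import tempfile

import torch

model = torch.nn.Linear(4, 4)
path = os.path.join(tempfile.gettempdir(), "makgpt.pt")   # path assumed

torch.save(model.state_dict(), path)          # persist the learned weights only

restored = torch.nn.Linear(4, 4)              # must match the saved architecture
restored.load_state_dict(torch.load(path))
restored.eval()                               # switch to inference mode
```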

How to Run

  1. Clone the repository.
  2. Install dependencies.
  3. Open makGPT.ipynb in Jupyter Notebook or JupyterLab.
  4. Run cells sequentially to train and test the model.
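In commands, the steps above look roughly like this (the clone URL follows the repository name, and the dependency list is an assumption since no requirements file is mentioned):

```shell
git clone https://github.com/Smilkoski/makGPT.git
cd makGPT
pip install torch jupyter        # dependency list assumed
jupyter notebook makGPT.ipynb    # then run the cells top to bottom
```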

About

GPT for the Macedonian language
