Smilkoski/makGPT


makGPT

Overview

This repository contains a Jupyter Notebook (makGPT.ipynb) implementing the makGPT project. The notebook includes code, explanations, and examples for training, evaluating, and interacting with a GPT-based model. The project implements a decoder-only, Transformer-style language model for the Macedonian language.

Description

Bigram Language Model from Scratch

This notebook walks through preprocessing scraped text data, building a simple bigram Transformer–style language model in PyTorch, training it, and generating new text. Each section is broken into smaller chunks with explanations.

1. Imports and Initial Setup

  • Load standard Python libraries for regex and counting
  • Import PyTorch and functional API
  • Read in the raw scraped text file
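A minimal sketch of this setup (the corpus file name is an assumption; substitute the notebook's actual path):

```python
import re
from collections import Counter

import torch
import torch.nn as nn
from torch.nn import functional as F

# The corpus path below is an assumption; use the scraped file's real name.
corpus_path = "scraped_text.txt"
try:
    with open(corpus_path, encoding="utf-8") as f:
        text = f.read()
except FileNotFoundError:
    text = ""  # fall back to an empty corpus so the sketch still runs
```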

2. Text Cleaning and Preprocessing

  • Remove URLs to shrink the vocabulary
  • Strip out any disallowed characters (keeping Cyrillic letters, punctuation, digits, and newlines)
  • Collapse multiple spaces (but preserve blank lines)

This standardizes the input before tokenization.
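These three cleaning passes can be sketched with `re` alone (the sample string and the exact character whitelist are illustrative assumptions, not the notebook's actual patterns):

```python
import re

raw = "Прочитај на https://example.mk веднаш!   Втора    реченица.\n\nНов пасус."

# 1. Remove URLs so they do not inflate the character-level vocabulary.
cleaned = re.sub(r"https?://\S+", "", raw)

# 2. Drop anything outside Cyrillic letters, digits, common punctuation,
#    spaces and newlines (this whitelist is an assumption).
cleaned = re.sub(r"[^\u0400-\u04FF0-9 \n.,!?;:\-'\"()]", "", cleaned)

# 3. Collapse runs of spaces/tabs while preserving blank lines.
cleaned = re.sub(r"[ \t]+", " ", cleaned)
```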

3. Vocabulary Analysis

  • Extract the sorted list of unique characters (our character-level vocabulary)
  • Compute vocabulary size (# unique characters)
  • Also do a quick word‐level frequency check for the top/bottom 10 words
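On a toy string, the vocabulary and word-frequency checks look like this:

```python
from collections import Counter

text = "мачка мачка куче"  # stand-in for the cleaned corpus

chars = sorted(set(text))          # character-level vocabulary
vocab_size = len(chars)            # number of unique characters

word_counts = Counter(text.split())
top_words = word_counts.most_common(10)       # most frequent words
rare_words = word_counts.most_common()[-10:]  # least frequent words
```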

4. Hyperparameters and Data Splits

  • Set random seed for reproducibility
  • Define batch size, context window (block_size), learning parameters, model dimensions
  • Prepare stoi/itos maps and split data into train/validation tensors
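A compact sketch of this step (the seed, all hyperparameter values, and the 90/10 split ratio are illustrative assumptions):

```python
import torch

torch.manual_seed(1337)            # seed value assumed

batch_size = 32                    # sequences per training batch
block_size = 128                   # context window length
learning_rate = 3e-4
n_embd, n_head, n_layer = 128, 4, 4

text = "абвгд " * 100              # stand-in for the cleaned corpus
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}   # char -> integer id
itos = {i: ch for ch, i in stoi.items()}       # integer id -> char
encode = lambda s: [stoi[c] for c in s]
decode = lambda ids: "".join(itos[i] for i in ids)

data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9 * len(data))           # 90/10 train/validation split
train_data, val_data = data[:n], data[n:]
```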

5. Batch Generation and Loss Estimation

  • get_batch(split) samples random context windows for training/validation
  • estimate_loss() runs multiple eval batches without gradient updates
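Sketches of the two helpers, on a toy token stream (batch and block sizes are illustrative):

```python
import torch

torch.manual_seed(0)
batch_size, block_size = 4, 8
data = torch.arange(1000)                          # toy token stream
splits = {"train": data[:900], "val": data[900:]}

def get_batch(split):
    """Sample random context windows; targets are inputs shifted by one."""
    d = splits[split]
    ix = torch.randint(len(d) - block_size, (batch_size,))
    x = torch.stack([d[i:i + block_size] for i in ix])
    y = torch.stack([d[i + 1:i + block_size + 1] for i in ix])
    return x, y

@torch.no_grad()
def estimate_loss(model, eval_iters=10):
    """Average the loss over several batches per split, without gradient updates."""
    out = {}
    model.eval()
    for split in ("train", "val"):
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            xb, yb = get_batch(split)
            _, loss = model(xb, yb)
            losses[k] = loss
        out[split] = losses.mean().item()
    model.train()
    return out
```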

6. Model Components

Define the attention head, multi-head mechanism, feed-forward block, and the residual Transformer block.
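Those components typically look like the following (dropout rate and the 4x feed-forward expansion are common defaults, assumed here rather than confirmed details of the notebook):

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

class Head(nn.Module):
    """One head of causal self-attention."""
    def __init__(self, n_embd, head_size, block_size, dropout=0.1):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5       # scaled scores
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))  # causal mask
        wei = self.dropout(F.softmax(wei, dim=-1))
        return wei @ v

class MultiHeadAttention(nn.Module):
    """Several attention heads in parallel, concatenated then projected."""
    def __init__(self, n_embd, n_head, block_size):
        super().__init__()
        head_size = n_embd // n_head
        self.heads = nn.ModuleList(Head(n_embd, head_size, block_size) for _ in range(n_head))
        self.proj = nn.Linear(n_embd, n_embd)

    def forward(self, x):
        return self.proj(torch.cat([h(x) for h in self.heads], dim=-1))

class FeedForward(nn.Module):
    """Position-wise MLP with the usual 4x hidden expansion."""
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd), nn.ReLU(), nn.Linear(4 * n_embd, n_embd))

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """Transformer block: attention then MLP, each with a residual connection."""
    def __init__(self, n_embd, n_head, block_size):
        super().__init__()
        self.sa = MultiHeadAttention(n_embd, n_head, block_size)
        self.ffwd = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x
```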

7. Bigram Language Model Definition

  • Token + position embeddings
  • Stack of Transformer blocks
  • Final linear head for next‐token logits
  • Loss calculation when targets provided
  • Sampling method for text generation
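A trimmed, self-contained skeleton of such a model: to keep the sketch short, the stack of Transformer blocks is replaced by `nn.Identity()` as a placeholder, and the class name `LMSkeleton` is hypothetical.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

class LMSkeleton(nn.Module):
    def __init__(self, vocab_size, n_embd, block_size):
        super().__init__()
        self.block_size = block_size
        self.token_emb = nn.Embedding(vocab_size, n_embd)     # token embeddings
        self.pos_emb = nn.Embedding(block_size, n_embd)       # position embeddings
        self.blocks = nn.Identity()        # placeholder for the Transformer blocks
        self.ln_f = nn.LayerNorm(n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size)          # next-token logits

    def forward(self, idx, targets=None):
        B, T = idx.shape
        tok = self.token_emb(idx)
        pos = self.pos_emb(torch.arange(T, device=idx.device))
        x = self.ln_f(self.blocks(tok + pos))
        logits = self.lm_head(x)
        if targets is None:
            return logits, None
        # Loss is computed only when targets are provided.
        loss = F.cross_entropy(logits.view(B * T, -1), targets.view(-1))
        return logits, loss

    @torch.no_grad()
    def generate(self, idx, max_new_tokens):
        """Sample new tokens one at a time, cropping to the context window."""
        for _ in range(max_new_tokens):
            logits, _ = self(idx[:, -self.block_size:])
            probs = F.softmax(logits[:, -1, :], dim=-1)
            idx = torch.cat([idx, torch.multinomial(probs, 1)], dim=1)
        return idx
```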

8. Model Instantiation and Optimizer

  • Create the model, move it to device
  • Count parameters
  • Set up AdamW optimizer
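In sketch form (a tiny linear layer stands in for the full model, and the learning rate is an assumption):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(8, 8).to(device)   # stand-in for the full model

# Count trainable parameters.
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.6f}M parameters")

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
```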

9. Training Loop

  • For each iteration, optionally evaluate loss on train/val
  • Sample a batch, compute loss, backpropagate, and step the optimizer
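The shape of such a loop, demonstrated on a toy regression problem (the notebook itself samples batches with `get_batch` and uses cross-entropy; the fixed batch and MSE loss here are stand-ins):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
model = torch.nn.Linear(4, 4)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2)

x, y = torch.randn(64, 4), torch.randn(64, 4)  # toy batch
max_iters, eval_interval = 200, 50
first_loss = None

for step in range(max_iters):
    if step % eval_interval == 0:              # periodic evaluation point
        with torch.no_grad():
            eval_loss = F.mse_loss(model(x), y).item()
    loss = F.mse_loss(model(x), y)             # forward pass + loss
    if first_loss is None:
        first_loss = loss.item()
    optimizer.zero_grad(set_to_none=True)      # clear old gradients
    loss.backward()                            # backpropagate
    optimizer.step()                           # update parameters
```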

10. Text Generation

  • Seed with an empty context and generate 2,000 new tokens
  • Decode indices back to characters
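The decode step can be illustrated without the trained model (the token ids here are random stand-ins; the real call would be something like `model.generate(context, max_new_tokens=2000)`):

```python
import torch

torch.manual_seed(0)
itos = {0: "а", 1: "б", 2: "в"}                # toy id -> character map

context = torch.zeros((1, 1), dtype=torch.long)  # empty seed context
generated = torch.randint(0, 3, (1, 20))         # pretend these came from the model
decoded = "".join(itos[int(i)] for i in generated[0])
```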

11. Saving and Loading the Model
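A typical save/load round trip with `torch.save` and `load_state_dict` (the file name is an assumption, and a small linear layer stands in for the trained model):

```python
import os
import tempfile

import torch

model = torch.nn.Linear(4, 4)
path = os.path.join(tempfile.gettempdir(), "makgpt.pt")   # path assumed

torch.save(model.state_dict(), path)          # persist the learned weights only

restored = torch.nn.Linear(4, 4)              # must match the saved architecture
restored.load_state_dict(torch.load(path))
restored.eval()                               # switch to inference mode
```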

How to Run

  1. Clone the repository.
  2. Install dependencies.
  3. Open makGPT.ipynb in Jupyter Notebook or JupyterLab.
  4. Run cells sequentially to train and test the model.
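In commands, the steps above look roughly like this (the clone URL follows the repository name, and the dependency list is an assumption since no requirements file is mentioned):

```shell
git clone https://github.com/Smilkoski/makGPT.git
cd makGPT
pip install torch jupyter        # dependency list assumed
jupyter notebook makGPT.ipynb    # then run the cells top to bottom
```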

About

GPT for the Macedonian language
