55 changes: 55 additions & 0 deletions records/track_10min_16mb/2026-04-05_David_MoE-Bigram4096/README.md
@@ -0,0 +1,55 @@
# [10min/16mb] David Ghazaryan — MoE + BigramHash4096

**val_bpb: 1.1180** (3-seed mean) | 8×H100 SXM | 600s

## Results

| Seed | val_bpb | Artifact (bytes) |
|------|---------|-----------------|
| 1337 | 1.11764880 | 15,873,596 |
| 42 | 1.11891002 | 15,893,104 |
| 2025 | 1.11742168 | 15,908,116 |
| **mean** | **1.11799350** | 15,891,605 |
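The mean row can be recomputed from the three per-seed rows. The following is a minimal sketch that only re-derives the table's summary values; the variable names are illustrative and are not taken from `train_gpt.py`.

```python
from statistics import fmean

# Per-seed numbers copied from the results table.
val_bpb = {1337: 1.11764880, 42: 1.11891002, 2025: 1.11742168}
artifact_bytes = {1337: 15_873_596, 42: 15_893_104, 2025: 15_908_116}

# Arithmetic mean over the three seeds.
mean_bpb = fmean(val_bpb.values())
# The table reports the byte mean as a whole number, so round it.
mean_bytes = round(fmean(artifact_bytes.values()))

print(f"mean val_bpb: {mean_bpb:.8f}")
print(f"mean artifact bytes: {mean_bytes:,}")
```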

## Novel Contributions

### 1. BigramHash 4096
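Expands the bigram hash table from the 3072 buckets used by the previous SOTA to 4096.
With more buckets, fewer distinct bigrams share an embedding row, which is intended to give a richer local-context signal at the embedding stage.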
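As a rough illustration of the idea, the sketch below maps each (previous token, current token) pair to one of 4096 buckets; each bucket index would then select a row of a 4096 × 96 embedding table. The hash function, the multiplier `1_000_003`, and the handling of the first position are assumptions made for this example and may differ from what `train_gpt.py` does.

```python
NUM_BUCKETS = 4096  # matches BIGRAM_VOCAB_SIZE in the run command
BIGRAM_DIM = 96     # matches BIGRAM_DIM; width of each bucket's embedding row


def bigram_bucket(prev_token: int, token: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a (previous, current) token pair to a bucket index.

    The multiplier is a hypothetical choice for illustration only.
    """
    return (prev_token * 1_000_003 + token) % num_buckets


def bigram_buckets(tokens: list[int], num_buckets: int = NUM_BUCKETS) -> list[int]:
    """Return one bucket index per position.

    Position 0 has no predecessor, so this sketch assigns it bucket 0.
    """
    if not tokens:
        return []
    out = [0]
    for prev, cur in zip(tokens, tokens[1:]):
        out.append(bigram_bucket(prev, cur, num_buckets))
    return out


indices = bigram_buckets([17, 905, 17, 905, 3])
print(indices)
```

Identical bigrams always land in the same bucket, so the repeated pair `(17, 905)` in the example yields the same index both times.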

### 2. Mixture-of-Experts MLP (first in this repo)
Replaces the standard MLP with 4 experts and top-2 routing.
The number of active parameters per token is unchanged, but the experts are free to specialise.
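Top-2 routing over 4 experts can be sketched in plain Python as follows. This is a generic illustration under stated assumptions, not the PR's implementation: the softmax router, the renormalisation of the two selected weights, and the toy experts are all choices made for the example.

```python
import math


def top2_route(router_logits: list[float]) -> list[tuple[int, float]]:
    """Return the two highest-scoring experts with weights renormalised to sum to 1."""
    # Softmax over all experts, shifted by the max for numerical stability.
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the two most probable experts (ties keep the lower index first).
    top2 = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:2]
    norm = sum(probs[i] for i in top2)
    return [(i, probs[i] / norm) for i in top2]


def moe_forward(x: list[float], experts, router_logits: list[float]) -> list[float]:
    """Weighted sum of the two selected experts' outputs for a single token."""
    out = [0.0] * len(x)
    for idx, weight in top2_route(router_logits):
        y = experts[idx](x)
        out = [o + weight * v for o, v in zip(out, y)]
    return out


# Four toy "experts": each one scales its input by a different factor.
experts = [lambda v, s=s: [s * a for a in v] for s in (1.0, 2.0, 3.0, 4.0)]

routes = top2_route([0.1, 2.0, -1.0, 2.0])
y = moe_forward([1.0, -2.0], experts, [0.1, 2.0, -1.0, 2.0])
print(routes, y)
```

Only two of the four experts run for each token, which is how a mixture can hold more total parameters than it activates per token.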

## Architecture

| Component | Setting | Introduced by |
|-----------|---------|---------------|
| Layers | 11 (512d, 8 heads, 4 KV) | Baseline |
| MLP | 3x LeakyReLU(0.5)² | PR #493 |
| XSA | All 11 layers | PR #478 |
| EMA | decay=0.997 | PR #374 |
| Partial RoPE | 16/64 dims | PR #315 |
| LN Scale | 1/√(layer+1) | PR #315 |
| GPTQ | Full Hessian AR self-gen | PR #1019 |
| BigramHash | 4096 buckets dim=96 | This PR |
| MoE MLP | 4 experts top-2 | This PR |
| Compression | int6 + lzma | PR #414 |
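Two of the table's settings are simple formulas and can be written out directly. The sketch below assumes that `layer` in the LN scale is 0-indexed and that the EMA is the usual exponential moving average of the weights; neither detail is confirmed by this README, and the function names are hypothetical.

```python
import math

NUM_LAYERS = 11    # "Layers" row
EMA_DECAY = 0.997  # "EMA" row


def ln_scale(layer: int) -> float:
    """LN scale 1/sqrt(layer + 1); layer is assumed to be 0-indexed."""
    return 1.0 / math.sqrt(layer + 1)


def ema_update(ema: list[float], weights: list[float], decay: float = EMA_DECAY) -> list[float]:
    """One EMA step: ema <- decay * ema + (1 - decay) * weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]


# One scale factor per layer, shrinking with depth.
scales = [ln_scale(i) for i in range(NUM_LAYERS)]
# A single EMA step on a two-element toy weight vector.
ema = ema_update([1.0, 0.0], [0.0, 1.0])
print([round(s, 3) for s in scales], ema)
```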

## Requirements

```bash
pip install sentencepiece zstandard
pip install flash_attn_3 --find-links \
https://windreamer.github.io/flash-attention3-wheels/cu128_torch291
```

## Run Command

```bash
BIGRAM_VOCAB_SIZE=4096 BIGRAM_DIM=96 WARMDOWN_ITERS=4000 TARGET_MB=15.9 SEED=1337 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```
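The command above uses `SEED=1337`. To cover all three seeds from the results table, a small launcher along the following lines could be used. It assumes the other two seeds were run with the same environment variables, which this README does not state explicitly; `run_env` and `run_all` are hypothetical helper names.

```python
import os
import subprocess

SEEDS = [1337, 42, 2025]
# Environment variables copied from the run command.
BASE_ENV = {
    "BIGRAM_VOCAB_SIZE": "4096",
    "BIGRAM_DIM": "96",
    "WARMDOWN_ITERS": "4000",
    "TARGET_MB": "15.9",
}
CMD = ["torchrun", "--standalone", "--nproc_per_node=8", "train_gpt.py"]


def run_env(seed: int) -> dict[str, str]:
    """Current environment plus the run settings and the given seed."""
    return {**os.environ, **BASE_ENV, "SEED": str(seed)}


def run_all(dry_run: bool = True) -> None:
    """Launch one training run per seed, or just print the commands."""
    for seed in SEEDS:
        if dry_run:
            print(f"SEED={seed}", " ".join(CMD))
        else:
            subprocess.run(CMD, env=run_env(seed), check=True)


run_all()  # dry run: prints the three commands without launching training
```

Calling `run_all(dry_run=False)` would start the runs one after another on the current machine.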

## Hardware
8× H100 80GB HBM3 (YSU HPC Cluster, YerevaNN/Eleveight AI Program)
@@ -0,0 +1,22 @@
{
  "val_bpb": 1.1180,
  "val_bpb_std": 0.000619,
  "val_bpb_exact": "1.11799350",
  "seeds": [1337, 42, 2025],
  "per_seed_val_bpb": [
    "1.11764880",
    "1.11891002",
    "1.11742168"
  ],
  "per_seed_artifact_bytes": [
    15873596,
    15893104,
    15908116
  ],
  "artifact_bytes_mean": 15891605,
  "hardware": "8xH100 SXM",
  "training_time_seconds": 600,
  "description": "BigramHash4096 + MoE MLP — first MoE exploration in repo",
  "author": "davie2009kh",
  "date": "2026-04-05"
}