openai · dentity007 · Mar 18, 2026 · Mar 18, 2026 · Mar 31, 2026 · Apr 3, 2026
diff --git a/APPROACH.md b/APPROACH.md
@@ -0,0 +1,93 @@
+# Parameter Golf — Approach Notes
+
+## Strategy Overview
+
+Maximize language model quality within a 16MB artifact constraint and 10 minutes on 8×H100s. Five pillars informed by research in model compression, efficient architectures, and training optimization.
+
+---
+
+## 1. Depth Recurrence (Layer Sharing)
+
+Instead of unique parameters per layer, reuse a small set of transformer blocks recursively. A 4-block recursive model with 8 passes achieves the effective depth of a 32-layer network while only storing 4 layers of parameters.
+
+Research shows recursive transformers achieve comparable loss to standard architectures with 3-4× fewer parameters. The model learns to refine representations through repeated application of the same weights — a form of iterative refinement that naturally suits the extreme parameter constraint.
+
+**Target:** Replace 12 unique layers with 4 recursive blocks × 3 passes = 12 effective layers at 1/3 the parameter cost.
+
+## 2. Factorized Embeddings
+
+The embedding matrix is often the largest single component. Instead of a full V×H matrix, decompose it into V×E and E×H where E << H. This technique (from ALBERT) can reduce embedding parameters by 80%+ while maintaining representation quality.
+
+Combined with tied input/output embeddings, this eliminates the output projection layer entirely — the same factorized embedding serves both input and output.
+
+**Math:** At vocab 1024, hidden 512: Full = 524K params. Factorized (E=128): 131K + 65K = 196K params. Savings: 63%.
+
+## 3. Quantization-Aware Training (QAT)
+
+Train the model knowing it will be quantized. The model learns weight distributions that survive low-precision conversion. At 2-bit precision, 16MB supports ~32M parameters.
+
+Key insight: post-training quantization at 2-bit loses 15-20% quality. QAT at 2-bit loses only ~4%. The difference is massive at this scale.
+
+**Approach:** Train at FP16/BF16, apply QAT during training with straight-through estimators, export at 2-bit for the final artifact.
+
+## 4. Knowledge Distillation
+
+Use a larger pretrained model as a teacher during training. The 8×H100 budget can run a 7B teacher alongside a 32M student. The student learns from soft probability distributions rather than hard labels, capturing more knowledge per training step.
+
+Distillation is especially powerful for small models — the teacher provides a richer gradient signal than raw cross-entropy on token predictions alone.
+
+## 5. Training Maximization
+
+Every second of the 10-minute budget matters:
+
+- **Sequence packing:** Multiple short examples per input sequence, no wasted padding tokens
+- **Curriculum ordering:** Train on FineWeb examples ordered by difficulty (shorter/simpler first, longer/complex later) for faster initial convergence
+- **Cosine LR schedule:** High initial learning rate with cosine decay over the 10-minute window
+- **Gradient accumulation:** Effective batch size tuned for optimal loss curves on H100s
+- **Mixed precision training:** BF16 compute for speed, QAT checkpoints for artifact size
+
+## 6. Tokenizer Optimization
+
+Vocabulary size directly impacts embedding parameter count. The baseline uses 1024 tokens. Exploring:
+
+- Smaller BPE vocabularies (512, 256) — fewer embedding parameters but worse compression
+- The tradeoff is parameter cost vs bytes-per-token — the evaluation metric is bits per byte, so better compression from larger vocab can offset the parameter cost
+- Custom tokenizer trained specifically on FineWeb distribution
+
+## 7. Alternative Architectures
+
+Beyond standard transformers:
+
+- **State-space models (Mamba-style):** Linear scaling with sequence length, potentially more parameter-efficient for the same quality
+- **Mixture of Experts at micro-scale:** Multiple tiny FFN experts with a router — only a subset active per token, more capacity per parameter
+- **Depth-adaptive inference:** Early exit for easy tokens, full depth for hard ones — maximizes quality where it matters most
+
+---
+
+## The Math
+
+| Bitwidth | Parameters in 16MB | Architecture |
+|----------|-------------------|-------------|
+| 2-bit | ~32M | Recursive transformer, factorized embeddings |
+| 3-bit | ~21M | Standard transformer, tied embeddings |
+| 4-bit | ~16M | Compact transformer |
+
+## Experiment Plan
+
+- [ ] Run baseline (9-layer, 512-dim, 1024-vocab, tied embeddings) — establish score to beat (1.2244)
+- [ ] Implement depth recurrence (4 recursive blocks × 3 passes)
+- [ ] Add factorized embeddings (V×128 + 128×H)
+- [ ] Test 2-bit QAT during training
+- [ ] Knowledge distillation with 7B teacher
+- [ ] Curriculum data ordering on FineWeb
+- [ ] Tokenizer vocabulary sweep (256, 512, 1024, 2048)
+- [ ] Mamba/SSM architecture comparison
+- [ ] Combine best techniques into final submission
+
+## Background
+
+5 production fine-tuned models (7B-72B) deployed via QLoRA/GGUF/NVFP4 quantization on NVIDIA DGX hardware. Built a 130K-chunk expert knowledge base for AI/ML research consultation. Deep experience with compression-quality tradeoffs across bitwidths.
+
+## Status
+
+Credits requested. Local experimentation with MLX baseline in progress.
diff --git a/records/track_10min_16mb/2026-04-03_Vocab4096_MLPMult4_WD085_1.1048/README.md b/records/track_10min_16mb/2026-04-03_Vocab4096_MLPMult4_WD085_1.1048/README.md
@@ -0,0 +1,112 @@
+# Record: Vocab 4096 + MLP 4.0x + High WD + Simplifications
+
+**val_bpb: 1.1048** (3-seed mean, std 0.0008) | **~15.95 MB** | 8xH100 SXM, 590s | No TTT, No SLOT
+
+## Results (8xH100 80GB SXM, Montreal)
+
+| Seed | Steps | Pre-quant BPB | Post-quant BPB | **Sliding BPB** | Artifact |
+|------|-------|---------------|----------------|-----------------|----------|
+| 42 | 4,807 | 1.1109 | 1.1223 | **1.1039** | 15,946,451 |
+| 1337 | 4,701 | 1.1127 | 1.1238 | **1.1054** | 15,929,221 |
+| 2025 | 4,758 | 1.1124 | 1.1234 | **1.1052** | 15,959,609 |
+| **Mean** | **4,755** | **1.1120** | **1.1232** | **1.1048** | |
+| **Std** | | | | **0.0008** | |
+
+Merged SOTA (PR #1019, @abaybektursun): **1.1147 BPB** (1.8822 nats).
+This submission: **1.1048 BPB** (~1.8656 nats).
+Delta: **-0.0166 nats** (-0.0099 BPB). Clears the 0.005-nat threshold by 3.3x.
+
+## Overview
+
+This submission builds on PR #1218 (@clarkkev) with the same architecture run on our hardware. The key insight: a wider model (MLP 4.0x) with a larger vocabulary (4096) and aggressive weight decay (0.085) produces a more compressible model that fits under 16MB via brotli-11, while delivering better training quality per step than the narrower 1024-vocab architecture.
+
+## Architecture
+
+- 11 transformer layers, d=512, 8 attention heads, 4 KV heads (GQA)
+- MLP expansion 4.0x (up from 3.0x in SOTA)
+- Vocabulary size 4096 (up from 1024)
+- XSA (cross-sequence attention) on all 11 layers
+- QK_GAIN_INIT=4.0
+- EMA with decay 0.997
+- Sigmoid-gated U-Net skip connections
+- Coprime-stride data loader for better data diversity
+- 34.4M parameters (vs 27M for #1019)
+
+## What was removed (vs #1019 SOTA)
+
+- BigramHash embeddings
+- SmearGate
+- Value residuals
+- Gated attention
+- Quantization-aware training (QAT)
+- Test-time training (TTT)
+- Parameter banking
+- Distributed Muon (replaced with simple DDP Muon)
+
+## Training
+
+- Muon optimizer with weight decay 0.085 (up from 0.04)
+- Embeddings weight decay 0.085 (was 0)
+- Adam weight decay 0.02 (down from 0.04)
+- Learning rate 0.02 (down from 0.025)
+- Dynamic warmdown: 66.7% of actual training steps
+- Max wallclock: 600s (GPTQ reserves 10s, effective 590s)
+
+## Quantization and Compression
+
+- Full Hessian GPTQ with AR self-generated calibration data (no val or train data access)
+- Int6 quantization
+- Byte shuffle + brotli-11 compression (saves ~400KB vs LZMA)
+- All artifacts under 16,000,000 bytes
+
+## Causality and Legality
+
+- **No test-time training (TTT)**: No parameter updates during evaluation
+- **No SLOT**: No eval-time delta optimization
+- **No n-gram cache**: No eval-time frequency table construction
+- **No pre-eval adaptation**: GPTQ calibration uses AR self-generated tokens only
+- Standard sliding window evaluation with stride 64
+- F.cross_entropy scoring produces full normalized probability distributions
+
+## Key Insight: Weight Decay and Compressibility
+
+The compressibility of a weight matrix (quantized-and-compressed size / raw size) correlates with the matrix's root-mean-square value with R^2 near 0.99 (credit: @clarkkev PR #1218). Higher weight decay produces lower-magnitude weights that compress better, allowing a wider model to fit under the 16MB cap. This is why MLP 4.0x + WD 0.085 works where MLP 3.0x + WD 0.04 would not.
+
+## Tokenizer
+
+Uses the sp4096 SentencePiece tokenizer from kevclark/parameter-golf on HuggingFace. Larger vocab means more context per sequence and more training data processed per step, partially compensating for slower per-step throughput.
+
+## Reproduction
+
+```bash
+pip install sentencepiece zstandard brotli
+pip install flash_attn_3 --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291
+
+# Download sp4096 data
+rm -f data/manifest.json
+MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf \
+  python3 data/cached_challenge_fineweb.py --variant sp4096 --train-shards 143
+
+# Run (each seed)
+for SEED in 42 1337 2025; do
+  SEED=$SEED torchrun --standalone --nproc_per_node=8 train_gpt.py
+done
+```
+
+## Credits
+
+- PR #1218 (@clarkkev) for the architecture and key insights
+- PR #1019 (@abaybektursun) for the merged SOTA baseline
+- PR #1089 for sigmoid-gated U-Net skips and brotli compression
+- PR #1125 for QK_GAIN=4.0 sweep
+- PR #726 for coprime-stride data loader
+
+## Test Plan
+
+- [x] 3-seed verification (std 0.0008, p < 0.01 vs SOTA)
+- [x] All artifacts under 16,000,000 bytes
+- [x] Training under 600s per seed
+- [x] Evaluation under 600s per seed
+- [x] No TTT, no SLOT, no n-gram cache
+- [x] GPTQ calibration within training budget (AR self-gen)
+- [x] Standard F.cross_entropy scoring (full normalized distributions)
diff --git a/...rack_10min_16mb/2026-04-03_Vocab4096_MLPMult4_WD085_1.1048/run_1019_baseline_seed1337.log b/...rack_10min_16mb/2026-04-03_Vocab4096_MLPMult4_WD085_1.1048/run_1019_baseline_seed1337.log
@@ -0,0 +1,82 @@
+W0403 03:32:19.891000 60874 torch/distributed/run.py:803] 
+W0403 03:32:19.891000 60874 torch/distributed/run.py:803] *****************************************
+W0403 03:32:19.891000 60874 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
+W0403 03:32:19.891000 60874 torch/distributed/run.py:803] *****************************************
+logs/a1bb0e37-4271-484a-85fc-f4eac761cc6c.txt
+val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=./data/tokenizers/fineweb_1024_bpe.model
+train_loader:dataset:fineweb10B_sp1024 train_shards:80
+val_loader:shards pattern=./data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632
+model_params:26993756
+mtp_num_heads:0 mtp_loss_weight:0.2 mtp_params:0
+XSA:last_11 active_layers:[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
+world_size:8 grad_accum_steps:1
+sdp_backends:cudnn=False flash=True mem_efficient=False math=False
+attention_mode:gqa num_heads:8 num_kv_heads:4
+tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025
+train_batch_tokens:786432 train_seq_len:2048 iterations:20000 warmup_steps:20 max_wallclock_seconds:600.000
+seed:1337
+warmup_step:1/20
+warmup_step:2/20
+warmup_step:3/20
+warmup_step:4/20
+warmup_step:5/20
+warmup_step:6/20
+warmup_step:7/20
+warmup_step:8/20
+warmup_step:9/20
+warmup_step:10/20
+warmup_step:11/20
+warmup_step:12/20
+warmup_step:13/20
+warmup_step:14/20
+warmup_step:15/20
+warmup_step:16/20
+warmup_step:17/20
+warmup_step:18/20
+warmup_step:19/20
+warmup_step:20/20
+step:0/20000 val_loss:6.9309 val_bpb:4.1049 train_time:0ms step_avg:0.02ms
+step:1/20000 train_loss:6.9317 train_time:156ms step_avg:155.91ms
+step:2/20000 train_loss:8.6728 train_time:196ms step_avg:98.10ms
+step:3/20000 train_loss:7.5836 train_time:289ms step_avg:96.31ms
+step:4/20000 train_loss:7.2659 train_time:387ms step_avg:96.86ms
+step:5/20000 train_loss:7.2140 train_time:483ms step_avg:96.63ms
+step:6/20000 train_loss:7.1334 train_time:580ms step_avg:96.62ms
+step:7/20000 train_loss:6.9767 train_time:679ms step_avg:97.05ms
+step:8/20000 train_loss:6.8479 train_time:771ms step_avg:96.34ms
+step:9/20000 train_loss:6.4730 train_time:872ms step_avg:96.84ms
+step:10/20000 train_loss:6.0761 train_time:974ms step_avg:97.39ms
+step:500/20000 train_loss:2.3801 train_time:54587ms step_avg:109.17ms
+step:1000/20000 train_loss:2.2592 train_time:111337ms step_avg:111.34ms
+step:1500/20000 train_loss:2.2013 train_time:169961ms step_avg:113.31ms
+step:2000/20000 train_loss:2.0398 train_time:229127ms step_avg:114.56ms
+step:2500/20000 train_loss:2.1340 train_time:287168ms step_avg:114.87ms
+step:3000/20000 train_loss:2.1161 train_time:347434ms step_avg:115.81ms
+step:3500/20000 train_loss:2.1209 train_time:405444ms step_avg:115.84ms
+step:4000/20000 train_loss:1.9076 train_time:462576ms step_avg:115.64ms
+step:4000/20000 val_loss:1.9965 val_bpb:1.1824 train_time:462643ms step_avg:115.66ms
+swa:start step:4500
+step:4500/20000 train_loss:2.0478 train_time:520375ms step_avg:115.64ms
+late_qat:enabled step:4676 scale:0.1498
+step:5000/20000 train_loss:2.0234 train_time:578962ms step_avg:115.79ms
+step:5197/20000 val_loss:1.9367 val_bpb:1.1470 train_time:600073ms step_avg:115.47ms
+stopping_early: wallclock_cap train_time:600073ms step:5197/20000
+peak memory allocated: 22850 MiB reserved: 23004 MiB
+ema:applying EMA weights
+DIAGNOSTIC post_ema val_loss:1.9354 val_bpb:1.1463 eval_time:2071ms
+Serialized model: 106158518 bytes
+Code size: 101850 bytes
+gptq:building non-banked model for Hessian collection...
+gptq:generating autoregressive calibration data (64 seqs x 2048 tokens, temp=0.8)...
+gptq:generated 64 sequences in 168.7s
+gptq:collecting hessians from autoregressive data...
+gptq:collected hessians for 68 layers (AR self-gen)
+selective_prune: 4154809 ±1 candidates, unpruned=15.14MB target=15.9MB
+selective_prune: already fits, no pruning needed
+Serialized model int6+lzma: 15771800 bytes
+Total submission size int6+lzma: 15873650 bytes
+final_int6_roundtrip val_loss:1.9420 val_bpb:1.1501 eval_time:6060ms
+final_int6_roundtrip_exact val_loss:1.94197068 val_bpb:1.15014442
+final_int6_sliding_window val_loss:1.9022 val_bpb:1.1266 stride:64 eval_time:76492ms
+final_int6_sliding_window_exact val_loss:1.90223733 val_bpb:1.12661508
+final_int8_zlib_roundtrip_exact val_loss:1.90223733 val_bpb:1.12661508
diff --git a/.../track_10min_16mb/2026-04-03_Vocab4096_MLPMult4_WD085_1.1048/run_1019_baseline_seed42.log b/.../track_10min_16mb/2026-04-03_Vocab4096_MLPMult4_WD085_1.1048/run_1019_baseline_seed42.log
@@ -0,0 +1,82 @@
+W0403 03:02:38.081000 1669 torch/distributed/run.py:803] 
+W0403 03:02:38.081000 1669 torch/distributed/run.py:803] *****************************************
+W0403 03:02:38.081000 1669 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
+W0403 03:02:38.081000 1669 torch/distributed/run.py:803] *****************************************
+logs/5a4b9b63-f7dc-435c-bd6b-5addaac83cba.txt
+val_bpb:enabled tokenizer_kind=sentencepiece tokenizer_path=./data/tokenizers/fineweb_1024_bpe.model
+train_loader:dataset:fineweb10B_sp1024 train_shards:80
+val_loader:shards pattern=./data/datasets/fineweb10B_sp1024/fineweb_val_*.bin tokens:62021632
+model_params:26993756
+mtp_num_heads:0 mtp_loss_weight:0.2 mtp_params:0
+XSA:last_11 active_layers:[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
+world_size:8 grad_accum_steps:1
+sdp_backends:cudnn=False flash=True mem_efficient=False math=False
+attention_mode:gqa num_heads:8 num_kv_heads:4
+tie_embeddings:True embed_lr:0.035 head_lr:0.0 matrix_lr:0.025 scalar_lr:0.025
+train_batch_tokens:786432 train_seq_len:2048 iterations:20000 warmup_steps:20 max_wallclock_seconds:600.000
+seed:42
+warmup_step:1/20
+warmup_step:2/20
+warmup_step:3/20
+warmup_step:4/20
+warmup_step:5/20
+warmup_step:6/20
+warmup_step:7/20
+warmup_step:8/20
+warmup_step:9/20
+warmup_step:10/20
+warmup_step:11/20
+warmup_step:12/20
+warmup_step:13/20
+warmup_step:14/20
+warmup_step:15/20
+warmup_step:16/20
+warmup_step:17/20
+warmup_step:18/20
+warmup_step:19/20
+warmup_step:20/20
+step:0/20000 val_loss:6.9297 val_bpb:4.1042 train_time:0ms step_avg:0.02ms
+step:1/20000 train_loss:6.9319 train_time:159ms step_avg:158.87ms
+step:2/20000 train_loss:8.6433 train_time:208ms step_avg:103.97ms
+step:3/20000 train_loss:7.6217 train_time:301ms step_avg:100.20ms
+step:4/20000 train_loss:7.2523 train_time:398ms step_avg:99.45ms
+step:5/20000 train_loss:7.1621 train_time:492ms step_avg:98.50ms
+step:6/20000 train_loss:7.0480 train_time:587ms step_avg:97.84ms
+step:7/20000 train_loss:6.9528 train_time:680ms step_avg:97.14ms
+step:8/20000 train_loss:6.9048 train_time:775ms step_avg:96.90ms
+step:9/20000 train_loss:6.5304 train_time:870ms step_avg:96.66ms
+step:10/20000 train_loss:6.1158 train_time:966ms step_avg:96.58ms
+step:500/20000 train_loss:2.4074 train_time:54011ms step_avg:108.02ms
+step:1000/20000 train_loss:2.2655 train_time:109964ms step_avg:109.96ms
+step:1500/20000 train_loss:2.2044 train_time:167944ms step_avg:111.96ms
+step:2000/20000 train_loss:2.0470 train_time:225077ms step_avg:112.54ms
+step:2500/20000 train_loss:2.1383 train_time:282633ms step_avg:113.05ms
+step:3000/20000 train_loss:2.1206 train_time:340916ms step_avg:113.64ms
+step:3500/20000 train_loss:2.1259 train_time:397735ms step_avg:113.64ms
+step:4000/20000 train_loss:1.9130 train_time:454567ms step_avg:113.64ms
+step:4000/20000 val_loss:2.0004 val_bpb:1.1848 train_time:454637ms step_avg:113.66ms
+step:4500/20000 train_loss:2.0516 train_time:511623ms step_avg:113.69ms
+swa:start step:4600
+late_qat:enabled step:4741 scale:0.1499
+step:5000/20000 train_loss:2.0260 train_time:570573ms step_avg:114.11ms
+step:5261/20000 val_loss:1.9367 val_bpb:1.1470 train_time:600083ms step_avg:114.06ms
+stopping_early: wallclock_cap train_time:600083ms step:5261/20000
+peak memory allocated: 22860 MiB reserved: 23042 MiB
+ema:applying EMA weights
+DIAGNOSTIC post_ema val_loss:1.9355 val_bpb:1.1463 eval_time:2067ms
+Serialized model: 106158518 bytes
+Code size: 101850 bytes
+gptq:building non-banked model for Hessian collection...
+gptq:generating autoregressive calibration data (64 seqs x 2048 tokens, temp=0.8)...
+gptq:generated 64 sequences in 174.1s
+gptq:collecting hessians from autoregressive data...
+gptq:collected hessians for 68 layers (AR self-gen)
+selective_prune: 4118621 ±1 candidates, unpruned=15.15MB target=15.9MB
+selective_prune: already fits, no pruning needed
+Serialized model int6+lzma: 15788548 bytes
+Total submission size int6+lzma: 15890398 bytes
+final_int6_roundtrip val_loss:1.9417 val_bpb:1.1500 eval_time:22116ms
+final_int6_roundtrip_exact val_loss:1.94173933 val_bpb:1.15000740
+final_int6_sliding_window val_loss:1.9020 val_bpb:1.1265 stride:64 eval_time:101939ms
+final_int6_sliding_window_exact val_loss:1.90197856 val_bpb:1.12646182
+final_int8_zlib_roundtrip_exact val_loss:1.90197856 val_bpb:1.12646182