Record: Vocab4096 + MLP4.0x + WD0.085 - val_bpb 1.1048 (3-seed mean) #1287

Open
dentity007 wants to merge 4 commits into openai:main from NathanMaine:record/vocab4096-mlp4x-wd085-1.1048

Conversation

@dentity007

Record: Vocab 4096 + MLP 4.0x + High WD + Simplifications

val_bpb: 1.1048 (3-seed mean, std 0.0008) | ~15.95 MB | 8xH100 SXM | No TTT, No SLOT

3-Seed Results

Seed    Steps   Pre-quant BPB   Sliding BPB   Artifact (bytes)
42      4,807   1.1109          1.1039        15,946,451
1337    4,701   1.1127          1.1054        15,929,221
2025    4,758   1.1124          1.1052        15,959,609
Mean    -       -               1.1048        (std 0.0008)
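
As a quick check, the mean/std line can be reproduced from the Sliding BPB column; the quoted 0.0008 matches the sample standard deviation (ddof=1):

import statistics

sliding_bpb = [1.1039, 1.1054, 1.1052]
mean = statistics.fmean(sliding_bpb)   # 1.1048
std = statistics.stdev(sliding_bpb)    # sample std (ddof=1), ~0.0008
print(f"mean={mean:.4f}  std={std:.4f}")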

Merged SOTA (PR #1019): 1.1147 BPB (1.8822 nats).
This submission: 1.1048 BPB (~1.8656 nats).
Delta: -0.0166 nats (-0.0099 BPB). Clears the 0.005-nat threshold by 3.3x.
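
A minimal numeric check of the conversion above. It assumes the quoted nats are a per-token loss, so nats/token = BPB * ln(2) * bytes/token; the bytes-per-token ratio is back-solved from the SOTA line, not stated in this PR:

import math

# Assumption: nats are per-token loss; the ratio below is implied by the
# SOTA numbers (1.8822 nats at 1.1147 BPB), not given explicitly.
bytes_per_token = 1.8822 / (1.1147 * math.log(2))   # ~2.436

this_nats = 1.1048 * math.log(2) * bytes_per_token  # ~1.8656
delta = 1.8822 - this_nats                          # ~0.0166 nats
print(f"delta = {delta:.4f} nats ({delta / 0.005:.1f}x the 0.005-nat bar)")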

Key Changes vs SOTA (#1019)

  • Vocab 4096 (up from 1024) with the sp4096 tokenizer from kevclark/parameter-golf
  • MLP 4.0x expansion (up from 3.0x)
  • Weight decay 0.085 on Muon and embeddings (discovery: weight RMS correlates with artifact compressibility at R^2 ~0.99)
  • Byte shuffle + brotli-11 compression, saving ~400KB vs LZMA (see the sketch after this list)
  • Dynamic warmdown over the final 66.7% of the actual step count
  • Removed: BigramHash, SmearGate, Value Residuals, Gated Attention, QAT, TTT
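
A minimal sketch of the shuffle-and-compress step referenced in the list above, assuming the quantized weights are serialized from a flat numpy array; byte_shuffle, compress_weights, and decompress_weights are illustrative names, not this PR's actual code:

import brotli
import numpy as np

def byte_shuffle(buf, itemsize):
    # Transpose byte planes: byte k of every element is grouped together.
    # Same-significance bytes are strongly correlated, which the entropy
    # coder exploits far better than interleaved element bytes.
    planes = np.frombuffer(buf, dtype=np.uint8).reshape(-1, itemsize)
    return planes.T.tobytes()

def compress_weights(w):
    shuffled = byte_shuffle(w.tobytes(), w.dtype.itemsize)
    return brotli.compress(shuffled, quality=11)  # brotli-11 = max quality

def decompress_weights(blob, dtype, shape):
    raw = np.frombuffer(brotli.decompress(blob), dtype=np.uint8)
    itemsize = np.dtype(dtype).itemsize
    unshuffled = raw.reshape(itemsize, -1).T.tobytes()
    return np.frombuffer(unshuffled, dtype=dtype).reshape(shape)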

Legality

  • No test-time training (TTT)
  • No SLOT (eval-time delta optimization)
  • No n-gram cache
  • GPTQ calibration uses AR self-generated tokens only
  • Standard F.cross_entropy with full normalized distributions
  • Single-pass sliding-window evaluation (stride=64); a minimal sketch follows this list
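
A minimal sketch of that evaluation loop, assuming a causal LM that maps token ids to logits; the sliding_window_bpb name, window size, and tensor handling are illustrative rather than this PR's exact code:

import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_bpb(model, tokens, num_bytes, window=1024, stride=64):
    # Each pass sees up to `window` tokens of context, but only the newest
    # `stride` targets contribute to the loss, so every token is scored once.
    total_nats = 0.0
    for begin in range(0, len(tokens) - 1, stride):
        start = max(0, begin + stride - window)
        chunk = torch.tensor(tokens[start : begin + stride + 1])
        x, y = chunk[:-1], chunk[1:]
        logits = model(x[None])[0]                    # (T, vocab)
        n_new = min(stride, len(tokens) - 1 - begin)  # targets scored this pass
        total_nats += F.cross_entropy(logits[-n_new:], y[-n_new:],
                                      reduction="sum").item()
    # bits-per-byte is measured over the raw bytes of the text, not tokens
    return total_nats / math.log(2) / num_bytes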

Architecture

11-layer transformer, d_model=512, GQA with 8 query heads / 4 KV heads, MLP 4.0x expansion, XSA on all layers, QK_GAIN=4.0, EMA 0.997 weight averaging, sigmoid-gated U-Net skips, coprime-stride data loader. 34.4M parameters.
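
A minimal sketch of the sigmoid-gated U-Net skips, assuming the usual pairing of first-half layer outputs with mirrored second-half layer inputs; the GatedSkip module and its gate initialization are assumptions, not this PR's exact code:

import torch
import torch.nn as nn

class GatedSkip(nn.Module):
    # Blend a stored early-layer activation into a later layer's input
    # through a learned per-channel sigmoid gate (U-Net-style pairing).
    def __init__(self, dim):
        super().__init__()
        # Assumption: gate starts near zero contribution (sigmoid(-2) ~ 0.12)
        # so training begins close to a plain residual stack.
        self.gate = nn.Parameter(torch.full((dim,), -2.0))

    def forward(self, x, skip):
        return x + torch.sigmoid(self.gate) * skip

# Usage in an 11-layer stack: save the outputs of the first-half layers and
# feed them, in reverse order, to the mirrored second-half layers, e.g.
#   x = skips[i](x, saved[num_layers - 1 - i])
# before running block i of the decoder half.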

Credits

PR #1218 (@clarkkev), PR #1019 (@abaybektursun), PR #1089, PR #1125, PR #726

Reproduction

pip install sentencepiece zstandard brotli
pip install flash_attn_3 --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291
rm -f data/manifest.json
MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf python3 data/cached_challenge_fineweb.py --variant sp4096 --train-shards 143
SEED=42 torchrun --standalone --nproc_per_node=8 train_gpt.py

Test Plan

  • 3 seeds verified (std 0.0008; improvement over merged SOTA significant at p < 0.01)
  • All artifacts under 16,000,000 bytes
  • Training under 600s, eval under 600s
  • No TTT, no SLOT, no n-gram cache
  • GPTQ calibration within training budget
