Record: Vocab4096 + MLP4.0x + WD0.085 - val_bpb 1.1048 (3-seed mean) #1287

Open
dentity007 wants to merge 4 commits into openai:main from NathanMaine:record/vocab4096-mlp4x-wd085-1.1048

Conversation

@dentity007

Record: Vocab 4096 + MLP 4.0x + High WD + Simplifications

val_bpb: 1.1048 (3-seed mean, std 0.0008) | ~15.95 MB | 8xH100 SXM | No TTT, No SLOT

3-Seed Results

Seed    Steps   Pre-quant BPB   Sliding BPB   Artifact (bytes)
42      4,807   1.1109          1.1039        15,946,451
1337    4,701   1.1127          1.1054        15,929,221
2025    4,758   1.1124          1.1052        15,959,609
Mean    -       -               1.1048        (std 0.0008)
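
As a quick check, the mean/std line can be reproduced from the Sliding BPB column; the quoted 0.0008 matches the sample standard deviation (ddof=1):

import statistics

sliding_bpb = [1.1039, 1.1054, 1.1052]
mean = statistics.fmean(sliding_bpb)   # 1.1048
std = statistics.stdev(sliding_bpb)    # sample std (ddof=1), ~0.0008
print(f"mean={mean:.4f}  std={std:.4f}")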

Merged SOTA (PR #1019): 1.1147 BPB (1.8822 nats).
This submission: 1.1048 BPB (~1.8656 nats).
Delta: -0.0166 nats (-0.0099 BPB). Clears the 0.005-nat threshold by 3.3x.
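
A minimal numeric check of the conversion above. It assumes the quoted nats are a per-token loss, so nats/token = BPB * ln(2) * bytes/token; the bytes-per-token ratio is back-solved from the SOTA line, not stated in this PR:

import math

# Assumption: nats are per-token loss; the ratio below is implied by the
# SOTA numbers (1.8822 nats at 1.1147 BPB), not given explicitly.
bytes_per_token = 1.8822 / (1.1147 * math.log(2))   # ~2.436

this_nats = 1.1048 * math.log(2) * bytes_per_token  # ~1.8656
delta = 1.8822 - this_nats                          # ~0.0166 nats
print(f"delta = {delta:.4f} nats ({delta / 0.005:.1f}x the 0.005-nat bar)")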

Key Changes vs SOTA (#1019)

  • Vocab 4096 (up from 1024) with the sp4096 tokenizer from kevclark/parameter-golf
  • MLP 4.0x expansion (up from 3.0x)
  • Weight decay 0.085 on Muon and embeddings (discovery: weight RMS correlates with artifact compressibility at R^2 ~0.99)
  • Byte shuffle + brotli-11 compression, saving ~400KB vs LZMA (see the sketch after this list)
  • Dynamic warmdown over the final 66.7% of the actual step count
  • Removed: BigramHash, SmearGate, Value Residuals, Gated Attention, QAT, TTT
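
A minimal sketch of the shuffle-and-compress step referenced in the list above, assuming the quantized weights are serialized from a flat numpy array; byte_shuffle, compress_weights, and decompress_weights are illustrative names, not this PR's actual code:

import brotli
import numpy as np

def byte_shuffle(buf, itemsize):
    # Transpose byte planes: byte k of every element is grouped together.
    # Same-significance bytes are strongly correlated, which the entropy
    # coder exploits far better than interleaved element bytes.
    planes = np.frombuffer(buf, dtype=np.uint8).reshape(-1, itemsize)
    return planes.T.tobytes()

def compress_weights(w):
    shuffled = byte_shuffle(w.tobytes(), w.dtype.itemsize)
    return brotli.compress(shuffled, quality=11)  # brotli-11 = max quality

def decompress_weights(blob, dtype, shape):
    raw = np.frombuffer(brotli.decompress(blob), dtype=np.uint8)
    itemsize = np.dtype(dtype).itemsize
    unshuffled = raw.reshape(itemsize, -1).T.tobytes()
    return np.frombuffer(unshuffled, dtype=dtype).reshape(shape)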

Legality

  • No test-time training (TTT)
  • No SLOT (eval-time delta optimization)
  • No n-gram cache
  • GPTQ calibration uses AR self-generated tokens only
  • Standard F.cross_entropy with full normalized distributions
  • Single-pass sliding-window evaluation (stride=64); a minimal sketch follows this list
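
A minimal sketch of that evaluation loop, assuming a causal LM that maps token ids to logits; the sliding_window_bpb name, window size, and tensor handling are illustrative rather than this PR's exact code:

import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_bpb(model, tokens, num_bytes, window=1024, stride=64):
    # Each pass sees up to `window` tokens of context, but only the newest
    # `stride` targets contribute to the loss, so every token is scored once.
    total_nats = 0.0
    for begin in range(0, len(tokens) - 1, stride):
        start = max(0, begin + stride - window)
        chunk = torch.tensor(tokens[start : begin + stride + 1])
        x, y = chunk[:-1], chunk[1:]
        logits = model(x[None])[0]                    # (T, vocab)
        n_new = min(stride, len(tokens) - 1 - begin)  # targets scored this pass
        total_nats += F.cross_entropy(logits[-n_new:], y[-n_new:],
                                      reduction="sum").item()
    # bits-per-byte is measured over the raw bytes of the text, not tokens
    return total_nats / math.log(2) / num_bytes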

Architecture

11-layer transformer, d_model=512, GQA with 8 query heads / 4 KV heads, MLP 4.0x expansion, XSA on all layers, QK_GAIN=4.0, EMA 0.997 weight averaging, sigmoid-gated U-Net skips, coprime-stride data loader. 34.4M parameters.
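
A minimal sketch of the sigmoid-gated U-Net skips, assuming the usual pairing of first-half layer outputs with mirrored second-half layer inputs; the GatedSkip module and its gate initialization are assumptions, not this PR's exact code:

import torch
import torch.nn as nn

class GatedSkip(nn.Module):
    # Blend a stored early-layer activation into a later layer's input
    # through a learned per-channel sigmoid gate (U-Net-style pairing).
    def __init__(self, dim):
        super().__init__()
        # Assumption: gate starts near zero contribution (sigmoid(-2) ~ 0.12)
        # so training begins close to a plain residual stack.
        self.gate = nn.Parameter(torch.full((dim,), -2.0))

    def forward(self, x, skip):
        return x + torch.sigmoid(self.gate) * skip

# Usage in an 11-layer stack: save the outputs of the first-half layers and
# feed them, in reverse order, to the mirrored second-half layers, e.g.
#   x = skips[i](x, saved[num_layers - 1 - i])
# before running block i of the decoder half.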

Credits

PR #1218 (@clarkkev), PR #1019 (@abaybektursun), PR #1089, PR #1125, PR #726

Reproduction

pip install sentencepiece zstandard brotli
pip install flash_attn_3 --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291
rm -f data/manifest.json
MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf python3 data/cached_challenge_fineweb.py --variant sp4096 --train-shards 143
SEED=42 torchrun --standalone --nproc_per_node=8 train_gpt.py

Test Plan

  • 3 seeds verified (std 0.0008; improvement over merged SOTA significant at p < 0.01)
  • All artifacts under 16,000,000 bytes
  • Training under 600s, eval under 600s
  • No TTT, no SLOT, no n-gram cache
  • GPTQ calibration within training budget
