
Non-Record: BPB 1.1334 — 7000-Step Training + Mixed Int6/Int8 Quantization + Legal TTT#598

Open
Christopher-Lee-McClendon wants to merge 1 commit into openai:main from Christopher-Lee-McClendon:submission/11L-gepa-mixed-quant-7k-legal-ttt

Conversation

@Christopher-Lee-McClendon
Contributor

@Christopher-Lee-McClendon Christopher-Lee-McClendon commented Mar 24, 2026

Non-Record: 4×A100-40GB

val_bpb = 1.1334 | Pre-TTT: 1.1476 | Artifact: 15.70 MB (headroom: 297 KB)


What Changed

Extended training to 7000 steps (from the typical 5200) with a longer warmdown cosine anneal (step 3500→7000), combined with mixed int6/int8 quantization to keep the 27M-parameter model under 16 MB. Legal score-first TTT (10 epochs SGD with momentum) yields a further −0.0142 BPB improvement.

| Metric | This Work | Prior VE128+RoPE (30ep TTT) |
|---|---|---|
| Pre-TTT BPB | 1.1476 | 1.1609 |
| Post-TTT BPB | 1.1334 | 1.1425 |
| Training steps | 7,000 | 5,200 |
| TTT epochs | 10 | 30 |
| Eval time | 2,194 s | 3,662 s |
| Artifact size | 15.70 MB | 15.65 MB |

The base model improvement (−0.0133 pre-TTT) comes from longer training plus the GEPA architecture. Fewer TTT epochs (10 vs 30) mean faster eval (40% less wall time) at the cost of a smaller TTT gain (−0.0142 vs −0.0184).
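The warmdown described above (full LR until step 3500, cosine anneal to zero by step 7000) can be sketched as a scale factor on the base learning rate. This is a minimal illustration of the schedule as stated in the PR text; the exact shape in train_gpt.py may differ.

```python
import math

def lr_scale(step, warmdown_start=3500, total_steps=7000):
    """Cosine warmdown: hold full LR until warmdown_start, then anneal
    to zero by total_steps (step boundaries taken from this PR's description)."""
    if step < warmdown_start:
        return 1.0
    frac = (step - warmdown_start) / (total_steps - warmdown_start)
    return 0.5 * (1.0 + math.cos(math.pi * frac))
```

At the midpoint of the warmdown (step 5250) the scale is 0.5, and it reaches 0 exactly at step 7000.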

Architecture: 11L GEPA

  • 11 unique layers (no depth recurrence), d=512, 8Q/4KV GQA heads
  • ReLU² (Star-ReLU) activation in 3× MLP (1536 hidden)
  • Cross-sequence attention (XSA) on last 4 layers
  • Exponential moving average (decay 0.997)
  • Bigram hash embeddings (2048 buckets, 128d)
  • Partial RoPE (16/64 dims) with YARN scaling
  • Value embeddings (128d) on layers 9–10
  • U-Net skip connections across layer pairs
  • LN depth scaling (1/√(layer+1))
  • Late QAT with GPTQ-lite clip search (5 candidates/row), enabled at step 6476
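Two of the simpler components in the list above can be written down directly; this is an illustrative sketch, not the train_gpt.py implementation.

```python
import math

def relu_squared(x: float) -> float:
    """ReLU² activation used in the MLP blocks: max(x, 0) squared."""
    return max(x, 0.0) ** 2

def ln_depth_scale(layer_idx: int) -> float:
    """LN depth scaling 1/sqrt(layer+1): deeper layers get smaller LN gain."""
    return 1.0 / math.sqrt(layer_idx + 1)
```

For an 11-layer stack, the LN scale decays from 1.0 at layer 0 to about 0.30 at layer 10.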

Mixed Quantization

Dual-scheme compression for the 27M-parameter model:

  • Int6 per-row (GPTQ-lite): attention projections + MLP weights (bulk of params)
  • Int8 per-tensor (scalar scale): layer norms, value embeddings, biases, embedding tables

27.5 MB payload → 15.63 MB after zstd-22 (3.89× compression) + 76 KB code = 15.70 MB total.
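A minimal sketch of the symmetric per-row int6 scheme (int6 range [-32, 31]), without the GPTQ-lite clip search over 5 candidate scales per row; function names are illustrative, not from the submission's code.

```python
def quantize_int6_per_row(row):
    """Symmetric per-row int6: scale so the row's max magnitude maps to 31."""
    amax = max(abs(v) for v in row) or 1.0
    scale = amax / 31.0
    q = [max(-32, min(31, round(v / scale))) for v in row]
    return q, scale

def dequantize(q, scale):
    """Recover approximate weights from quantized values and the row scale."""
    return [v * scale for v in q]
```

The int8 per-tensor path for norms, biases, and embedding tables is the same idea with a single scalar scale shared across the whole tensor.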

TTT Protocol (Legal Score-First)

SGD with momentum (0.9) at lr=0.002, 10 epochs per 32K-token chunk, stride=64, freezing first 2 blocks. Score-first: every token scored under torch.inference_mode() before any weight update.
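The score-first structure can be sketched as a chunk loop: each chunk is scored under weights adapted only on earlier chunks, and the last chunk gets no adaptation pass. `score_fn` and `adapt_fn` are hypothetical stand-ins for the real inference-mode eval and SGD steps.

```python
def score_first_ttt(chunks, score_fn, adapt_fn):
    """Score-first-per-chunk TTT: chunk i is scored before any update on it."""
    losses = []
    for i, chunk in enumerate(chunks):
        losses.append(score_fn(chunk))   # inference only: no grad, no update yet
        if i < len(chunks) - 1:          # last-chunk guard: no wasted adaptation
            adapt_fn(chunk)              # SGD-with-momentum pass on the scored chunk
    return losses
```

This ordering is what makes the protocol legal: no token's score ever depends on weights that were updated using that token.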

Limitations

  • Single seed (42) — no variance estimate. Acceptable for non-record but results may vary across seeds.
  • No ablation of individual GEPA components. The architecture combines multiple techniques without isolating their contributions.
  • No LeakyReLU: PR #537 (Non-Record: BPB 1.13872 — LeakyReLU(0.5)² + Per-Layer LR Legal TTT, 3 seeds) showed that LeakyReLU(0.5)² helps (−0.0035 BPB). This submission uses standard ReLU² instead; combining the two is an obvious next step.

Credits

And all contributors to the parameter-golf competition.

… BPB)

- Non-record submission: 1.1334 BPB, 15.70 MB artifact (4×A100-40GB)
- Mixed quantization: int6 per-row for MLP/attn, int8 per-tensor for rest
- 7000 training steps (vs 5200 baseline) with GEPA architecture
- Legal score-first TTT: SGD 10 epochs, -0.0142 BPB gain
- Beats prior non-record best (1.1425) by 0.0091 BPB
@MatoTeziTanka

Community Review — Non-Record: BPB 1.1334 — 7000-Step Training + Mixed Int6/Int8 Quantization + Legal TTT

BPB: 1.1334 | Compliance: LOOKS CLEAN — score-first-per-chunk TTT (legal #1416/#1423 pattern)

What I found in the code (head SHA f94b154a740f, file records/track_non_record_16mb/2026-03-24_11L_GEPA_MixedQuant_7kSteps_LegalTTT/train_gpt.py):

The TTT path at line 399 implements the score-first-per-chunk pattern: each chunk is scored under torch.no_grad() / inference_mode() before the base_model.train() + SGD adaptation runs on that same chunk, with an is_last_chunk guard so the final chunk gets no adaptation pass. This is the structural shape the legal frontier uses (PRs #1416 erichroepke, #1423 aryanbhosale).

Per Issue #402 and Issue #677, TTT is legal when each token is scored before the adapter updates on it, and that's what the code does here — chunk ci is scored under weights adapted only on chunks 0..ci-1. No prequant_ttt_adapt_adamw(val_tokens, ...) multi-epoch fine-tune, no scored-region SLOT, no target-in-key n-gram cache.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.06s, dim=512, layers=11, vocab=1024, code=77796 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending standard checks (3-seed validation, 16MB artifact cap, 10-min wallclock on 8×H100 SXM). The compliance picture matches the legal reference frontier and no flags were raised by the classification pass.

Auto-classification caveat: this review was drafted by the AST-based classifier against a template derived from manually-reviewed cluster PRs (#1420, #1450, #1487, #1541, #1529, #1533, #1518). If I've misread a subtlety in your eval path — e.g., multi-epoch TTT that I mistook for single-pass, or a target-in-key lookup I missed in a helper function — please flag it and I'll re-run the audit manually.


Reviewed by @MatoTeziTanka (The Agora). Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting — if the template misread your code, please call it out so I can iterate the classifier.

