Record: SP4096 + Depth Recurrence + Parallel Residuals + MuonEq-R + QK-Gain 5.0

aryanbhosale · 2026-04-03T11:26:19Z

val_bpb = 1.0897 (3-seed mean, std 0.0003) | ~15.99 MB | 8×H100 SXM

3-Seed Results

Seed	Sliding BPB	Artifact
42	1.0894	15,999,165
314	1.0898	15,997,318
999	1.0899	15,990,607
Mean	1.0897

Merged SOTA (PR #1019): 1.1147 BPB. Delta: −0.0250 BPB.

Key Techniques

4096-Vocab + MLP 4x + WD 0.090 — PR Record: 4096-Vocab + 4.0-MLP-mult + 0.085-WD + Simplifications — val_bpb 1.09785 (3-seed mean) #1218 @clarkkev, PR Record: MuonEq-R + Depth Recurrence + WD=0.090 + All-Int6 GPTQ — val_bpb 1.0912 (3-seed mean) #1285 @dexhunter
Depth Recurrence (layers 4,5) — PR Record: ParallelResiduals + MiniDepthRecurrence, 1.1063 BPB / 1.8679 nats, -0.0072 vs PR #1179, -0.0143 vs merged SOTA #1204 @msisovic, PR Record: MuonEq-R + Depth Recurrence + Mixed Int5/Int6 GPTQ — val_bpb 1.0929 (3-seed mean) #1260 @dexhunter
Parallel Residuals (from layer 7) — PR Record: ParallelResiduals + MiniDepthRecurrence, 1.1063 BPB / 1.8679 nats, -0.0072 vs PR #1179, -0.0143 vs merged SOTA #1204 @msisovic, PR Record: PROTEUS v1.6 — Scylla + Parallel Residuals + Depth Recurrence + Legal TTT — val_bpb 1.0819 (3-seed mean) #1289 @MatoTeziTanka
MuonEq-R — arXiv:2603.28254, PR Record: MuonEq-R + Depth Recurrence + Mixed Int5/Int6 GPTQ — val_bpb 1.0929 (3-seed mean) #1260 @dexhunter
QK-Gain 5.0 — PR Non Record: MuonEq-R + Context-Only SLOT + QK_GAIN=5.0 — val_bpb 1.1027 (3-seed mean) #1217 @bigbag
Full GPTQ int6 + Brotli + Compressed Wrapper — LZMA self-extracting (~25KB code)

Compliance

No TTT, no SLOT, no n-gram cache, no eval-time adaptation
GPTQ calibration within training budget
All four conditions from Issue A Field Guide to Valid Submissions #1017 satisfied

Reproduction

pip install brotli
MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf python3 data/cached_challenge_fineweb.py --variant sp4096 --skip-manifest
SEED=42 RECUR_LAYERS=4,5 RECUR_START_STEP=3000 PARALLEL_START_LAYER=7 \
torchrun --standalone --nproc_per_node=8 train_gpt.py

Credits

…0940 (3-seed mean) 4096-vocab + MLP 4x + WD 0.090 + depth recurrence (layers 4,5) + MuonEq-R + full GPTQ int6 + brotli + selective pruning. 3-seed mean: 1.0940 BPB, beating merged SOTA (PR openai#1019, 1.1147 BPB) by 0.0208 BPB.

…d mean) LZMA self-extracting code wrapper (24KB vs 81KB) frees 57KB for model precision. No pruning needed. 3-seed mean improves from 1.0940 to 1.0926.

Added parallel residuals from layer 7+ (separate attn/MLP lanes). 3-seed mean improves from 1.0926 to 1.0904.

QK-Gain from 4.0 to 5.0 plus parallel residuals and depth recurrence. 3-seed mean: 1.0897 BPB (std 0.0003), delta -0.0250 vs merged SOTA.

Port depth recurrence from PR openai#1290 and parallel residuals from PR openai#1296. - Depth recurrence: layers 3,4 repeated in forward pass via virtual layer mapping - Parallel residuals: attn+mlp computed in parallel from layer 6 onward - Configurable via RECUR_LAYERS, RECUR_START_STEP, PARALLEL_START_LAYER env vars

Ports parallel residuals from PR openai#1296 to openai#1290 base: - Block.__init__ accepts parallel flag - Block.forward() computes attn+mlp in parallel when parallel=True - GPT.__init__ passes parallel_start_layer to Block constructors - Layers 7-10 run parallel, layers 0-6 sequential (default PARALLEL_START_LAYER=7) - Both base_model and eval_model wired up

- QK_GAIN_INIT: 1.5 -> 5.0 (matches openai#1296 proven config) - WARMDOWN_ITERS: already 4000 (matches openai#1290 run command) - MULTIRES_ENABLED: 1 -> 0 (multi-res failed: only 1.13x speedup) - BIGRAM: revert to 2048x128 (3072x112 exceeded 16MB artifact limit)

Decompressed PR openai#1296 codebase (SP4096 + depth recurrence + MuonEq-R + parallel residuals + QK5 + GPTQ + brotli, 1.0897 BPB). Plan: port Helix cross-injection onto their architecture, add loop-aware GPTQ. Their depth recurrence + our helix quant shielding = novel combo. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

train_gpt_base.py — untouched PR openai#1296 decompressed source train_gpt_helix.py — same + Helix crawler block, cross-injection, merge gate 4 test arms: base, helix dim=64, helix dim=192, helix without recurrence. Tests whether Helix improves the field's SOTA recursion approach. Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

MatoTeziTanka · 2026-04-11T20:04:16Z

Community Review — Record: SP4096 + Depth Recurrence + Parallel Residuals + MuonEq-R + Full GPTQ — val_bpb 1.0904 (3-seed mean)

BPB: 1.0904 | Compliance: LOOKS CLEAN — score-first-per-chunk TTT (legal #1416/#1423 pattern)

What I found in the code (head SHA acdf503bc32e, file records/track_10min_16mb/2026-04-03_SP4096_DepthRecurrence_MuonEqR_GPTQ/train_gpt.py):

The TTT path at line 1521 implements the score-first-per-chunk pattern: each chunk is scored under torch.no_grad() / inference_mode() before the base_model.train() + SGD adaptation runs on that same chunk, with an is_last_chunk guard so the final chunk gets no adaptation pass. This is the structural shape the legal frontier uses (PRs #1416 erichroepke, #1423 aryanbhosale).

Per Issue #402 and Issue #677, TTT is legal when each token is scored before the adapter updates on it, and that's what the code does here — chunk ci is scored under weights adapted only on chunks 0..ci-1. No prequant_ttt_adapt_adamw(val_tokens, ...) multi-epoch fine-tune, no scored-region SLOT, no target-in-key n-gram cache.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 5.18s, dim=512, layers=11, vocab=4096, code=24584 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending standard checks (3-seed validation, 16MB artifact cap, 10-min wallclock on 8×H100 SXM). The compliance picture matches the legal reference frontier and no flags were raised by the classification pass.

Auto-classification caveat: this review was drafted by the AST-based classifier against a template derived from manually-reviewed cluster PRs (#1420, #1450, #1487, #1541, #1529, #1533, #1518). If I've misread a subtlety in your eval path — e.g., multi-epoch TTT that I mistook for single-pass, or a target-in-key lookup I missed in a helper function — please flag it and I'll re-run the audit manually.

Reviewed by @MatoTeziTanka — The Agora. CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 5.18s, dim=512, layers=11, vocab=4096, code=24584 B, SMOKE_TEST_PASS. Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting — if the template misread your code, please call it out so I can iterate the classifier.

aryanbhosale added 2 commits April 3, 2026 16:55

Update: compressed wrapper + improved results — val_bpb 1.0926 (3-see…

d2388de

…d mean) LZMA self-extracting code wrapper (24KB vs 81KB) frees 57KB for model precision. No pruning needed. 3-seed mean improves from 1.0940 to 1.0926.

aryanbhosale changed the title ~~Record: SP4096 + Depth Recurrence + MuonEq-R + Full GPTQ — val_bpb 1.0940 (3-seed mean)~~ Record: SP4096 + Depth Recurrence + MuonEq-R + Full GPTQ — val_bpb 1.0926 (3-seed mean) Apr 3, 2026

Update: add parallel residuals — val_bpb 1.0904 (3-seed mean)

02340cd

Added parallel residuals from layer 7+ (separate attn/MLP lanes). 3-seed mean improves from 1.0926 to 1.0904.

aryanbhosale changed the title ~~Record: SP4096 + Depth Recurrence + MuonEq-R + Full GPTQ — val_bpb 1.0926 (3-seed mean)~~ Record: SP4096 + Depth Recurrence + Parallel Residuals + MuonEq-R + Full GPTQ — val_bpb 1.0904 (3-seed mean) Apr 3, 2026

Update: QK-Gain 5.0 + TTT fix — val_bpb 1.0897 (3-seed mean)

acdf503

QK-Gain from 4.0 to 5.0 plus parallel residuals and depth recurrence. 3-seed mean: 1.0897 BPB (std 0.0003), delta -0.0250 vs merged SOTA.

aryanbhosale changed the title ~~Record: SP4096 + Depth Recurrence + Parallel Residuals + MuonEq-R + Full GPTQ — val_bpb 1.0904 (3-seed mean)~~ Record: SP4096 + Depth Recurrence + Parallel Residuals + MuonEq-R + QK-Gain 5.0 — val_bpb 1.0897 (3-seed mean) Apr 3, 2026

aryanbhosale mentioned this pull request Apr 3, 2026

Parameter Golf Formerly Live AI Commentary ⛳ + Analysis / Ideas | every 10 minutes. Now disabled #140

Closed

himanshudongre mentioned this pull request Apr 4, 2026

Non-Record: TTT and GPTQ Are Fundamentally Incompatible — Quantized Weight Structure Defeats Test-Time Adaptation #1341

Open

resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 5, 2026

Re-apply CompTrain microbenchmark on new base (openai#1296 SP4096)

5357f7f

newjordan mentioned this pull request Apr 11, 2026

Recursive Transformer - Non-Record Submission — 1.07424983 val_bpb (4h depth-recurrent hybrid transformer run) #1535

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Record: SP4096 + Depth Recurrence + Parallel Residuals + MuonEq-R + QK-Gain 5.0 — val_bpb 1.0897 (3-seed mean)#1296