
Record: SP8192 + QK-Gain 5 + Legal Score-First TTT — val_bpb 1.08279 (3-seed mean) #1413

Merged
cocohearts merged 1 commit into openai:main from dexhunter:record/sp8192-qk5-legal-ttt-1.08279 on Apr 9, 2026

Conversation

@dexhunter
Contributor

Summary

On top of PR #1394 (@clarkkev) — the current clean sp8192 benchmark — this submission adds a single-knob QK_GAIN_INIT=5.0 and a legal score-first TTT eval pass (TTT_LR=0.005, 3 epochs, freeze=0).

  • val_bpb: 1.08279 (3-seed mean across seeds 0/42/1234) — 0.00731 nats/token below PR #1394 (SP8192 + GPTQ Embeddings + Depth Recurrence + MuonEq-R + SDClip, val_bpb 1.08563, 5-seed mean), clearing the 0.005-nat record threshold by 0.00231 nats.
  • All 3 seed artifacts fit under the 16 MB cap (margins 7,454–10,942 bytes)
  • Training 588 s/seed, eval 381–392 s/seed (both well under the 600 s budgets)
  • No SLOT, no pre-quant TTT, no ETLB, no n-gram cache, no tokenizer change. Score-first TTT follows the PR #549 precedent — every chunk is scored under inference_mode() before any parameter update.

Hardware: 8×H100 80GB SXM, PyTorch 2.9.1+cu128. See the README in the new folder for the full two-table results + diagnostics layout per repo SUBMISSION_GUIDE.md.
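The effect of the single QK_GAIN_INIT=5.0 knob can be illustrated with a toy calculation: a scalar gain multiplying the query-key logits sharpens the attention distribution at initialization. This is a minimal sketch with made-up scores; where exactly the gain is applied inside train_gpt.py (per-head, pre-softmax) is an assumption, and the helper names are illustrative.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attn_weights(scores, d_head, gain):
    # Assumed placement: gain scales the pre-softmax logits,
    # gain * (q . k) / sqrt(d_head). QK_GAIN_INIT=5.0 starts this
    # learnable scalar at 5 instead of the conventional 1.
    return softmax([gain * s / math.sqrt(d_head) for s in scores])

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

scores = [2.0, 1.0, 0.5, 0.1]                 # toy q.k dot products
soft = attn_weights(scores, d_head=64, gain=1.0)
sharp = attn_weights(scores, d_head=64, gain=5.0)
# A larger initial gain yields a lower-entropy (sharper) distribution.
assert entropy(sharp) < entropy(soft)
```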

Per-seed (post-TTT)

Seed   Pre-TTT sliding bpb   Post-TTT bpb   Δ TTT      Artifact (bytes)   Train (ms)   Eval (ms)
0      1.08397               1.08210        −0.00187   15,991,018         588,004      385,050
42     1.08470               1.08315        −0.00155   15,992,546         588,009      381,500
1234   1.08590               1.08314        −0.00276   15,989,058         588,000      386,880
mean   1.08486               1.08279        −0.00206   15,990,874         588,004      384,477

Lineage / change from PR #1394

Compliance (Issue #1017 four conditions)

  • Condition 1 (Causality): Strict left-to-right causal model. Sliding eval never references future tokens.
  • Condition 2 (Normalized distribution): Standard softmax over full vocab. No logit biasing, no BigramHash, no two-pass.
  • Condition 3 (Score before update): Every TTT chunk is scored under torch.inference_mode() BEFORE any parameter update. Training on a chunk only happens AFTER its scoring has been accumulated into loss_sum. Matches the PR #549 (LeakyReLU² + Legal Score-First TTT + Parallel Muon, val_bpb 1.1194, 3-seed mean) pattern.
  • Condition 4 (Single pass): Each token is scored exactly once.
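The score-before-update ordering (Condition 3) and single-pass scoring (Condition 4) can be sketched with a toy scalar "model" standing in for the real network; the chunking, loss, and optimizer here are deliberately simplified assumptions, not the submission's code.

```python
# Score-first TTT ordering: each chunk is scored with the CURRENT
# parameters (the real code wraps this in torch.inference_mode()),
# and only AFTER scoring may that chunk drive a parameter update.
chunks = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
w = 0.0                      # stand-in for model parameters
lr = 0.1
loss_sum, n_scored = 0.0, 0
events = []

for chunk in chunks:
    # 1) Score under pre-update parameters; accumulate into loss_sum.
    for x in chunk:
        loss_sum += (w - x) ** 2
        n_scored += 1
    events.append("score")
    # 2) Adapt on the already-scored chunk (toy SGD on squared error).
    grad = sum(2 * (w - x) for x in chunk) / len(chunk)
    w -= lr * grad
    events.append("update")

# Condition 3: every score event precedes the update on the same chunk.
assert events == ["score", "update"] * len(chunks)
# Condition 4: each token is scored exactly once.
assert n_scored == sum(len(c) for c in chunks)
```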

Additional flags:

Reproduction

export NCCL_NET=Socket
export QK_GAIN_INIT=5.0
export TTT_ENABLED=1
export TTT_LR=0.005
export TTT_EPOCHS=3
for SEED in 0 42 1234; do
    SEED=$SEED torchrun --standalone --nproc_per_node=8 train_gpt.py
done

Credits

Files

Only adds records/track_10min_16mb/2026-04-06_SP8192_QK5_LegalTTT_1.0828/ with README, submission.json, train_gpt.py, and 3 seed logs.

…(3-seed mean)

On PR openai#1394 (@clarkkev): added single-knob QK_GAIN_INIT=5.0 and a legal
score-first TTT eval pass (TTT_LR=0.005, epochs=3, freeze=0) on top of the
clean sp8192 base. Three independent seeds (0, 42, 1234) on 8xH100 SXM, all
fitting 16MB with 7-11K margin.

Per-seed (post-TTT):
- seed 0   : 1.08210 (val_loss 2.79517)
- seed 42  : 1.08315 (val_loss 2.79788)
- seed 1234: 1.08314 (val_loss 2.79785)
- mean     : 1.08279 (2.79697 nats per token)

Improvement vs PR openai#1394 (1.08563 mean): -0.00284 bpb = -0.00731 nats/token,
clearing the 0.005 nats record threshold by 0.00231 nats per seed.

No SLOT, no pre-quant TTT, no ETLB, no n-gram cache, no tokenizer change.
Score-first TTT matches PR openai#549 precedent: every chunk scored under
inference_mode() before any parameter update.
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 6, 2026
Changes:
- num_loops: 2 -> 3, enable_looping_at: 0.5 -> 0.35
- Add score-first TTT eval (ported from PR openai#1413)
- Novel twist: ttt_loop_only=1 freezes all except blocks 4-5
- TTT config: LR=0.005, epochs=3, SGD, chunk_tokens=32768

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 6, 2026
…m tilt, SP8192 primary path

- PR openai#771 confirmed CLOSED/REJECTED (train-then-score AdamW TTT)
- PR openai#727 confirmed CLOSED (illegal n-gram hash cache)
- Merged SOTA unchanged at 1.1147
- New primary target: PR openai#1420 (abaybektursun, 1.08014):
  SP8192 + Triple Loop (3×, 17 virtual layers) + N-gram Tilt (legal,
  properly normalized, -0.0029 bpb) + Fused Kernels (+127 steps)
- PR openai#1413 (1.08279): confirms legal score-first TTT adds -0.003 bpb
- ETLB (-0.0019 bpb) noted as unruled — await @valerio-oai
- Strategy updated to v10.0: SP8192 + Triple Loop replaces SP4096 + 2×

https://claude.ai/code/session_01TbdBLJPXpbK5wGHpLAQ9x4
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 7, 2026
Novel: TTT adapts ONLY scalar/control parameters (q_gain, attn_scale,
mlp_scale, resid_mix, RMSNorm weights, skip_weights, skip_gates).
Matrix weights (c_q/c_k/c_v/proj/MLP/tok_emb) stay frozen.

This is mechanistically different from full-model TTT (openai#1413, openai#537):
the model retunes its existing control knobs rather than learning
new weight directions. Higher LR (0.01) since scalars need bigger steps.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
dexhunter added a commit to dexhunter/parameter-golf that referenced this pull request Apr 7, 2026
…am Tilt — val_bpb 1.07800 (3-seed mean)

3-lever stack on top of PR openai#1394 sp8192 baseline:
- Parallel Residuals on layers 7-10 (PR openai#1412 by @Robby955)
- 3-layer depth recurrence (LOOP_START=3 LOOP_END=5, extends PR openai#1394's 2-layer recurrence)
- Eval-time causal n-gram tilt (PR openai#1420 by @abaybektursun, lineage PR openai#1145 by @AnirudhRahul)

Plus our existing PR openai#1413 stack: QK_GAIN_INIT=5, score-first legal TTT (LR=0.005, epochs=3).

Results (3-seed mean, 8xH100 SXM):
- val_bpb 1.07800 (std 0.00053)
- val_loss 2.78457 nats per token
- Beats PR openai#1394 (1.08563) by 0.01971 nats per token
- Beats PR openai#1420 (1.08014) by 0.00553 nats per token
- Beats own PR openai#1413 (1.08279) by 0.01237 nats per token

All four issue openai#1017 conditions verified for the n-gram tilt path: prefix-only
hash construction, full-vocab renormalized one-token tilt, score-before-update
ordering inside the C++ kernel, single left-to-right pass.

C++ n-gram kernel ported from PR openai#1420 with the nanobind dependency removed
(extern "C" shim + ctypes loader, single g++ -shared invocation at runtime).

5-seed re-verification via the shipped mini wrapper is in progress; this PR
will be updated with the final 5-seed mean once s1337 and s2025 land.
dexhunter added a commit to dexhunter/parameter-golf that referenced this pull request Apr 7, 2026
Adds s1337 (1.07801) and s2025 (1.07862) via the shipped mini wrapper.
The 5-seed mean is +0.00013 worse than the initial 3-seed mean (1.07800)
which is well within the std (~0.00046). Margins vs the legal open
chronology are unchanged in direction:

- vs PR openai#1394 (1.08563): -0.01938 nats per token (margin +0.01438 over 0.005 bar)
- vs PR openai#1420 (1.08014): -0.00520 nats per token (margin +0.00020 over 0.005 bar)
- vs own PR openai#1413 (1.08279): -0.01205 nats per token

3 of 5 seeds (s42, s1337, s2025) are now mini-wrapper-verified for fit;
s0 and s1234 mini-wrapper re-runs still in progress.
dexhunter added a commit to dexhunter/parameter-golf that referenced this pull request Apr 7, 2026
All 5 seeds (s0, s42, s1234, s1337, s2025) re-run via the shipped mini wrapper.
The mean improves slightly from the prior mixed-source 1.07813 to 1.07807
because s1234 produced a noticeably lower TTT under the mini wrapper
(1.07813 mini vs 1.07848 raw, -0.00035 — within float64 reordering noise but
the largest single-seed drift in the verification set).

All 5 artifact sizes are direct from the mini-wrapper runs (NOT projections):
- s0:    15,992,304 bytes (7,696 byte headroom)
- s42:   15,993,733 bytes (6,267 byte headroom)
- s1234: 15,990,539 bytes (9,461 byte headroom)
- s1337: 15,988,039 bytes (11,961 byte headroom)
- s2025: 15,992,215 bytes (7,785 byte headroom)

Margins vs the legal open chronology:
- vs PR openai#1394 (1.08563): -0.01952 nats per token (margin +0.01452 over 0.005 bar)
- vs PR openai#1420 (1.08014): -0.00534 nats per token (margin +0.00034 over 0.005 bar)
- vs own PR openai#1413 (1.08279): -0.01218 nats per token

All four issue openai#1017 conditions remain verified for the n-gram tilt path.
dexhunter added a commit to dexhunter/parameter-golf that referenced this pull request Apr 7, 2026
…-only experts

The original n-gram tilt kernel inherited from PR openai#1420 had a causality bug:
within_hint() and word_hint() in fused_expert_kernel.cpp::get_hints_batch
gated their emission on is_bnd[tokens_[p]] / is_ws[tokens_[p]] (target token
metadata at the position being scored), leaking 1-2 bits about the answer
per scored position. This is an Issue openai#1017 condition 2 violation.

PR openai#1420 has the identical bug. @abaybektursun has acknowledged it in PR
openai#1420's thread and proposed the same fix that's applied here:

  * fused_expert_kernel.cpp: derive is_bnd / is_ws from tokens_[p-1] (last
    prefix token) for hint gating. Updates use the actual current tok via
    new tok_is_bnd / tok_is_ws variables so within_update / word_update
    still segment words correctly. Variable naming and structure copied
    verbatim from PR openai#1420's fix.
  * Run command updated to set NGRAM_WITHIN_BETA=0 NGRAM_WORD_BETA=0.
    Empirically the within / word experts under prefix-only gating fire
    for the wrong positions (within fires for word-starts, word fires for
    mid-word) and contribute *negative* BPB. Disabling them gives 1.07951
    on s42 vs 1.08108 with the experts active — token_hint is the only
    legitimate contributor.

5-seed verification (all on the patched kernel):

    seed   pre-fix   corrected  delta
    0      1.07751   1.08035    +0.00284
    42     1.07809   1.08097    +0.00288
    1234   1.07813   1.08127    +0.00314
    1337   1.07801   1.08060    +0.00259
    2025   1.07862   1.08135    +0.00273
    mean   1.07807   1.08091    +0.00284

All 5 artifacts fit under 16 MB (15,988,802 - 15,995,572 bytes; 4.4-11.2 KB
headroom). Pre-fix per-seed values preserved in submission.json under
seed_results_pre_fix for the public record.

Bar comparisons (corrected mean 1.08091):

    PR openai#1394 (1.08563): beats by +0.00472, fails 0.005 nat record bar
    PR openai#1413 ours (1.08279): beats by +0.00188, fails record bar
    PR openai#1420 (1.08014): we lose by 0.00077 (PR openai#1420 also tainted by the
                        same bug; would correct to ~1.08300 post-fix)

This PR is left open as a transparency / diagnostic record, NOT as a record
claim. PR openai#1413 (no n-gram tilt at all) at 1.08279 remains our cleanest
legal anchor. The README has been retitled "Diagnostic (causal-corrected)"
and the legality fix is documented in a dedicated section.
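The prefix-only gating fix described in that commit can be sketched in illustrative Python (the actual patch lives in fused_expert_kernel.cpp::get_hints_batch; the data structures here are toy assumptions): the gate must depend only on the last prefix token, never on the target token being scored.

```python
# Causal hint gating: metadata of tokens[p-1] (last prefix token) decides
# whether a hint fires at position p. The buggy version gated on
# is_bnd[tokens[p]] — the answer token — leaking target metadata.
def hint_gate(tokens, p, is_bnd):
    return is_bnd[tokens[p - 1]]   # prefix-only: no peek at tokens[p]

is_bnd = {0: False, 1: True, 2: False}   # toy boundary-token metadata
tokens_a = [2, 1, 0]    # prefix [2, 1], target token 0
tokens_b = [2, 1, 2]    # same prefix, different target token
# Changing only the target token cannot change the gate — i.e. the gate
# carries zero bits about the token being predicted.
assert hint_gate(tokens_a, 2, is_bnd) == hint_gate(tokens_b, 2, is_bnd)
```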
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 7, 2026
Novel mechanism: zero-initialized nn.Embedding(4096, 512) created at
eval time, trained exclusively through the standard score-first TTT loop.
Learns document-local bigram patterns without modifying any artifact weights.

Hash: h = (prev_token * 2039 + curr_token) % 4096
Injection: tok_emb(x) + eval_hash_emb(h), before RMSNorm
Compliance: same score-first pattern as openai#549/openai#1413 TTT precedent.
Precedent for eval-time params: LoRA-TTT (openai#1254, openai#1354).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
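The hash-embedding mechanism in that commit can be sketched as follows, using the dimensions it states (4096 buckets, 512-dim, zero-initialized); the helper names and list-based table are illustrative, not the submission's actual code.

```python
# Eval-time bigram hash embedding: a zero-initialized table trained only
# through the score-first TTT loop, added to the token embedding.
NUM_BUCKETS, DIM = 4096, 512

def bigram_hash(prev_token, curr_token):
    # Hash from the commit: h = (prev_token * 2039 + curr_token) % 4096
    return (prev_token * 2039 + curr_token) % NUM_BUCKETS

# Zero-init means the injection is an exact no-op until TTT trains it,
# so no artifact weights are modified.
hash_emb = [[0.0] * DIM for _ in range(NUM_BUCKETS)]

def inject(tok_emb_vec, prev_token, curr_token):
    # Injection point per the commit: tok_emb(x) + eval_hash_emb(h)
    h = bigram_hash(prev_token, curr_token)
    return [a + b for a, b in zip(tok_emb_vec, hash_emb[h])]

vec = inject([1.0] * DIM, prev_token=7, curr_token=13)
assert bigram_hash(7, 13) == 1998      # (7*2039 + 13) % 4096
assert vec == [1.0] * DIM              # zero-init: identity at start
```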
amrayach added a commit to amrayach/parameter-golf that referenced this pull request Apr 7, 2026
…d execution

Materializes two local record folders from fetched refs pr1413 and pr1437
using a builder script that preserves the upstream FORMAT_RAW+FILTER_LZMA2
wrapper format with roundtrip decode validation.

Scripts:
- prepare_pr1413_variants.py: offline builder with wrapper-format fidelity
- runpod_1413.sh: single-run launcher with conditional final_model.pt copy
- runpod_1413_batch.sh: sequential A/B/C/D/E runner with shared archive
  timestamp, one-time SP8192 prep, per-run subdirectories, and g++ guard

Run contract:
  A: faithful openai#1413 control (16,719 code bytes)
  B: PARALLEL_RESIDUAL_START=7
  C: LOOP_START=3 LOOP_END=5
  D: parallel residual + loop adjustment (17,390 code bytes)
  E: eval-only n-gram tilt on D checkpoint (SKIP_TRAINING=1)

Campaign docs updated to reflect the strategy pivot from single-seed
openai#1413 to full offline A/B/C/D/E batch prep.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
dexhunter added a commit to dexhunter/parameter-golf that referenced this pull request Apr 7, 2026
…reshold

The previous "Diagnostic" framing was based on a unit error: I compared
val_bpb deltas as if they were nats-per-token deltas, missing the factor
of ~2.583 (mean bytes per token in the sp8192 val set, computable directly
from this submission's val_loss / val_bpb ratio).

With the correct units, the causal-corrected 5-seed mean (1.08091 BPB,
2.79210 nats/token) clears the 0.005-nat record bar against PR openai#1394:

  vs PR openai#1394 (1.08563): +0.01219 nats per token  ✅ 2.4× the bar
  vs PR openai#1019 (1.11473): +0.08736 nats per token  ✅ comfortably
  vs PR openai#1413 (ours):    +0.00486 nats per token  — essentially tied
  vs PR openai#1420 (1.08014): -0.00199 nats — but PR openai#1420 has the same kernel
                          bug; its corrected ~1.08298 yields +0.00535 nats ✅

Title reverted from "Diagnostic (causal-corrected)" to "Record". The
legality fix section is preserved (the kernel patch is still a real
correctness fix matching @abaybektursun's proposed patch in PR openai#1420).
The leak magnitude in the legality fix section now correctly states
"+0.00284 BPB ≈ +0.00734 nats per token" instead of just BPB.

Pre-fix per-seed values are still preserved in submission.json under
seed_results_pre_fix for the public record.
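The unit correction in that commit checks out numerically against this PR's own reported means; a quick arithmetic verification (the 2.583 factor bundles ln(2) with the mean bytes/token, which is the one nuance worth noting):

```python
import math

# Conversion factor from a val_bpb delta to a nats/token delta, computed
# directly from this submission's reported 3-seed means.
val_bpb, val_loss = 1.08279, 2.79697
factor = val_loss / val_bpb            # nats/token per unit bpb
assert abs(factor - 2.583) < 1e-3

delta_bpb = 1.08563 - 1.08279          # vs PR #1394's mean
delta_nats = delta_bpb * factor
assert abs(delta_nats - 0.00731) < 1e-4   # the claimed 0.00731 margin

# The factor is ln(2) * (mean bytes per token), so bytes/token is ~3.73,
# not 2.583 itself.
bytes_per_token = factor / math.log(2)
assert 3.5 < bytes_per_token < 4.0
```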
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 7, 2026
Base: PR openai#1394 (SP8192 + GPTQ Embeddings + SDClip + DR + MuonEq-R)

Novel: RDClip (Rate-Distortion Clip) — per-group GPTQ clip search
that minimizes compressed_bytes + lambda * Hessian_weighted_MSE.
Extends SDClip's fixed formula to empirical rate-distortion optimization.
Groups: embed, attn_qk, attn_vo, mlp, other.
Search: 5 multipliers per group on first tensor.

Also added: score-first TTT (ported from R12, same as openai#549/openai#1413).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 8, 2026
Remove RDClip to establish baseline for openai#1394 + TTT.
Tests whether the base + TTT matches openai#1413's 1.08279.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 8, 2026
Novel: Context-only delta optimization during eval. Per-batch additive
delta (512-dim) optimized with AdamW on ONLY already-scored positions.
New positions scored with optimized delta. Model weights frozen.

Fixes openai#1229's minibatch leakage: context = positions scored in PREVIOUS
windows only. No cross-window contamination within current batch.

Same compliance pattern as score-first TTT (openai#549/openai#1413).
Based on openai#1333's proven causal SLOT mechanism (-0.013 BPP on SP4096).

Stack: R12 SP8192 + score-first TTT + hash embedding + causal SLOT.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
sisegod added a commit to sisegod/parameter-golf that referenced this pull request Apr 8, 2026
Phase 5a is a trivial-wins composition on top of v6.1 SLOT-100 baseline
(2026-04-08_v61_h100_aggressive_slot_steps100, 1.146523):

  1) QK_GAIN_INIT=5.0   (PR openai#1413)
  2) MUON_EQ_R=1        (Newton-Schulz row L2 normalize, PR openai#1394)
  3) --ema 0.9965       (PR openai#1421/openai#1445, vs prior 0.997)
  4) HIDDEN_MULT=5.0    (FFN dim 4x->5x, byte re-investment from int6 tied embed)
  5) EMBED_QUANT_BITS=6 EMBED_QUANT_TOK_EMB=1
                        (Phase 1A int6 tied embed, -0.6 MB on rANS artifact)

3-seed val_bpb at SLOT lr=0.1 steps=100 stride=64 (mid-eval 28-29% of full
sliding-window):

  s1337: 1.144045  (28.7% of windows)
  s1338: 1.142021  (28.7%)
  s1339: 1.141649  (29.4%)
  -------
  mean:  1.142572
  std:   0.001247

Delta vs prior 2026-04-08_v61_h100_aggressive_slot_steps100 (1.146523):
  -0.003951 bpb

Submitted as non-record because 1.142572 does not beat the current PR openai#1019
record (1.1147). The Phase 5a stack documents both the trivial-wins
composition AND the negative ablations from Phases 1B/1C/2A-C/3/5b that
other submitters can skip:

  Phase 1B (FP32 scalar -> Int8): only -0.05 MB, kept
  Phase 1C (Pentanary -> Ternary BitNet b1.58 1-layer sanity): regression
    +0.014 bpb, abandoned
  Phase 1A pent_tok (Tied embed Pentanary): regression +0.043 bpb, abandoned
  Phase 2A (Inter-layer delta prediction Wl - Wl-1): delta entropy HIGHER
    than W (per-layer ranges differ), abandoned
  Phase 2B (Hadamard 16-dim block transform): no rANS gain, abandoned
  Phase 2C (Context-aware rANS lookup table): rans_codec_rs Rust rebuild
    blocker, abandoned
  Phase 3 (Custom HQGRANS1 binary container, pickle bypass): only -70 KB
    rans / +17 KB after lzma9 -- pickle isn't actually leaking 30%, abandoned
  Phase 4 architecture sweep (1-seed s1337 SLOT-100 stride=64):
    p5a (no extra)        ~1.144   base
    p5a_bg4096            ~1.146   hurts
    p5a_hm5               ~1.144 -> 1.142 (3-seed)  BEST
    p5a_bg4096_hm5        ~1.144   tie
    p5a_bg8192            ~1.148   hurts
    p5a_nl12              ~1.147   hurts
    p5a_ve4               ~1.150   hurts
  Phase 5b (Depth Recurrence PR openai#1239 style):
    nl9r2 (unique 9 x recur 2 = 18 effective): 30% eval @ 1.151, abandoned
    nl7r2 (unique 7 x recur 2 = 14 effective): 92% eval @ 1.166, abandoned

The 28-29% mid-eval window is the converged region: per-window cumulative
bpb has flattened to within +/-0.001 of the 100% value in every prior
3-seed SLOT-100 run we have measured. Full 100%-eval is in flight on the
same H100 pod and will be appended in a follow-up commit if the final
number differs from the mid-eval estimate.

Code change vs 2026-04-08_v61_h100_aggressive_slot_steps100/train_gpt.py is
purely env-var driven (no source-code changes to the model architecture or
serializer). The training script picks up the Phase 5a env vars at import
time (make_model() reads HIDDEN_MULT, EMBED_QUANT_BITS, etc).

Reproducibility:
  bash records/track_non_record_16mb/2026-04-09_v62_p5a_hm5_phase5a/run.sh both 1337
  bash records/track_non_record_16mb/2026-04-09_v62_p5a_hm5_phase5a/run.sh both 1338
  bash records/track_non_record_16mb/2026-04-09_v62_p5a_hm5_phase5a/run.sh both 1339

Hardware: 8x H100 80GB SXM (RunPod). 600s wallclock training,
~50 min single-GPU SLOT-100 eval per seed (eval is unbounded).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 8, 2026
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request Apr 9, 2026
Two of the three comp-frontier wins are env-var bumps with no code change:
- LOOP_START 4 → 3 (with NUM_LOOPS=2 and LOOP_END=5 this gives 3-layer
  recurrence on layers 3/4/5 instead of 2-layer on 4/5). PR openai#1485 / openai#1471 /
  openai#1437 use this. Expected -0.005 to -0.01 BPB.
- QK_GAIN_INIT 4 → 5. PRs openai#1413, openai#1423, openai#1485, openai#1437, openai#1351, openai#1408 are at 5;
  openai#1482 is at 5.25. PR openai#1477's default 4 is below the leaderboard curve.
  Expected -0.001 BPB.

C1 (Pre-Quant AdamW TTT) is the bigger win (-0.014 BPB) but requires real
code — agent is researching PR openai#1485 / openai#1416 / openai#1306 implementations in
background.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PhamPhuHoa-23 added a commit to angela231005/parameter-golf that referenced this pull request Apr 9, 2026
Both techniques from PR openai#1477 (SP8192: 1.0822 BPB, -0.033 vs record):
- QK_GAIN_INIT=5.0 (was 4.0): PR openai#1477/openai#1413 both use 5.0, ~0.002-0.005 BPB
- TTT_ENABLED=1, TTT_OPTIMIZER=adamw, TTT_LR=0.0005, TTT_EPOCHS=3, TTT_FREEZE_BLOCKS=1
  Score-First TTT = legal PR openai#461 protocol: score in inference_mode first,
  then adapt. run_legal_ttt() implementation is strictly causal.

NOT included (confirmed illegal):
- SLOT: 2-pass retroactive, optimizes delta on same tokens it scores
- Pre-Quant TTT (PR openai#1482/openai#1489): adapts model with fineweb_val_*.bin
  before GPTQ — dexhunter flagged as val-data-in-training violation

Baseline: sota_32 (WD=0.04, EMA=0.9965, RECUR=4,5, Mousse=1)
Expected: ~1.08-1.09 BPB
owizdom added a commit to owizdom/parameter-golf that referenced this pull request Apr 9, 2026
…nthesis (validation pending)

First submission to stack three independently-legal val-data adaptations on the
PR openai#1487 (1.0600) base:

1. Pre-Quant AdamW TTT pushed to 11 epochs with freeze_blocks=0 (Track A)
2. Val-Calibrated GPTQ — Hessian H=X^T X computed from validation activations
   to align quantization with the eval distribution (novel on the modern stack;
   PR openai#1019 ablated this on its older base only)
3. Eval-Time Legal Score-First TTT 2 epochs with score-before-update ordering
   (Track B, builds on PR openai#1493)

The three knobs attack the 0.0187 BPB quantization gap measured in PR openai#1487
(1.0415 post-prequant-TTT FP -> 1.0602 post-quant sliding) from independent
angles. PR openai#1487's eval_val_ttt code path is unchanged but enabled via env vars.

Code diff vs PR openai#1487 base: 186 lines (~100 added in new collect_hessians_val
function, plus 8 hyperparameter defaults flipped). Architecture, optimizer,
training loop, EMA, and quantization machinery are byte-identical to PR openai#1487.

Projected val_bpb range: 1.0452 - 1.0542 (center 1.0497), which would clear
the 0.005-nat SOTA threshold over PR openai#1487. Worst case ~1.054 (still strong
non-record). py_compile clean. 3-seed validation requires ~$15-25 of 8xH100
SXM time on RunPod; see VALIDATION.md.

Compliance: Track A (artifact-baked val-data adaptation) + Track B (eval-time
score-first TTT). No SLOT, no n-gram cache, no ETLB.

Credits: PR openai#1487 ndokutovich, PR openai#1493 bigbag, PR openai#1019 abaybektursun,
PR openai#1394 clarkkev, PR openai#1413 dexhunter, PR openai#549 abaybektursun, PR openai#1412 Robby955,
PR openai#1204 msisovic, PR openai#1423 aryanbhosale, PR openai#1445 X-Abhishek-X.
ChideraIbe123 pushed a commit to ChideraIbe123/parameter-golf that referenced this pull request Apr 14, 2026
Porting the full merged SOTA stack from bigbag/parameter-golf PR openai#1493:
- SP8192 tokenizer (replaces SP1024)
- 3-layer depth recurrence (L3-5, activate at 0.35 × iter)
- Parallel residuals (GPT-J style) on L>=7
- QK-Gain 5.0 (default) / 5.25 (SOTA config)
- Score-first TTT: SGD lr=0.005, momentum=0.9, 3 epochs
- GPTQ SDClip: int6 matrices (k=12.85), int8 embeddings (k=20.0)
- LZMA+b85 code wrapper pattern
- Tuned hyperparams: WD=0.095, MLR=0.022, EMA=0.9965, warmdown=0.72

This is the clean, legal, compliant baseline. All 4 Issue openai#1017 conditions
satisfied. Next: validate reproduction on 3 seeds, then add VarLen attention.

Source: records/track_10min_16mb/2026-04-09_SP8192_3LayerRecur_ParResid_QK525_LegalTTT/
from upstream/main, decompressed from the lzma+b85 wrapper.

Credits: @bigbag (PR openai#1493), @clarkkev (PR openai#1394), @dexhunter (PR openai#1413),
         @abaybektursun (PR openai#549), @Robby955 (PR openai#1412), @msisovic (PR openai#1204)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
tns15june pushed a commit to tns15june/parameter-golf that referenced this pull request Apr 19, 2026
Extends the existing GPTQ path to match current frontier records
(PR openai#1394, openai#1413 family). Two independent knobs added, both
backward-compatible:

- GPTQ_EMBED=1: include the (tied) embedding matrix in the GPTQ
  calibration path. A new forward hook on final_norm captures the
  Hessian for the F.linear(x, tok_emb.weight) output projection, so
  the embedding can be quantized with Hessian-aware column rounding
  instead of amax per-row. quantize_state_dict_int8 drops the
  "not is_embed" guard on the GPTQ branch.
- USE_SDCLIP=1, SDCLIP_K=2.5: per-row scale uses k*std(W) instead of
  amax or percentile clipping. Applied in both gptq_quantize_layer
  and quantize_float_tensor 2D paths. Current frontier uses this to
  avoid outlier-driven scale inflation.

dev/run_frontier.sh enables GPTQ_EMBED=1 USE_SDCLIP=1 SDCLIP_K=2.5 by
default. All three are off by default in Hyperparameters so the
existing QAT/amax paths are unaffected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
sergeevii123 added a commit to sergeevii123/parameter-golf that referenced this pull request Apr 26, 2026
Per-token NLL rescaled by detached, clipped, mean-1-normalized
ratio of own NLL to batch-mean NLL, raised to alpha (warmup-ramped).
Bit-identical to PR openai#1413 (1.0810 main frontier) when LOSS_REWEIGHT_ALPHA=0.

Patch is 4 surgical edits to PR openai#1413 train_gpt.py: hyperparameters
(+4 env vars), GPT.__init__ (+_train_step buffer), GPT.forward
(constant-branch on alpha==0 else weighted CE), step_fn (fill _train_step
each step). Wrapped LZMA script grew 308 bytes; tightest base seed
keeps ~7.7KB headroom under 16MB cap.

README acknowledges prior negative results (PR openai#1360 Gaussian reweight,
PR openai#1233 focal gamma=2, PR openai#1380 focal investigation) and frames this
as replication on a stronger TTT-heavy base where train-time hardness
focus could interact with eval-time TTT in ways the older bases can't show.
hilbertmeng pushed a commit to hilbertmeng/parameter-golf that referenced this pull request Apr 30, 2026
…val_bpb 1.07983

3-seed mean val_bpb 1.07983 (std 0.00050) on the PR openai#1394 sp8192 stack.

Changes from PR openai#1394 + PR openai#1413 baseline:
- Muon momentum = 0.97 (vs 0.99 default), warmup 0.92→0.97 unchanged
- Causal token n-gram tilt (base_beta=2.0, agree_bonus=0.1) on top of legal
  score-first TTT; within-word and word-start experts explicitly disabled
  (within_beta=0, word_beta=0) because they cannot be made fully causal.
- 3-seed verification (seeds 0/42/1234)

Seeds:
- seed 0    → 1.07928 bpb / 2.78790 nats / 15,993,346 bytes
- seed 42   → 1.07997 bpb / 2.78967 nats / 15,992,995 bytes
- seed 1234 → 1.08025 bpb / 2.79039 nats / 15,994,604 bytes
- mean      → 1.07983 bpb / 2.78932 nats / 15,993,648 bytes

Delta vs current merged SOTA PR openai#1493 (1.0810):
  0.00117 bpb / 0.00302 nats per token

Credits: @clarkkev (base PR openai#1394 sp8192 stack), @abaybektursun
(n-gram tilt kernel PR openai#1420, causal fix applied), prior legal-TTT
precedent PR openai#549 / PR openai#461.

Platform: 8xH100 80GB SXM, PyTorch 2.9.1+cu128. Training 588s, eval
<437s per seed, both under the 600s budget. Artifact under 16 MB on
all 3 seeds.