Record: SP8192 + Depth Recurrence x2 + GPTQ + Score-First TTT + fused-softcap-ce -- val_bpb 1.07974 (3-seed mean)#1572

Open
anthony-maio wants to merge 5 commits into openai:main from anthony-maio:submission/sp8192-frontier

Conversation

@anthony-maio

Summary

val_bpb = 1.07974 (3-seed mean, std 0.00058) | ~15.99 MB | 8xH100 SXM

Seed   Sliding BPB   TTT BPB   Artifact (bytes)
1337   1.08048       1.07907   15,992,450
42     1.08159       1.08014   15,989,801
2024   1.08124       1.08001   15,989,704
Mean   1.08110       1.07974   15,990,652
Std    --            0.00058   --

Stack

  • SP8192 tokenizer
  • 11-layer transformer with depth recurrence x2: layers 3-5 looped, unrolled virtual-layer schedule encoder [0,1,2,3,4,5,3,4] / decoder [5,3,4,5,6,7,8,9,10] (see the sketch after this list)
  • Loop activates at 35% of training (~step 2026)
  • QK-Gain 5.25, EMA 0.997
  • GPTQ INT6 weights + INT8 embeddings + brotli
  • Score-first TTT: SGD lr=0.005, 3 epochs, 1238 chunks (~5.5 min eval)
  • Train script lzma+base85+exec compressed to 16,793 bytes
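A minimal sketch (not the submission's code) of how an unrolled schedule like the one above can drive a looped forward pass; `blocks`, `forward_with_recurrence`, and the plain 0-10 fallback before loop activation are illustrative assumptions.

```python
import torch.nn as nn

# Unrolled virtual-layer schedule from the stack description above: the
# recurrent band (physical layers 3-5) is visited more than once, so the
# model acts deeper than its 11 stored blocks.
ENCODER_SCHEDULE = [0, 1, 2, 3, 4, 5, 3, 4]
DECODER_SCHEDULE = [5, 3, 4, 5, 6, 7, 8, 9, 10]

def forward_with_recurrence(blocks: nn.ModuleList, x, loop_enabled: bool):
    # Before the loop activates (~35% of training), use a plain 0..10 pass;
    # afterwards, walk the unrolled schedule and re-enter the looped band.
    if loop_enabled:
        schedule = ENCODER_SCHEDULE + DECODER_SCHEDULE
    else:
        schedule = list(range(len(blocks)))
    for layer_idx in schedule:
        x = blocks[layer_idx](x)
    return x
```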

Original Contributions

The train_gpt.py artifact uses an lzma+base85+exec shim: the full source is LZMA2-compressed and base85-encoded into a two-line exec wrapper. The compressed artifact comes out to 16,793 bytes vs. 58,367 bytes raw, which frees ~41 KB of the 16 MB budget for model weights.
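A minimal sketch of how such a wrapper can be generated, assuming the uncompressed source lives in train_gpt_sota.py; the exact stub layout in the submission may differ.

```python
import base64
import lzma

SRC = "train_gpt_sota.py"   # uncompressed reference script (58,367 bytes in this PR)
OUT = "train_gpt.py"        # compressed submission artifact (16,793 bytes in this PR)

with open(SRC, "rb") as f:
    # Python's base85 alphabet contains no quotes or backslashes, so the
    # payload can be embedded directly inside a string literal.
    payload = base64.b85encode(lzma.compress(f.read(), preset=9)).decode()

# The shipped artifact is just a two-line stub that decompresses and execs
# the embedded source when run.
stub = (
    "import base64,lzma\n"
    f'exec(lzma.decompress(base64.b85decode("{payload}")))\n'
)
with open(OUT, "w") as f:
    f.write(stub)
```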

The submission also ships a requirements.txt that installs fused-softcap-ce, a pip-packaged CUDA kernel that fuses the softcap*tanh(x/softcap) logit squashing with cross-entropy in a single pass. It is 3.63x faster than the unfused PyTorch path on H100 for forward scoring. The kernel is forward-only, so it runs during TTT scoring and sliding-window eval, and it falls back gracefully to stock PyTorch if the kernel isn't available at eval time.
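For reference, this is the unfused PyTorch computation the kernel replaces, plus the fallback pattern described above; the import path `fused_softcap_ce`, its call signature, and the default softcap value here are assumptions, not the package's documented API.

```python
import torch
import torch.nn.functional as F

def softcap_cross_entropy_reference(logits, targets, softcap: float = 15.0):
    # Unfused reference of the fused kernel's math: squash logits to
    # (-softcap, softcap) via softcap * tanh(x / softcap), then take
    # standard cross-entropy against the targets.
    capped = softcap * torch.tanh(logits / softcap)
    return F.cross_entropy(capped.float(), targets)

def softcap_cross_entropy(logits, targets, softcap: float = 15.0):
    # Graceful fallback: use the fused kernel when the package is
    # installed, otherwise fall back to stock PyTorch.
    try:
        # Hypothetical import and signature -- check the package for the real API.
        from fused_softcap_ce import fused_softcap_cross_entropy
        return fused_softcap_cross_entropy(logits, targets, softcap=softcap)
    except ImportError:
        return softcap_cross_entropy_reference(logits, targets, softcap)
```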

Compliance (Track B -- Score-First TTT)

Per Issue #1017:

  • Each chunk is scored under no_grad() before any TTT gradient step (see the sketch after this list)
  • Single left-to-right pass, no rescoring
  • No pre-quant TTT, no SLOT, no n-gram cache
  • All artifacts < 16,000,000 bytes, train < 600s, eval < 600s
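A minimal sketch of the score-first ordering these rules require (function and variable names are assumptions): every chunk is scored with the current weights under no_grad() before any TTT gradient step, in a single left-to-right pass with no rescoring.

```python
import math
import torch

def eval_val_ttt(model, chunks, lr: float = 0.005, epochs: int = 3):
    # chunks: iterable of (inputs, targets), visited once, left to right.
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    total_nll, total_tokens = 0.0, 0
    for inputs, targets in chunks:
        with torch.no_grad():
            loss = model(inputs, targets)       # score FIRST, with frozen weights
        total_nll += loss.item() * targets.numel()
        total_tokens += targets.numel()
        for _ in range(epochs):                 # only then adapt on this chunk
            opt.zero_grad(set_to_none=True)
            model(inputs, targets).backward()
            opt.step()
    # mean bits per token (per-byte normalization omitted for brevity)
    return total_nll / total_tokens / math.log(2)
```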

Credits

PR #1493 @bigbag -- base config (loop_start=3, loop_end=5, num_loops=2, enable_looping_at=0.35, parallel_residual_start=7)
PR #1394 @clarkkev -- depth recurrence + GPTQ + SDClip pipeline
PR #1420 @abaybektursun -- SP8192 tokenizer integration

Copilot AI review requested due to automatic review settings April 12, 2026 18:48
Copilot AI (Contributor) left a comment


Pull request overview

Adds a new Track 10min / 16MB record submission directory (2026-04-12_SP8192_Frontier) including the training/eval artifact, dependency manifest, and 3-seed logs supporting the reported BPB.

Changes:

  • Added compressed train_gpt.py submission artifact (LZMA+base85+exec wrapper) and an uncompressed reference script (train_gpt_sota.py).
  • Added requirements.txt for an optional fused softcap cross-entropy kernel.
  • Added training/eval logs for seeds 42/1337/2024 (baseline + “frontier” runs with TTT).

Reviewed changes

Copilot reviewed 3 out of 9 changed files in this pull request and generated 4 comments.

Summary per file:

records/track_10min_16mb/2026-04-12_SP8192_Frontier/train_gpt.py -- Compressed submission entrypoint that execs the embedded source.
records/track_10min_16mb/2026-04-12_SP8192_Frontier/train_gpt_sota.py -- Uncompressed reference implementation of the submission training/eval pipeline.
records/track_10min_16mb/2026-04-12_SP8192_Frontier/requirements.txt -- Adds external dependency for fused scoring kernel.
records/track_10min_16mb/2026-04-12_SP8192_Frontier/train_seed42.log -- Baseline seed-42 training/eval log.
records/track_10min_16mb/2026-04-12_SP8192_Frontier/train_seed42_frontier.log -- Frontier seed-42 training/eval + TTT log.
records/track_10min_16mb/2026-04-12_SP8192_Frontier/train_seed1337.log -- Baseline seed-1337 training/eval log.
records/track_10min_16mb/2026-04-12_SP8192_Frontier/train_seed1337_frontier.log -- Frontier seed-1337 training/eval + TTT log.
records/track_10min_16mb/2026-04-12_SP8192_Frontier/train_seed2024.log -- Baseline seed-2024 training/eval log.
records/track_10min_16mb/2026-04-12_SP8192_Frontier/train_seed2024_frontier.log -- Frontier seed-2024 training/eval + TTT log.


Outdated review thread on records/track_10min_16mb/2026-04-12_SP8192_Frontier/requirements.txt
anthony-maio added a commit to anthony-maio/parameter-golf that referenced this pull request Apr 14, 2026
…is regressive on our SP8192 + depth recurrence stack

Three configs tested at seed 42 on 8xH100 SXM:
- VarLen + Fused MLP: 1.93 pre-quant val_bpb, 1440 steps, 2.3M tok/s (3.4x slower)
- Fused MLP only: 1.110 pre-quant val_bpb, 2581 steps, 3.4M tok/s (2.3x slower)
- Pure baseline reproduction: pod terminated mid-run before completion

Root cause: VarLen + depth recurrence + fullgraph torch.compile triggers cascading
shape recompilations (combinatorial explosion of loop_iter x cu_seqlens shape)
that overflow even a 64-entry compile cache. Fused MLP Triton kernel has per-call
TensorDescriptor allocation overhead that doesn't amortize for our hidden_dim=2048.

Conclusion: do not ship this port. PR openai#1572 (1.07974) remains best submission.
Move 2 (per-layer GPTQ from PR openai#1586) and Move 3 (LoRA TTT from PR openai#1530, eval-only
so no torch.compile recompile concern) are still viable next directions.
anthony-maio added a commit to anthony-maio/parameter-golf that referenced this pull request Apr 14, 2026
…192 stack

Config-level changes only, no kernel/compile changes that could interact with
our depth recurrence stack (unlike VarLen port in submission/sp8192-varlen-frontier):

- MLP_CLIP_SIGMAS 12.0 (tight, preserve MLP precision)
- ATTN_CLIP_SIGMAS 13.0 (looser, save bytes on attention weights)
- EMBED_BITS 8 -> 7 with EMBED_CLIP_SIGMAS 20.0 -> 15.0 (~530 KB artifact savings)
- MATRIX_LR 0.022 -> 0.026 (dexhunter 6-point sweep optimum)
- WARMDOWN_FRAC 0.72 -> 0.75 (longer peak LR window)

Dexhunter measured 1.07493 BPB (3-seed mean) applying these against PR openai#1530 base.
Against our 1.07974 SP8192 baseline the expected delta is in the 0.003-0.005 BPB
range; the adaptive clip is stack-independent and the embed-bits + LR tweaks are
universal. Fresh branch from upstream/main per PR hygiene (PR openai#1572 untouched).
anthony-maio added a commit to anthony-maio/parameter-golf that referenced this pull request Apr 14, 2026
Seed 42 quantized_ttt 1.25076 vs sliding 1.08292 on AP-IN-1 pod. Eval time 910s
(non-compliant, exceeds 600s cap). Root cause: two LoRA application semantics
bugs -- mlp_lora has wrong output dim (hidden_dim vs dim) and wrong integration
point (inner tweak vs residual bypass), o_lora takes wrong input tensor.

Session aggregate: 3 consecutive samacqua-derived improvements (VarLen, per-layer
GPTQ, LoRA TTT) either regressed or were neutral on our SP8192 + depth recurrence
stack. His improvements are co-tuned with his architecture; porting requires
structural rework that would amount to reproducing his submission.

PR openai#1572 (1.07974) stays our best.
anthony-maio added a commit to anthony-maio/parameter-golf that referenced this pull request Apr 14, 2026
Applied council-recommended fixes: mlp_lora (bsz,dim,dim) parallel residual bypass,
o_lora takes pre-attn norm, loop-layer tighter clip_sigmas, pod speedgate. Seed 42
on fast AP-IN-1 pod (100.6ms/step, passed gate):

- sliding_window: 1.08150 (equivalent to pre-fix)
- TTT: 1.25268 (identical regression pattern to pre-fix 1.25076)
- eval: 882s (still non-compliant)
- artifact: 16.41 MB (now non-compliant, was 15.98 before LOOP_CLIP_SIGMAS=10)

The semantic bugs were real but NOT the root cause. Deeper bug remains in
forward_ttt or LoRA optimizer loop. 4 consecutive Lineage B port attempts
have confirmed this lineage does not graft onto our depth recurrence stack.

PR openai#1572 (1.07974) remains best. Stopping Lineage B ports.
anthony-maio added a commit to anthony-maio/parameter-golf that referenced this pull request Apr 18, 2026
Per-doc per-slot LoRA banks indexed by position in the unrolled
encoder/decoder schedule, not by physical block index. With NUM_LOOPS=2
the recurrent band (layers 3-5) appears 3x in the schedule; this
design gives each call site its own adapter while sharing the frozen W.
That is the structural fix that lets Lineage B style batched LoRA TTT
compose with our depth recurrence.

Also generalizes TTT parameter selection on top of the existing
TTT_SELECTIVE_LAYERS knob:
- TTT_LAYER_IDS for explicit layer targeting
- TTT_MODE in {control, qv, qkv, full, none}
- TTT_INCLUDE_GLOBAL adds skip_weights and skip_gates

CUDA event phase timing in eval_val_ttt logs scored-fwd, ttt-fwd,
backward, all-reduce, and optimizer separately so we can decide whether
further investment goes into eval kernel work or into TTT tuning.

All new behavior is opt-in via env. Defaults preserve PR openai#1572
(1.07974 BPB) behavior. Score forward falls back to eager when the
LoRA bank is attached because dynamic per-batch slicing of the bank
breaks fullgraph compile -- known followup.

CPU smoke tests in test_scaffolding.py cover shape correctness, slot
guard (inactive slot perturbations produce zero output diff), gradient
flow into active vs inactive slots, and all five TTT_MODE selections.