Record: SP8192 + Depth Recurrence x2 + GPTQ + Score-First TTT + fused-softcap-ce -- val_bpb 1.07974 (3-seed mean)#1572

Open
anthony-maio wants to merge 5 commits into openai:main from anthony-maio:submission/sp8192-frontier

Conversation

@anthony-maio

Summary

val_bpb = 1.07974 (3-seed mean, std 0.00058) | ~15.99 MB | 8xH100 SXM

Seed   Sliding BPB   TTT BPB   Artifact (bytes)
1337   1.08048       1.07907   15,992,450
42     1.08159       1.08014   15,989,801
2024   1.08124       1.08001   15,989,704
Mean   1.08110       1.07974   15,990,652
Std    --            0.00058   --

Stack

  • SP8192 tokenizer
  • 11-layer transformer with depth recurrence x2: layers 3-5 looped, unrolled virtual-layer schedule encoder [0,1,2,3,4,5,3,4] / decoder [5,3,4,5,6,7,8,9,10] (see the sketch after this list)
  • Loop activates at 35% of training (~step 2026)
  • QK-Gain 5.25, EMA 0.997
  • GPTQ INT6 weights + INT8 embeddings + brotli
  • Score-first TTT: SGD lr=0.005, 3 epochs, 1238 chunks (~5.5 min eval)
  • Train script lzma+base85+exec compressed to 16,793 bytes
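A minimal sketch (not the submission's code) of how an unrolled schedule like the one above can drive a looped forward pass; `blocks`, `forward_with_recurrence`, and the plain 0-10 fallback before loop activation are illustrative assumptions.

```python
import torch.nn as nn

# Unrolled virtual-layer schedule from the stack description above: the
# recurrent band (physical layers 3-5) is visited more than once, so the
# model acts deeper than its 11 stored blocks.
ENCODER_SCHEDULE = [0, 1, 2, 3, 4, 5, 3, 4]
DECODER_SCHEDULE = [5, 3, 4, 5, 6, 7, 8, 9, 10]

def forward_with_recurrence(blocks: nn.ModuleList, x, loop_enabled: bool):
    # Before the loop activates (~35% of training), use a plain 0..10 pass;
    # afterwards, walk the unrolled schedule and re-enter the looped band.
    if loop_enabled:
        schedule = ENCODER_SCHEDULE + DECODER_SCHEDULE
    else:
        schedule = list(range(len(blocks)))
    for layer_idx in schedule:
        x = blocks[layer_idx](x)
    return x
```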

Original Contributions

The train_gpt.py artifact uses an lzma+base85+exec shim: the full source is LZMA2-compressed and base85-encoded into a two-line exec wrapper. The compressed artifact comes out to 16,793 bytes vs. 58,367 bytes raw, which frees ~41 KB of the 16 MB budget for model weights.
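A minimal sketch of how such a wrapper can be generated, assuming the uncompressed source lives in train_gpt_sota.py; the exact stub layout in the submission may differ.

```python
import base64
import lzma

SRC = "train_gpt_sota.py"   # uncompressed reference script (58,367 bytes in this PR)
OUT = "train_gpt.py"        # compressed submission artifact (16,793 bytes in this PR)

with open(SRC, "rb") as f:
    # Python's base85 alphabet contains no quotes or backslashes, so the
    # payload can be embedded directly inside a string literal.
    payload = base64.b85encode(lzma.compress(f.read(), preset=9)).decode()

# The shipped artifact is just a two-line stub that decompresses and execs
# the embedded source when run.
stub = (
    "import base64,lzma\n"
    f'exec(lzma.decompress(base64.b85decode("{payload}")))\n'
)
with open(OUT, "w") as f:
    f.write(stub)
```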

The submission also ships a requirements.txt that installs fused-softcap-ce, a pip-packaged CUDA kernel that fuses the softcap*tanh(x/softcap) logit squashing with cross-entropy in a single pass. It is 3.63x faster than the unfused PyTorch path on H100 for forward scoring. The kernel is forward-only, so it runs during TTT scoring and sliding-window eval, and it falls back gracefully to stock PyTorch if the kernel isn't available at eval time.
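For reference, this is the unfused PyTorch computation the kernel replaces, plus the fallback pattern described above; the import path `fused_softcap_ce`, its call signature, and the default softcap value here are assumptions, not the package's documented API.

```python
import torch
import torch.nn.functional as F

def softcap_cross_entropy_reference(logits, targets, softcap: float = 15.0):
    # Unfused reference of the fused kernel's math: squash logits to
    # (-softcap, softcap) via softcap * tanh(x / softcap), then take
    # standard cross-entropy against the targets.
    capped = softcap * torch.tanh(logits / softcap)
    return F.cross_entropy(capped.float(), targets)

def softcap_cross_entropy(logits, targets, softcap: float = 15.0):
    # Graceful fallback: use the fused kernel when the package is
    # installed, otherwise fall back to stock PyTorch.
    try:
        # Hypothetical import and signature -- check the package for the real API.
        from fused_softcap_ce import fused_softcap_cross_entropy
        return fused_softcap_cross_entropy(logits, targets, softcap=softcap)
    except ImportError:
        return softcap_cross_entropy_reference(logits, targets, softcap)
```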

Compliance (Track B -- Score-First TTT)

Per Issue #1017:

  • Each chunk is scored under no_grad() before any TTT gradient step (see the sketch after this list)
  • Single left-to-right pass, no rescoring
  • No pre-quant TTT, no SLOT, no n-gram cache
  • All artifacts < 16,000,000 bytes, train < 600s, eval < 600s
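A minimal sketch of the score-first ordering these rules require (function and variable names are assumptions): every chunk is scored with the current weights under no_grad() before any TTT gradient step, in a single left-to-right pass with no rescoring.

```python
import math
import torch

def eval_val_ttt(model, chunks, lr: float = 0.005, epochs: int = 3):
    # chunks: iterable of (inputs, targets), visited once, left to right.
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    total_nll, total_tokens = 0.0, 0
    for inputs, targets in chunks:
        with torch.no_grad():
            loss = model(inputs, targets)       # score FIRST, with frozen weights
        total_nll += loss.item() * targets.numel()
        total_tokens += targets.numel()
        for _ in range(epochs):                 # only then adapt on this chunk
            opt.zero_grad(set_to_none=True)
            model(inputs, targets).backward()
            opt.step()
    # mean bits per token (per-byte normalization omitted for brevity)
    return total_nll / total_tokens / math.log(2)
```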

Credits

PR #1493 @bigbag -- base config (loop_start=3, loop_end=5, num_loops=2, enable_looping_at=0.35, parallel_residual_start=7)
PR #1394 @clarkkev -- depth recurrence + GPTQ + SDClip pipeline
PR #1420 @abaybektursun -- SP8192 tokenizer integration

Copilot AI review requested due to automatic review settings April 12, 2026 18:48
Copilot AI (Contributor) left a comment


Pull request overview

Adds a new Track 10min / 16MB record submission directory (2026-04-12_SP8192_Frontier) including the training/eval artifact, dependency manifest, and 3-seed logs supporting the reported BPB.

Changes:

  • Added compressed train_gpt.py submission artifact (LZMA+base85+exec wrapper) and an uncompressed reference script (train_gpt_sota.py).
  • Added requirements.txt for an optional fused softcap cross-entropy kernel.
  • Added training/eval logs for seeds 42/1337/2024 (baseline + “frontier” runs with TTT).

Reviewed changes

Copilot reviewed 3 out of 9 changed files in this pull request and generated 4 comments.

Summary per file:

records/track_10min_16mb/2026-04-12_SP8192_Frontier/train_gpt.py -- Compressed submission entrypoint that execs the embedded source.
records/track_10min_16mb/2026-04-12_SP8192_Frontier/train_gpt_sota.py -- Uncompressed reference implementation of the submission training/eval pipeline.
records/track_10min_16mb/2026-04-12_SP8192_Frontier/requirements.txt -- Adds external dependency for fused scoring kernel.
records/track_10min_16mb/2026-04-12_SP8192_Frontier/train_seed42.log -- Baseline seed-42 training/eval log.
records/track_10min_16mb/2026-04-12_SP8192_Frontier/train_seed42_frontier.log -- Frontier seed-42 training/eval + TTT log.
records/track_10min_16mb/2026-04-12_SP8192_Frontier/train_seed1337.log -- Baseline seed-1337 training/eval log.
records/track_10min_16mb/2026-04-12_SP8192_Frontier/train_seed1337_frontier.log -- Frontier seed-1337 training/eval + TTT log.
records/track_10min_16mb/2026-04-12_SP8192_Frontier/train_seed2024.log -- Baseline seed-2024 training/eval log.
records/track_10min_16mb/2026-04-12_SP8192_Frontier/train_seed2024_frontier.log -- Frontier seed-2024 training/eval + TTT log.


Outdated review thread on records/track_10min_16mb/2026-04-12_SP8192_Frontier/requirements.txt
anthony-maio added a commit to anthony-maio/parameter-golf that referenced this pull request Apr 14, 2026
…is regressive on our SP8192 + depth recurrence stack

Three configs tested at seed 42 on 8xH100 SXM:
- VarLen + Fused MLP: 1.93 pre-quant val_bpb, 1440 steps, 2.3M tok/s (3.4x slower)
- Fused MLP only: 1.110 pre-quant val_bpb, 2581 steps, 3.4M tok/s (2.3x slower)
- Pure baseline reproduction: pod terminated mid-run before completion

Root cause: VarLen + depth recurrence + fullgraph torch.compile triggers cascading
shape recompilations (combinatorial explosion of loop_iter x cu_seqlens shape)
that overflow even a 64-entry compile cache. Fused MLP Triton kernel has per-call
TensorDescriptor allocation overhead that doesn't amortize for our hidden_dim=2048.

Conclusion: do not ship this port. PR openai#1572 (1.07974) remains best submission.
Move 2 (per-layer GPTQ from PR openai#1586) and Move 3 (LoRA TTT from PR openai#1530, eval-only
so no torch.compile recompile concern) are still viable next directions.
anthony-maio added a commit to anthony-maio/parameter-golf that referenced this pull request Apr 14, 2026
…192 stack

Config-level changes only, no kernel/compile changes that could interact with
our depth recurrence stack (unlike VarLen port in submission/sp8192-varlen-frontier):

- MLP_CLIP_SIGMAS 12.0 (tight, preserve MLP precision)
- ATTN_CLIP_SIGMAS 13.0 (looser, save bytes on attention weights)
- EMBED_BITS 8 -> 7 with EMBED_CLIP_SIGMAS 20.0 -> 15.0 (~530 KB artifact savings)
- MATRIX_LR 0.022 -> 0.026 (dexhunter 6-point sweep optimum)
- WARMDOWN_FRAC 0.72 -> 0.75 (longer peak LR window)

Dexhunter measured 1.07493 BPB (3-seed mean) applying these against PR openai#1530 base.
Against our 1.07974 SP8192 baseline the expected delta is in the 0.003-0.005 BPB
range; the adaptive clip is stack-independent and the embed-bits + LR tweaks are
universal. Fresh branch from upstream/main per PR hygiene (PR openai#1572 untouched).
anthony-maio added a commit to anthony-maio/parameter-golf that referenced this pull request Apr 14, 2026
Seed 42 quantized_ttt 1.25076 vs sliding 1.08292 on AP-IN-1 pod. Eval time 910s
(non-compliant, exceeds 600s cap). Root cause: two LoRA application semantics
bugs -- mlp_lora has wrong output dim (hidden_dim vs dim) and wrong integration
point (inner tweak vs residual bypass), o_lora takes wrong input tensor.

Session aggregate: 3 consecutive samacqua-derived improvements (VarLen, per-layer
GPTQ, LoRA TTT) either regressed or were neutral on our SP8192 + depth recurrence
stack. His improvements are co-tuned with his architecture; porting requires
structural rework that would amount to reproducing his submission.

PR openai#1572 (1.07974) stays our best.
anthony-maio added a commit to anthony-maio/parameter-golf that referenced this pull request Apr 14, 2026
Applied council-recommended fixes: mlp_lora (bsz,dim,dim) parallel residual bypass,
o_lora takes pre-attn norm, loop-layer tighter clip_sigmas, pod speedgate. Seed 42
on fast AP-IN-1 pod (100.6ms/step, passed gate):

- sliding_window: 1.08150 (equivalent to pre-fix)
- TTT: 1.25268 (identical regression pattern to pre-fix 1.25076)
- eval: 882s (still non-compliant)
- artifact: 16.41 MB (now non-compliant, was 15.98 before LOOP_CLIP_SIGMAS=10)

The semantic bugs were real but NOT the root cause. Deeper bug remains in
forward_ttt or LoRA optimizer loop. 4 consecutive Lineage B port attempts
have confirmed this lineage does not graft onto our depth recurrence stack.

PR openai#1572 (1.07974) remains best. Stopping Lineage B ports.
anthony-maio added a commit to anthony-maio/parameter-golf that referenced this pull request Apr 18, 2026
Per-doc per-slot LoRA banks indexed by position in the unrolled
encoder/decoder schedule, not by physical block index. With NUM_LOOPS=2
the recurrent band (layers 3-5) appears 3x in the schedule; this
design gives each call site its own adapter while sharing the frozen W.
That is the structural fix that lets Lineage B style batched LoRA TTT
compose with our depth recurrence.

Also generalizes TTT parameter selection on top of the existing
TTT_SELECTIVE_LAYERS knob:
- TTT_LAYER_IDS for explicit layer targeting
- TTT_MODE in {control, qv, qkv, full, none}
- TTT_INCLUDE_GLOBAL adds skip_weights and skip_gates

CUDA event phase timing in eval_val_ttt logs scored-fwd, ttt-fwd,
backward, all-reduce, and optimizer separately so we can decide whether
further investment goes into eval kernel work or into TTT tuning.

All new behavior is opt-in via env. Defaults preserve PR openai#1572
(1.07974 BPB) behavior. Score forward falls back to eager when the
LoRA bank is attached because dynamic per-batch slicing of the bank
breaks fullgraph compile -- known followup.

CPU smoke tests in test_scaffolding.py cover shape correctness, slot
guard (inactive slot perturbations produce zero output diff), gradient
flow into active vs inactive slots, and all five TTT_MODE selections.