Record: SP8192 + Depth Recurrence x2 + GPTQ + Score-First TTT + fused-softcap-ce -- val_bpb 1.07974 (3-seed mean) #1572
Open
anthony-maio wants to merge 5 commits into openai:main from
Conversation
… val_bpb 1.07974 (3-seed mean). Seeds 1337, 42, 2024 on 8xH100 SXM with fused-softcap-ce kernel integration.
Contributor
Pull request overview
Adds a new Track 10min / 16MB record submission directory (2026-04-12_SP8192_Frontier) including the training/eval artifact, dependency manifest, and 3-seed logs supporting the reported BPB.
Changes:
- Added compressed train_gpt.py submission artifact (LZMA+base85+exec wrapper) and an uncompressed reference script (train_gpt_sota.py).
- Added requirements.txt for an optional fused softcap cross-entropy kernel.
- Added training/eval logs for seeds 42/1337/2024 (baseline + "frontier" runs with TTT).
Reviewed changes
Copilot reviewed 3 out of 9 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| records/track_10min_16mb/2026-04-12_SP8192_Frontier/train_gpt.py | Compressed submission entrypoint that execs the embedded source. |
| records/track_10min_16mb/2026-04-12_SP8192_Frontier/train_gpt_sota.py | Uncompressed reference implementation of the submission training/eval pipeline. |
| records/track_10min_16mb/2026-04-12_SP8192_Frontier/requirements.txt | Adds external dependency for fused scoring kernel. |
| records/track_10min_16mb/2026-04-12_SP8192_Frontier/train_seed42.log | Baseline seed-42 training/eval log. |
| records/track_10min_16mb/2026-04-12_SP8192_Frontier/train_seed42_frontier.log | Frontier seed-42 training/eval + TTT log. |
| records/track_10min_16mb/2026-04-12_SP8192_Frontier/train_seed1337.log | Baseline seed-1337 training/eval log. |
| records/track_10min_16mb/2026-04-12_SP8192_Frontier/train_seed1337_frontier.log | Frontier seed-1337 training/eval + TTT log. |
| records/track_10min_16mb/2026-04-12_SP8192_Frontier/train_seed2024.log | Baseline seed-2024 training/eval log. |
| records/track_10min_16mb/2026-04-12_SP8192_Frontier/train_seed2024_frontier.log | Frontier seed-2024 training/eval + TTT log. |
anthony-maio added a commit to anthony-maio/parameter-golf that referenced this pull request on Apr 14, 2026
…is regressive on our SP8192 + depth recurrence stack

Three configs tested at seed 42 on 8xH100 SXM:
- VarLen + Fused MLP: 1.93 pre-quant val_bpb, 1440 steps, 2.3M tok/s (3.4x slower)
- Fused MLP only: 1.110 pre-quant val_bpb, 2581 steps, 3.4M tok/s (2.3x slower)
- Pure baseline reproduction: pod terminated mid-run before completion

Root cause: VarLen + depth recurrence + fullgraph torch.compile triggers cascading shape recompilations (a combinatorial explosion of loop_iter x cu_seqlens shapes) that overflows even a 64-entry compile cache. The fused MLP Triton kernel has per-call TensorDescriptor allocation overhead that doesn't amortize at our hidden_dim=2048.

Conclusion: do not ship this port. PR openai#1572 (1.07974) remains the best submission. Move 2 (per-layer GPTQ from PR openai#1586) and Move 3 (LoRA TTT from PR openai#1530, eval-only, so no torch.compile recompile concern) remain viable next directions.
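A minimal repro-shaped sketch of the recompile blow-up described above, assuming one guard set per new cu_seqlens length under fullgraph compile (the function, shapes, and loop are hypothetical stand-ins; the real interaction goes through the VarLen attention path):

```python
import torch

# Hypothetical illustration: under fullgraph compile, each new
# cu_seqlens length seen at a given unrolled loop iteration can add a
# new guard set, so NUM_LOOPS iterations x K packed-batch lengths may
# need up to NUM_LOOPS * K cache entries -- past the 64-entry cap.
torch._dynamo.config.cache_size_limit = 64

@torch.compile(fullgraph=True)
def recurrent_step(x: torch.Tensor, cu_seqlens: torch.Tensor) -> torch.Tensor:
    # stand-in for one depth-recurrence iteration over a VarLen batch
    return x + cu_seqlens.to(x.dtype).sum()

x = torch.randn(8, 16)
for num_docs in range(2, 8):
    cu = torch.arange(num_docs + 1)  # a new cu_seqlens length per batch
    recurrent_step(x, cu)            # may recompile until the cache overflows
```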
anthony-maio added a commit to anthony-maio/parameter-golf that referenced this pull request on Apr 14, 2026
…192 stack

Config-level changes only; no kernel or compile changes that could interact with our depth recurrence stack (unlike the VarLen port in submission/sp8192-varlen-frontier):
- MLP_CLIP_SIGMAS 12.0 (tight, preserves MLP precision)
- ATTN_CLIP_SIGMAS 13.0 (looser, saves bytes on attention weights)
- EMBED_BITS 8 -> 7 with EMBED_CLIP_SIGMAS 20.0 -> 15.0 (~530 KB artifact savings)
- MATRIX_LR 0.022 -> 0.026 (dexhunter 6-point sweep optimum)
- WARMDOWN_FRAC 0.72 -> 0.75 (longer peak-LR window)

Dexhunter measured 1.07493 BPB (3-seed mean) applying these against the PR openai#1530 base. Against our 1.07974 SP8192 baseline the expected delta is in the 0.003-0.005 BPB range; the adaptive clip is stack-independent, and the embed-bits + LR tweaks are universal. Fresh branch from upstream/main per PR hygiene (PR openai#1572 untouched).
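For context, a minimal sketch of the sigma clipping these CLIP_SIGMAS knobs control, assuming the convention of clamping weights to mean ± k·std before uniform quantization (helper names, bit widths, and the symmetric quantizer are assumptions, not the submission code):

```python
import torch

def sigma_clip(w: torch.Tensor, clip_sigmas: float) -> torch.Tensor:
    # clamp to mean +/- clip_sigmas * std: a tighter clip narrows the
    # quantization range (finer steps for the bulk of the weights); a
    # looser clip keeps outliers at the cost of coarser steps
    mu, sigma = w.mean(), w.std()
    return w.clamp(mu - clip_sigmas * sigma, mu + clip_sigmas * sigma)

def quantize_symmetric(w: torch.Tensor, bits: int) -> torch.Tensor:
    # uniform symmetric quantization over the clipped range (assumption)
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    return torch.round(w / scale).clamp(-qmax, qmax) * scale

w = torch.randn(2048, 2048)
w_mlp = quantize_symmetric(sigma_clip(w, 12.0), bits=4)  # MLP_CLIP_SIGMAS; bits illustrative
w_emb = quantize_symmetric(sigma_clip(w, 15.0), bits=7)  # EMBED_BITS 8 -> 7
```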
anthony-maio added a commit to anthony-maio/parameter-golf that referenced this pull request on Apr 14, 2026
Seed 42 quantized_ttt 1.25076 vs sliding 1.08292 on the AP-IN-1 pod. Eval time 910s (non-compliant; exceeds the 600s cap).

Root cause: two LoRA application semantics bugs -- mlp_lora has the wrong output dim (hidden_dim vs dim) and the wrong integration point (inner tweak vs residual bypass), and o_lora takes the wrong input tensor.

Session aggregate: 3 consecutive samacqua-derived improvements (VarLen, per-layer GPTQ, LoRA TTT) either regressed or were neutral on our SP8192 + depth recurrence stack. His improvements are co-tuned with his architecture; porting them requires structural rework that would amount to reproducing his submission. PR openai#1572 (1.07974) stays our best.
anthony-maio added a commit to anthony-maio/parameter-golf that referenced this pull request on Apr 14, 2026
Applied the council-recommended fixes: mlp_lora as a (bsz, dim, dim) parallel residual bypass, o_lora taking the pre-attn norm, tighter clip_sigmas on the loop layers, and a pod speed gate. Seed 42 on the fast AP-IN-1 pod (100.6 ms/step, passed the gate):
- sliding_window: 1.08150 (equivalent to pre-fix)
- TTT: 1.25268 (identical regression pattern to the pre-fix 1.25076)
- eval: 882s (still non-compliant)
- artifact: 16.41 MB (now non-compliant; was 15.98 before LOOP_CLIP_SIGMAS=10)

The semantic bugs were real but NOT the root cause; a deeper bug remains in forward_ttt or the LoRA optimizer loop. 4 consecutive Lineage B port attempts have confirmed this lineage does not graft onto our depth recurrence stack. PR openai#1572 (1.07974) remains best. Stopping Lineage B ports.
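To make the integration-point fix concrete, a sketch of the corrected mlp_lora wiring, assuming a standard two-matrix MLP block (module names, sizes, and the ReLU are illustrative; the real forward lives in train_gpt_sota.py):

```python
import torch
import torch.nn as nn

class MLPWithLoRABypass(nn.Module):
    def __init__(self, dim: int = 2048, hidden_dim: int = 8192, r: int = 8):
        super().__init__()
        self.up = nn.Linear(dim, hidden_dim, bias=False)    # frozen W
        self.down = nn.Linear(hidden_dim, dim, bias=False)  # frozen W
        # the fix: a parallel dim -> r -> dim bypass added to the block
        # output, so the adapter's output dim is `dim` (pre-fix it was
        # hidden_dim, spliced inside the MLP as an "inner tweak")
        self.lora_a = nn.Linear(dim, r, bias=False)
        self.lora_b = nn.Linear(r, dim, bias=False)
        nn.init.zeros_(self.lora_b.weight)  # adapter starts as a no-op

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(torch.relu(self.up(x))) + self.lora_b(self.lora_a(x))

y = MLPWithLoRABypass()(torch.randn(4, 2048))
```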
anthony-maio added a commit to anthony-maio/parameter-golf that referenced this pull request on Apr 18, 2026
Per-doc, per-slot LoRA banks indexed by position in the unrolled encoder/decoder schedule, not by physical block index. With NUM_LOOPS=2 the recurrent band (layers 3-5) appears 3x in the schedule; this design gives each call site its own adapter while sharing the frozen W. That is the structural fix that lets Lineage B-style batched LoRA TTT compose with our depth recurrence (slot indexing sketched below).
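A minimal sketch of the slot indexing, assuming the unrolled schedule is just a list of physical layer ids (all names, sizes, and the linear stand-in blocks are hypothetical):

```python
import torch
import torch.nn as nn

loop_start, loop_end, num_loops, n_layers, dim, r = 3, 5, 2, 12, 64, 4

# unrolled schedule: the recurrent band (physical layers 3-5) occupies
# 1 + num_loops = 3 slots per layer; every other layer appears once
schedule = (
    list(range(loop_start))
    + list(range(loop_start, loop_end + 1)) * (1 + num_loops)
    + list(range(loop_end + 1, n_layers))
)

blocks = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_layers))
for p in blocks.parameters():
    p.requires_grad_(False)  # frozen shared W
lora_bank = nn.ModuleList(  # one adapter per *slot*, not per layer
    nn.Sequential(nn.Linear(dim, r, bias=False), nn.Linear(r, dim, bias=False))
    for _ in schedule
)

h = torch.randn(1, dim)
for slot, layer in enumerate(schedule):
    h = blocks[layer](h) + lora_bank[slot](h)  # shared W, per-call-site adapter
```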
Also generalizes TTT parameter selection on top of the existing TTT_SELECTIVE_LAYERS knob (a sketch of how the knobs compose follows the list):
- TTT_LAYER_IDS for explicit layer targeting
- TTT_MODE in {control, qv, qkv, full, none}
- TTT_INCLUDE_GLOBAL adds skip_weights and skip_gates
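A sketch of how these knobs might be read and composed; the mode-to-matrix mapping below is a guess, only the knob and mode names come from the commit message:

```python
import os

# Hypothetical reading of the opt-in env knobs; defaults preserve the
# PR #1572 behavior (no TTT adaptation).
TTT_MODE = os.environ.get("TTT_MODE", "none")        # control|qv|qkv|full|none
TTT_LAYER_IDS = os.environ.get("TTT_LAYER_IDS", "")  # e.g. "3,4,5"
TTT_INCLUDE_GLOBAL = os.environ.get("TTT_INCLUDE_GLOBAL", "0") == "1"

# which weight matrices each mode adapts -- this mapping is an
# assumption for illustration
PARAM_GROUPS = {
    "none": [], "control": ["wq"], "qv": ["wq", "wv"],
    "qkv": ["wq", "wk", "wv"], "full": ["wq", "wk", "wv", "wo", "mlp"],
}
targets = list(PARAM_GROUPS[TTT_MODE])
if TTT_INCLUDE_GLOBAL:
    targets += ["skip_weights", "skip_gates"]
# explicit ids win; otherwise the TTT_SELECTIVE_LAYERS heuristic applies
layer_ids = [int(i) for i in TTT_LAYER_IDS.split(",") if i]
```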
CUDA event phase timing in eval_val_ttt logs scored-fwd, ttt-fwd, backward, all-reduce, and optimizer separately, so we can decide whether further investment goes into eval kernel work or into TTT tuning (timing pattern sketched below).
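The timing pattern is standard torch.cuda.Event bracketing; a self-contained sketch (phase names from the log, phase bodies are placeholders):

```python
import torch

def timed_phases(fns):
    # bracket each named phase with CUDA events; a single sync at the
    # end keeps the events themselves asynchronous
    starts, ends = {}, {}
    for name, fn in fns:
        starts[name] = torch.cuda.Event(enable_timing=True)
        ends[name] = torch.cuda.Event(enable_timing=True)
        starts[name].record()
        fn()
        ends[name].record()
    torch.cuda.synchronize()
    return {name: starts[name].elapsed_time(ends[name]) for name, _ in fns}

x = torch.randn(1024, 1024, device="cuda")
timings = timed_phases([
    ("scored-fwd", lambda: x @ x),  # placeholder for the score forward
    ("ttt-fwd",    lambda: x @ x),  # placeholder for the TTT forward
    ("backward",   lambda: x @ x),  # placeholder for backward/all-reduce/opt
])
print({k: f"{v:.2f} ms" for k, v in timings.items()})
```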
All new behavior is opt-in via env; defaults preserve PR openai#1572 (1.07974 BPB) behavior. Score forward falls back to eager when the LoRA bank is attached, because dynamic per-batch slicing of the bank breaks fullgraph compile -- a known follow-up.
CPU smoke tests in test_scaffolding.py cover shape correctness, the slot guard (perturbing an inactive slot produces zero output diff), gradient flow into active vs inactive slots, and all five TTT_MODE selections (slot-guard sketch below).
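A sketch of the slot-guard check in isolation (the adapter bank here is the hypothetical one from the earlier slot-indexing sketch, not the test_scaffolding.py fixtures):

```python
import torch
import torch.nn as nn

def test_slot_guard():
    # perturbing an adapter whose slot is inactive for the current
    # document must leave the output bit-identical
    dim, r, n_slots, active = 64, 4, 3, 0
    bank = nn.ModuleList(
        nn.Sequential(nn.Linear(dim, r, bias=False), nn.Linear(r, dim, bias=False))
        for _ in range(n_slots)
    )
    x = torch.randn(2, dim)
    y_before = x + bank[active](x)
    with torch.no_grad():  # perturb an INACTIVE slot's weights
        bank[1][0].weight.add_(torch.randn_like(bank[1][0].weight))
    y_after = x + bank[active](x)
    assert torch.equal(y_before, y_after)

test_slot_guard()
```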
Summary
val_bpb = 1.07974 (3-seed mean, std 0.00058) | ~15.99 MB | 8xH100 SXM
Stack
Original Contributions
The train_gpt.py artifact uses an lzma+base85+exec shim -- the full source is LZMA2-compressed and base85-encoded into a two-line exec wrapper. The artifact comes out to 16,793 bytes vs 58,367 raw, which frees ~41 KB of the 16 MB budget for model weights.
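A minimal sketch of how such a shim can be generated (the writer below is a reconstruction; the compression preset is an assumption, and only the two-line wrapper shape comes from the submission):

```python
import base64
import lzma

# Compress the reference script and emit a two-line exec wrapper.
src = open("train_gpt_sota.py", "rb").read()
blob = base64.b85encode(lzma.compress(src, preset=9 | lzma.PRESET_EXTREME))
with open("train_gpt.py", "w") as f:
    f.write("import base64,lzma\n")
    f.write(f"exec(lzma.decompress(base64.b85decode({blob!r})))\n")
```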
The submission also ships a requirements.txt that installs fused-softcap-ce, a pip-packaged CUDA kernel that fuses the softcap * tanh(x / softcap) transform with cross-entropy in a single pass -- 3.63x faster than stock PyTorch on H100 for the forward scoring path. It is forward-only, so it runs during TTT scoring and sliding-window eval, with a graceful fallback to stock PyTorch if the kernel isn't available at eval time.
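The fallback pattern, with an eager reference for the fused op's semantics (the import name and call signature of the package are assumptions; only the math and the fallback behavior come from the description above):

```python
import torch
import torch.nn.functional as F

def softcap_ce_eager(logits: torch.Tensor, targets: torch.Tensor,
                     softcap: float = 30.0) -> torch.Tensor:
    # eager reference: cap logits with softcap * tanh(x / softcap),
    # then standard cross-entropy (softcap value is illustrative)
    return F.cross_entropy(softcap * torch.tanh(logits / softcap), targets)

try:
    # import name/signature are guesses at the pip package's API
    from fused_softcap_ce import fused_softcap_cross_entropy as softcap_ce
except ImportError:
    softcap_ce = softcap_ce_eager  # graceful fallback to stock PyTorch

logits = torch.randn(8, 50257)
targets = torch.randint(0, 50257, (8,))
loss = softcap_ce(logits, targets)
```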
Compliance (Track B -- Score-First TTT)
Per Issue #1017:
Credits
PR #1493 @bigbag -- base config (loop_start=3, loop_end=5, num_loops=2, enable_looping_at=0.35, parallel_residual_start=7)
PR #1394 @clarkkev -- depth recurrence + GPTQ + SDClip pipeline
PR #1420 @abaybektursun -- SP8192 tokenizer integration