Record: PR #1738 + PreQuant TTT LR=1e-3 + Unfrozen — val_bpb 1.02840 (3-seed mean)#1758
kilojoules wants to merge 6 commits into `openai:main`
Conversation
…02767 (3-seed mean)

Two-hparam retune of PR openai#1738's pre-quant TTT phase:
- PREQUANT_TTT_LR: 5e-4 -> 1e-3 (TTT was undertrained; epoch-21 loss still descending)
- PREQUANT_TTT_FREEZE_BLOCKS: 2 -> 0 (freezing unnecessary at the 21-epoch budget)

No architecture, tokenizer, or main-training changes. train_gpt.py is PR openai#1738's file with two os.environ.setdefault lines prepended.

3-seed mean 1.02767 vs PR openai#1738's 1.03540 -> delta 0.00773 nats, p ~ 0.005 (t=9.7, df=2). All artifacts under 16 MB; all runs under 600 s train + 600 s eval.
PR openai#1738's packed train_gpt.py crashes on pytorch 2.5.1 with "FlashAttention only supports fp16, bf16, and fp8_e4m3" because q/k/v can arrive as fp32 after torch.compile passes. Replace it with a packed variant that includes the bf16 cast around flash_attn_3_func. Same byte_count category (~25 KB), no WD_TAPER, same functional code path. Matches the binary actually used to produce the train_seed43/44/45 logs. Artifact stays under 16 MB.
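The fix described in this commit amounts to a defensive dtype cast before the FlashAttention-3 call. A minimal sketch of the pattern (the names `q`, `k`, `v`, and `flash_attn_3_func` come from the commit message; the helper and surrounding code are hypothetical and simplified):

```python
import torch

def cast_for_flash_attn(*tensors):
    # FlashAttention-3 rejects fp32 inputs ("FlashAttention only supports
    # fp16, bf16, and fp8_e4m3"); torch.compile passes can leave q/k/v in
    # fp32, so cast those to bf16 and pass supported dtypes through as-is.
    return tuple(
        t.to(torch.bfloat16) if t.dtype == torch.float32 else t
        for t in tensors
    )

# Sketch of the call site:
# q, k, v = cast_for_flash_attn(q, k, v)
# out = flash_attn_3_func(q, k, v)  # never sees fp32 inputs now
```

On a stack where q/k/v already arrive as bf16 (e.g. the pytorch 2.9.1 setup PR #1738 used), the cast is a no-op, which is why it is behaviorally identical there.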
….00025 (p<0.001)

Previous logs came from a functionally equivalent packed variant that differed by 37 bytes (contained dead WD_TAPER code). This commit replaces all three seed logs with runs produced by the exact train_gpt.py committed in this folder (code size: 24,893 bytes in all logs).

New stats:
- seed 43: 1.02846 (15,999,201 bytes)
- seed 44: 1.02812 (15,993,435 bytes)
- seed 45: 1.02861 (15,999,551 bytes)
- mean 1.02840, std 0.00025
- t-test vs 1.03040: t=13.8, df=2, p<0.001
- delta vs PR openai#1738 = 0.00700 nats

Mean shifted up 0.00073 vs the earlier logs (different vast.ai machine, same pytorch 2.5.1+cu124), but std halved, so statistical confidence is stronger.
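The artifact-size margins quoted elsewhere in this PR follow from the per-seed log sizes above; a quick sketch (the 16 MB limit is taken as 16,000,000 bytes, which is the interpretation consistent with the quoted 449-byte worst margin):

```python
LIMIT = 16_000_000  # 16 MB as decimal bytes, matching the quoted margins

log_bytes = {43: 15_999_201, 44: 15_993_435, 45: 15_999_551}
margins = {seed: LIMIT - size for seed, size in log_bytes.items()}

assert all(m > 0 for m in margins.values())  # every artifact fits
print(min(margins.values()), max(margins.values()))  # worst and best margin
```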
…uant TTT; Recurrence Depth Curriculum; Parcae stable loops

- SOTA 1.0810 still holds (Day 12 plateau, longest in competition history)
- PR openai#1758 (1.02840): pre-quant TTT — 6th attempt at illegal pattern, ignore
- PR openai#1756 (1.06505): CaseOps + Recurrence Depth Curriculum (depth 1→3→4); has BOS bug; awaits Issue openai#1604
- PR openai#1755 (1.07462): CaseOps + Legal TTT; awaits Issue openai#1604
- New paper: Parcae (arXiv:2604.12946) — stable looped LMs via spectral-norm constraint on injection params, relevant to Triple Loop stability
- New paper: Gated Attention (arXiv:2505.06708, NeurIPS 2025) — backs PR openai#1667 Attention Output Gate
- Added Session 18 lessons learned; Issue openai#1604 self-deadline Apr 24
- Primary action: implement PR openai#1586+openai#1667 immediately (9 days to deadline)

https://claude.ai/code/session_0151v7YeWWUSnhmcC8U8NGUV
Legality concern (self-disclosed)
This PR builds on PR #1735's pre-quant TTT mechanism, whose legality is contested. I did not catch this before opening the PR; flagging proactively now. The submission as currently written claims a record only if OpenAI staff rules the underlying PR #1735 / #1738 mechanism legal; otherwise it is not a valid record submission. Sorry for the noise and thanks for the review.
Summary
PR #1738 (@alertcat) is undertuned in its pre-quant TTT phase. Evidence: at the default `PREQUANT_TTT_LR=5e-4`, TTT loss is still descending at the final epoch 21 of the 21-epoch TTT cosine schedule; with `PREQUANT_TTT_FREEZE_BLOCKS=2`, two blocks are held fixed during the TTT pass even though the 21-epoch budget on held-out legal tokens leaves no overfitting regime to protect against.

This PR is functionally PR #1738 with two `os.environ.setdefault` lines prepended to `train_gpt.py` that flip two defaults. The packed `train_gpt.py` blob is the PR #1738 code with a small FlashAttention-3 fix applied (a `to(bf16)` cast around the `flash_attn_3_func` call). Without that cast, PR #1738's `train_gpt.py` crashes on pytorch 2.5.1 with `RuntimeError: FlashAttention only supports fp16, bf16, and fp8_e4m3` because `q`/`k`/`v` can reach that call as fp32 after `torch.compile` rewrites; the cast is behaviorally identical on pytorch 2.9.1 (PR #1738's stack). No other code changes.

| Setting | Old default | New default |
| --- | --- | --- |
| `PREQUANT_TTT_LR` | 5e-4 | 1e-3 |
| `PREQUANT_TTT_FREEZE_BLOCKS` | 2 | 0 |

No architecture, tokenizer, main-training, or evaluation changes.
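The two prepended override lines would look like this (variable names and values are from this PR; the rest of the file is PR #1738's `train_gpt.py`, and `setdefault` keeps any explicit environment override intact):

```python
import os

# Flip the two pre-quant TTT defaults before train_gpt.py reads them.
os.environ.setdefault("PREQUANT_TTT_LR", "1e-3")
os.environ.setdefault("PREQUANT_TTT_FREEZE_BLOCKS", "0")
```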
3-seed results (8× H100 80GB SXM)

| Seed | val_bpb | Log size (bytes) |
| --- | --- | --- |
| 43 | 1.02846 | 15,999,201 |
| 44 | 1.02812 | 15,993,435 |
| 45 | 1.02861 | 15,999,551 |

Mean 1.02840, std 0.00025.
Δ vs PR #1738 = 0.00700 nats (vs 0.005 required). One-sided t-test of our samples vs μ₀ = 1.03040: t ≈ 13.8, df = 2, p < 0.001.
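The quoted statistics can be reproduced from the three per-seed values (plain-Python sketch; numbers are from this PR, one-sample t-test against μ₀ = 1.03040 with df = n − 1):

```python
import math

seeds = [1.02846, 1.02812, 1.02861]  # val_bpb for seeds 43/44/45
mu0 = 1.03040  # PR #1738's mean minus the required 0.005 margin

n = len(seeds)
mean = sum(seeds) / n
sd = math.sqrt(sum((x - mean) ** 2 for x in seeds) / (n - 1))  # sample std
t = (mean - mu0) / (sd / math.sqrt(n))  # one-sample t statistic

print(round(mean, 5), round(sd, 5), round(abs(t), 1))  # → 1.0284 0.00025 13.8
```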
All three artifacts are under 16 MB (worst margin 449 bytes, best 6,565 bytes). Logs were produced by running the exact `train_gpt.py` committed in the records folder (code size: 24,893 bytes).

Why
PR #1735/#1738's pre-quant TTT phase was undertuned at defaults: at `PREQUANT_TTT_LR=5e-4`, TTT loss was still descending at the final epoch 21. Doubling the LR and unfreezing both help monotonically; higher LRs (1.5e-3, 2e-3) diverge under the 21-epoch budget.
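For reference, a 21-epoch cosine decay of the kind the TTT schedule is described as using might look like the sketch below (the shape is an assumption; the actual schedule in `train_gpt.py` may differ in warmup, floor, or indexing):

```python
import math

def ttt_cosine_lr(epoch: int, total_epochs: int = 21, base_lr: float = 1e-3) -> float:
    """Hypothetical cosine decay from base_lr to 0 over the TTT budget."""
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))

# With base_lr=1e-3 this starts at 1e-3 and reaches ~0 at epoch 21; a loss
# still descending at epoch 21 means the budget, not the schedule's decay,
# is the binding constraint.
```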
Per-epoch pre-quant TTT val_bpb (y-axis: val_bpb on legal held-out tokens before GPTQ quantization; lower is better), seed 42. At `LR=2.5e-4` the curve flattens well before epoch 21; at `LR=1e-3 freeze=2` it's still descending at the final epoch; `freeze=0` pushes the endpoint further.

Freeze-depth sweep at `LR=1e-3`, seed 42. Y-axis: scored sliding-window val_bpb (stride-64); lower is better. Strictly monotone — less freezing is always better within this range under the 21-epoch TTT budget.

Dependency on PR #1738
This PR is a delta on an open PR (#1738). If #1738 is closed or superseded, this PR will be rebased onto the replacement or withdrawn — it does not claim a record independent of #1738's contribution.
Test plan
- Three seeds run with the exact committed `train_gpt.py`
- `p < 0.01` vs PR #1738 ("Record: PR #1735 + CaseOps Tokenizer V15", val_bpb 1.03540, mean of 3 seeds) − 0.005 = 1.03040 (actually p < 0.001)
- Note: these runs used pytorch 2.5.1+cu124 on vast.ai; PR #1738 reported on 2.9.1+cu128. A single-seed reproduction of PR #1738 defaults on this stack lands at 1.03612, i.e. 0.0007 above PR #1738's claim — the stack drift is an order of magnitude smaller than the improvement reported here.
Attribution
PR #1738 (@alertcat) base, PR #1735 (@AjAnubolu) parallel pre-quant TTT, PR #1729 (@romeerp) CaseOps tokenizer, PR #1493 (@bigbag) QK-Gain 5.25, PR #1412 (@Robby955) parallel residuals, PR #1331 (@dexhunter) depth recurrence, PR #1394 (@clarkkev) SP8192 + GPTQ SDClip.