Record: PR #1738 + PreQuant TTT LR=1e-3 + Unfrozen — val_bpb 1.02840 (3-seed mean)#1758
kilojoules wants to merge 6 commits into `openai:main`
Conversation
…02767 (3-seed mean)

Two-hparam retune of PR openai#1738's pre-quant TTT phase:
- PREQUANT_TTT_LR: 5e-4 -> 1e-3 (TTT was undertrained; epoch-21 loss still descending)
- PREQUANT_TTT_FREEZE_BLOCKS: 2 -> 0 (freezing unnecessary at the 21-epoch budget)

No architecture, tokenizer, or main-training changes. train_gpt.py is PR openai#1738's file with two os.environ.setdefault lines prepended.

3-seed mean 1.02767 vs PR openai#1738's 1.03540 -> delta 0.00773 nats, p ~ 0.005 (t=9.7, df=2). All artifacts under 16 MB; all runs under 600 s train + 600 s eval.
PR openai#1738's packed train_gpt.py crashes on pytorch 2.5.1 with "FlashAttention only supports fp16, bf16, and fp8_e4m3" because q/k/v can arrive as fp32 after torch.compile passes. Replace it with a packed variant that includes the bf16 cast around flash_attn_3_func. Same byte_count category (~25 KB), no WD_TAPER, same functional code path. Matches the binary actually used to produce the train_seed43/44/45 logs. Artifact stays under 16 MB.
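The fix described in this commit amounts to a defensive dtype cast before the FlashAttention-3 call. A minimal sketch of the pattern (the names `q`, `k`, `v`, and `flash_attn_3_func` come from the commit message; the helper and surrounding code are hypothetical and simplified):

```python
import torch

def cast_for_flash_attn(*tensors):
    # FlashAttention-3 rejects fp32 inputs ("FlashAttention only supports
    # fp16, bf16, and fp8_e4m3"); torch.compile passes can leave q/k/v in
    # fp32, so cast those to bf16 and pass supported dtypes through as-is.
    return tuple(
        t.to(torch.bfloat16) if t.dtype == torch.float32 else t
        for t in tensors
    )

# Sketch of the call site:
# q, k, v = cast_for_flash_attn(q, k, v)
# out = flash_attn_3_func(q, k, v)  # never sees fp32 inputs now
```

On a stack where q/k/v already arrive as bf16 (e.g. the pytorch 2.9.1 setup PR #1738 used), the cast is a no-op, which is why it is behaviorally identical there.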
….00025 (p<0.001)

Previous logs came from a functionally equivalent packed variant that differed by 37 bytes (contained dead WD_TAPER code). This commit replaces all three seed logs with runs produced by the exact train_gpt.py committed in this folder (code size: 24,893 bytes in all logs).

New stats:
- seed 43: 1.02846 (15,999,201 bytes)
- seed 44: 1.02812 (15,993,435 bytes)
- seed 45: 1.02861 (15,999,551 bytes)
- mean 1.02840, std 0.00025
- t-test vs 1.03040: t=13.8, df=2, p<0.001
- delta vs PR openai#1738 = 0.00700 nats

Mean shifted up 0.00073 vs the earlier logs (different vast.ai machine, same pytorch 2.5.1+cu124), but std halved, so statistical confidence is stronger.
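The artifact-size margins quoted elsewhere in this PR follow from the per-seed log sizes above; a quick sketch (the 16 MB limit is taken as 16,000,000 bytes, which is the interpretation consistent with the quoted 449-byte worst margin):

```python
LIMIT = 16_000_000  # 16 MB as decimal bytes, matching the quoted margins

log_bytes = {43: 15_999_201, 44: 15_993_435, 45: 15_999_551}
margins = {seed: LIMIT - size for seed, size in log_bytes.items()}

assert all(m > 0 for m in margins.values())  # every artifact fits
print(min(margins.values()), max(margins.values()))  # worst and best margin
```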
…uant TTT; Recurrence Depth Curriculum; Parcae stable loops

- SOTA 1.0810 still holds (Day 12 plateau, longest in competition history)
- PR openai#1758 (1.02840): pre-quant TTT — 6th attempt at illegal pattern, ignore
- PR openai#1756 (1.06505): CaseOps + Recurrence Depth Curriculum (depth 1→3→4); has BOS bug; awaits Issue openai#1604
- PR openai#1755 (1.07462): CaseOps + Legal TTT; awaits Issue openai#1604
- New paper: Parcae (arXiv:2604.12946) — stable looped LMs via spectral-norm constraint on injection params, relevant to Triple Loop stability
- New paper: Gated Attention (arXiv:2505.06708, NeurIPS 2025) — backs PR openai#1667 Attention Output Gate
- Added Session 18 lessons learned; Issue openai#1604 self-deadline Apr 24
- Primary action: implement PR openai#1586+openai#1667 immediately (9 days to deadline)

https://claude.ai/code/session_0151v7YeWWUSnhmcC8U8NGUV
Legality concern (self-disclosed)
This PR builds on PR #1735's pre-quant TTT mechanism, whose legality is contested. I did not catch this before opening the PR; flagging proactively now. The submission as currently written claims a record only if OpenAI staff rules the underlying PR #1735 / #1738 mechanism legal; otherwise it is not a valid record submission. Sorry for the noise and thanks for the review.
Summary
PR #1738 (@alertcat) is undertuned in its pre-quant TTT phase. Evidence: at the default `PREQUANT_TTT_LR=5e-4`, TTT loss is still descending at the final epoch 21 of the 21-epoch TTT cosine schedule; with `PREQUANT_TTT_FREEZE_BLOCKS=2`, two blocks are held fixed during the TTT pass even though the 21-epoch budget on held-out legal tokens leaves no overfitting regime to protect against.

This PR is functionally PR #1738 with two `os.environ.setdefault` lines prepended to `train_gpt.py` that flip two defaults. The packed `train_gpt.py` blob is the PR #1738 code with a small FlashAttention-3 fix applied (a `to(bf16)` cast around the `flash_attn_3_func` call). Without that cast, PR #1738's `train_gpt.py` crashes on pytorch 2.5.1 with `RuntimeError: FlashAttention only supports fp16, bf16, and fp8_e4m3` because `q`/`k`/`v` can reach that call as fp32 after `torch.compile` rewrites; the cast is behaviorally identical on pytorch 2.9.1 (PR #1738's stack). No other code changes.

| Setting | Old default | New default |
| --- | --- | --- |
| `PREQUANT_TTT_LR` | 5e-4 | 1e-3 |
| `PREQUANT_TTT_FREEZE_BLOCKS` | 2 | 0 |

No architecture, tokenizer, main-training, or evaluation changes.
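The two prepended override lines would look like this (variable names and values are from this PR; the rest of the file is PR #1738's `train_gpt.py`, and `setdefault` keeps any explicit environment override intact):

```python
import os

# Flip the two pre-quant TTT defaults before train_gpt.py reads them.
os.environ.setdefault("PREQUANT_TTT_LR", "1e-3")
os.environ.setdefault("PREQUANT_TTT_FREEZE_BLOCKS", "0")
```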
3-seed results (8× H100 80GB SXM)

| Seed | val_bpb | Log size (bytes) |
| --- | --- | --- |
| 43 | 1.02846 | 15,999,201 |
| 44 | 1.02812 | 15,993,435 |
| 45 | 1.02861 | 15,999,551 |

Mean 1.02840, std 0.00025.
Δ vs PR #1738 = 0.00700 nats (vs 0.005 required). One-sided t-test of our samples vs μ₀ = 1.03040: t ≈ 13.8, df = 2, p < 0.001.
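The quoted statistics can be reproduced from the three per-seed values (plain-Python sketch; numbers are from this PR, one-sample t-test against μ₀ = 1.03040 with df = n − 1):

```python
import math

seeds = [1.02846, 1.02812, 1.02861]  # val_bpb for seeds 43/44/45
mu0 = 1.03040  # PR #1738's mean minus the required 0.005 margin

n = len(seeds)
mean = sum(seeds) / n
sd = math.sqrt(sum((x - mean) ** 2 for x in seeds) / (n - 1))  # sample std
t = (mean - mu0) / (sd / math.sqrt(n))  # one-sample t statistic

print(round(mean, 5), round(sd, 5), round(abs(t), 1))  # → 1.0284 0.00025 13.8
```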
All three artifacts are under 16 MB (worst margin 449 bytes, best 6,565 bytes). Logs were produced by running the exact `train_gpt.py` committed in the records folder (code size: 24,893 bytes).

Why
PR #1735/#1738's pre-quant TTT phase was undertuned at defaults: at `PREQUANT_TTT_LR=5e-4`, TTT loss was still descending at the final epoch 21. Doubling the LR and unfreezing both help monotonically; higher LRs (1.5e-3, 2e-3) diverge under the 21-epoch budget.
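For reference, a 21-epoch cosine decay of the kind the TTT schedule is described as using might look like the sketch below (the shape is an assumption; the actual schedule in `train_gpt.py` may differ in warmup, floor, or indexing):

```python
import math

def ttt_cosine_lr(epoch: int, total_epochs: int = 21, base_lr: float = 1e-3) -> float:
    """Hypothetical cosine decay from base_lr to 0 over the TTT budget."""
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))

# With base_lr=1e-3 this starts at 1e-3 and reaches ~0 at epoch 21; a loss
# still descending at epoch 21 means the budget, not the schedule's decay,
# is the binding constraint.
```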
Per-epoch pre-quant TTT val_bpb (y-axis: val_bpb on legal held-out tokens before GPTQ quantization; lower is better), seed 42. At `LR=2.5e-4` the curve flattens well before epoch 21; at `LR=1e-3 freeze=2` it's still descending at the final epoch; `freeze=0` pushes the endpoint further.

Freeze-depth sweep at `LR=1e-3`, seed 42. Y-axis: scored sliding-window val_bpb (stride-64); lower is better. Strictly monotone — less freezing is always better within this range under the 21-epoch TTT budget.

Dependency on PR #1738
This PR is a delta on an open PR (#1738). If #1738 is closed or superseded, this PR will be rebased onto the replacement or withdrawn — it does not claim a record independent of #1738's contribution.
Test plan
- Three seeds run with the exact committed `train_gpt.py`
- `p < 0.01` vs PR #1738 ("Record: PR #1735 + CaseOps Tokenizer V15", val_bpb 1.03540, mean of 3 seeds) − 0.005 = 1.03040 (actually p < 0.001)
- Note: these runs used pytorch 2.5.1+cu124 on vast.ai; PR #1738 reported on 2.9.1+cu128. A single-seed reproduction of PR #1738 defaults on this stack lands at 1.03612, i.e. 0.0007 above PR #1738's claim — the stack drift is an order of magnitude smaller than the improvement reported here.
Attribution
PR #1738 (@alertcat) base, PR #1735 (@AjAnubolu) parallel pre-quant TTT, PR #1729 (@romeerp) CaseOps tokenizer, PR #1493 (@bigbag) QK-Gain 5.25, PR #1412 (@Robby955) parallel residuals, PR #1331 (@dexhunter) depth recurrence, PR #1394 (@clarkkev) SP8192 + GPTQ SDClip.