
Record: PR #1738 + PreQuant TTT LR=1e-3 + Unfrozen — val_bpb 1.02840 (3-seed mean) #1758

Open
kilojoules wants to merge 6 commits into openai:main from kilojoules:submission/sp8192-prequant-ttt-unfrozen

Conversation


@kilojoules kilojoules commented Apr 21, 2026

Summary

PR #1738 (@alertcat) is undertuned in its pre-quant TTT phase. Evidence: at the default PREQUANT_TTT_LR=5e-4, TTT loss is still descending at the final epoch 21 of the 21-epoch TTT cosine schedule; with PREQUANT_TTT_FREEZE_BLOCKS=2, two blocks are held fixed during the TTT pass even though the 21-epoch budget on held-out legal tokens leaves no overfitting regime to protect against.

This PR is functionally PR #1738 with two os.environ.setdefault lines prepended to train_gpt.py that flip two defaults. The packed train_gpt.py blob is the PR #1738 code with a small FlashAttention-3 fix applied (a to(bf16) cast around the flash_attn_3_func call). Without that cast, PR #1738's train_gpt.py crashes on pytorch 2.5.1 with RuntimeError: FlashAttention only supports fp16, bf16, and fp8_e4m3 because q/k/v can reach that call as fp32 after torch.compile rewrites; the cast is behaviorally identical on pytorch 2.9.1 (PR #1738's stack). No other code changes.
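The cast pattern itself is small. Since the actual train_gpt.py blob is not reproduced here, the sketch below uses a stdlib-only mock (no torch; all names besides `flash_attn_3_func` are hypothetical) to illustrate the idea: force bf16 at the call site so fp32 tensors produced upstream by torch.compile rewrites never reach the kernel's dtype check.

```python
from dataclasses import dataclass

@dataclass
class FakeTensor:
    """Stand-in for a tensor; only dtype matters for this sketch."""
    dtype: str
    def to(self, dtype):
        return FakeTensor(dtype)

def flash_attn_3_func(q, k, v):
    # Mock of the real kernel's dtype check that raises on fp32 input.
    if q.dtype not in ("fp16", "bf16", "fp8_e4m3"):
        raise RuntimeError("FlashAttention only supports fp16, bf16, and fp8_e4m3")
    return FakeTensor(q.dtype)

def attention(q, k, v):
    # The fix: cast q/k/v to bf16 immediately before the call, so the
    # guarantee holds regardless of what torch.compile does upstream.
    q, k, v = (t.to("bf16") for t in (q, k, v))
    return flash_attn_3_func(q, k, v)

out = attention(*(FakeTensor("fp32") for _ in range(3)))
```

On a stack where q/k/v already arrive as bf16, the extra `.to` is a no-op, which is why the cast is behaviorally identical on pytorch 2.9.1.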

| env var | PR #1738 default | this PR |
| --- | --- | --- |
| `PREQUANT_TTT_LR` | 5e-4 | 1e-3 |
| `PREQUANT_TTT_FREEZE_BLOCKS` | 2 | 0 |

No architecture, tokenizer, main-training, or evaluation changes.
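Concretely, the prepended override follows this pattern (a sketch; how the real train_gpt.py parses these knobs may differ):

```python
import os

# Prepended before any training code reads these knobs. setdefault keeps an
# explicit environment override (e.g. for a sweep) authoritative while
# flipping the defaults for plain invocations.
os.environ.setdefault("PREQUANT_TTT_LR", "1e-3")
os.environ.setdefault("PREQUANT_TTT_FREEZE_BLOCKS", "0")

ttt_lr = float(os.environ["PREQUANT_TTT_LR"])
freeze_blocks = int(os.environ["PREQUANT_TTT_FREEZE_BLOCKS"])
```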

3-seed results (8× H100 80GB SXM)

| Seed | val_bpb (sliding) | artifact size (bytes) |
| --- | --- | --- |
| 43 | 1.02846 | 15,999,201 |
| 44 | 1.02812 | 15,993,435 |
| 45 | 1.02861 | 15,999,551 |
| mean / std | 1.02840 / 0.00025 | |

Δ vs PR #1738's 1.03540 = 0.00700 nats (0.005 required for a record). One-sided t-test of our samples against the record threshold μ₀ = 1.03540 − 0.005 = 1.03040: t ≈ 13.8, df = 2, p < 0.001.
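The statistic is easy to check from the per-seed numbers above (stdlib only; scipy's `stats.ttest_1samp` would give the exact p-value):

```python
import math

vals = [1.02846, 1.02812, 1.02861]  # seeds 43 / 44 / 45
mu0 = 1.03040                       # threshold being tested against

n = len(vals)
mean = sum(vals) / n
# Sample standard deviation (Bessel-corrected, ddof=1).
s = math.sqrt(sum((v - mean) ** 2 for v in vals) / (n - 1))
# One-sided test: is the mean significantly below mu0?
t = (mu0 - mean) / (s / math.sqrt(n))

print(round(mean, 5), round(s, 5), round(t, 1))  # 1.0284 0.00025 13.8
```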

All three artifacts are under 16 MB (worst margin 449 bytes, best 6,565 bytes). Logs were produced by running the exact train_gpt.py committed in the records folder (Code size: 24,893 bytes).

Why

PR #1735/#1738's pre-quant TTT phase was undertuned at defaults:

  1. At PREQUANT_TTT_LR=5e-4, TTT loss was still descending at final epoch 21.
  2. Freezing the first 2 blocks reduces adaptation capacity with no overfitting regime to protect against at 21 epochs on held-out legal tokens.

Doubling LR and unfreezing both help monotonically; higher LRs (1.5e-3, 2e-3) diverge under the 21-epoch budget.

TTT convergence

Per-epoch pre-quant TTT val_bpb (y-axis: val_bpb on legal held-out tokens before GPTQ quantization; lower is better), seed 42. At LR=2.5e-4 the curve flattens well before epoch 21; at LR=1e-3 freeze=2 it's still descending at the final epoch; freeze=0 pushes the endpoint further.

Freeze-depth sweep

Freeze-depth sweep at LR=1e-3, seed 42. Y-axis: scored sliding-window val_bpb (stride-64); lower is better. Strictly monotone — less freezing is always better within this range under the 21-epoch TTT budget.
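For reference, a freeze-depth knob of this kind typically just gates `requires_grad` on the first N blocks before the TTT optimizer is built. A stdlib-only sketch (toy stand-in classes, hypothetical names; not PR #1738's actual code):

```python
import os

class Param:
    def __init__(self):
        self.requires_grad = True

class Block:
    def __init__(self):
        self.params = [Param() for _ in range(4)]

blocks = [Block() for _ in range(12)]  # toy 12-block model

# FREEZE_BLOCKS=0 (this PR) leaves every block trainable during TTT;
# FREEZE_BLOCKS=2 (PR #1738's default) would exclude the first two blocks
# from the TTT optimizer entirely.
n_freeze = int(os.environ.get("PREQUANT_TTT_FREEZE_BLOCKS", "0"))
for i, block in enumerate(blocks):
    for p in block.params:
        p.requires_grad = i >= n_freeze

trainable = [p for b in blocks for p in b.params if p.requires_grad]
```

With the default of 0 here, all 48 toy parameters stay trainable; the sweep above says that within 0–2 frozen blocks, more trainable parameters is strictly better under the 21-epoch budget.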

Dependency on PR #1738

This PR is a delta on an open PR (#1738). If #1738 is closed or superseded, this PR will be rebased onto the replacement or withdrawn — it does not claim a record independent of #1738's contribution.

Test plan

Note: these runs used pytorch 2.5.1+cu124 on vast.ai. PR #1738 reported on 2.9.1+cu128. A single-seed reproduction of PR #1738 defaults on this stack lands at 1.03612, i.e. 0.0007 above PR #1738's claim — the stack drift is an order of magnitude smaller than the improvement reported here.

Attribution

PR #1738 (@alertcat) base, PR #1735 (@AjAnubolu) parallel pre-quant TTT, PR #1729 (@romeerp) CaseOps tokenizer, PR #1493 (@bigbag) QK-Gain 5.25, PR #1412 (@Robby955) parallel residuals, PR #1331 (@dexhunter) depth recurrence, PR #1394 (@clarkkev) SP8192 + GPTQ SDClip.

…02767 (3-seed mean)

Two-hparam retune of PR openai#1738's pre-quant TTT phase:
- PREQUANT_TTT_LR: 5e-4 -> 1e-3 (TTT was undertrained; epoch 21 loss still descending)
- PREQUANT_TTT_FREEZE_BLOCKS: 2 -> 0 (freezing unnecessary at 21-epoch budget)

No architecture, tokenizer, or main-training changes. train_gpt.py is PR openai#1738's
file with two os.environ.setdefault lines prepended.

3-seed mean 1.02767 vs PR openai#1738 1.03540 -> Delta 0.00773 nats, p ~ 0.005 (t=9.7, df=2).
All artifacts under 16 MB, all runs under 600 s train + 600 s eval.
PR openai#1738's packed train_gpt.py crashes on pytorch 2.5.1 with 'FlashAttention
only supports fp16, bf16, and fp8_e4m3' because q/k/v can arrive as fp32 after
torch.compile passes. Replace with packed variant that includes the bf16 cast
around flash_attn_3_func. Same byte_count category (~25 KB), no WD_TAPER,
same functional code path. Matches the binary actually used to produce the
train_seed43/44/45 logs. Artifact stays under 16 MB.
….00025 (p<0.001)

Previous logs came from a functionally equivalent packed variant that differed
by 37 bytes (contained dead WD_TAPER code). This commit replaces all three
seed logs with runs produced by the exact train_gpt.py committed in this folder
(Code size: 24,893 bytes in all logs). New stats:

- seed 43: 1.02846 (15,999,201 bytes)
- seed 44: 1.02812 (15,993,435 bytes)
- seed 45: 1.02861 (15,999,551 bytes)
- mean 1.02840, std 0.00025
- t-test vs 1.03040: t=13.8, df=2, p<0.001
- Delta vs PR openai#1738 = 0.00700 nats

Mean shifted up 0.00073 vs earlier logs (different vast.ai machine, same
pytorch 2.5.1+cu124) but std halved, so statistical confidence is stronger.
@kilojoules kilojoules changed the title Record: PR #1738 + PreQuant TTT LR=1e-3 + Unfrozen — val_bpb 1.02767 (3-seed mean) Record: PR #1738 + PreQuant TTT LR=1e-3 + Unfrozen — val_bpb 1.02840 (3-seed mean) Apr 21, 2026
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 21, 2026
…uant TTT; Recurrence Depth Curriculum; Parcae stable loops

- SOTA 1.0810 still holds (Day 12 plateau, longest in competition history)
- PR openai#1758 (1.02840): pre-quant TTT — 6th attempt at illegal pattern, ignore
- PR openai#1756 (1.06505): CaseOps + Recurrence Depth Curriculum (depth 1→3→4); has BOS bug; awaits Issue openai#1604
- PR openai#1755 (1.07462): CaseOps + Legal TTT; awaits Issue openai#1604
- New paper: Parcae (arXiv:2604.12946) — stable looped LMs via spectral norm constraint on injection params, relevant to Triple Loop stability
- New paper: Gated Attention (arXiv:2505.06708, NeurIPS 2025) — backs PR openai#1667 Attention Output Gate
- Added Session 18 lessons learned; Issue openai#1604 self-deadline Apr 24
- Primary action: implement PR openai#1586+openai#1667 immediately (9 days to deadline)

https://claude.ai/code/session_0151v7YeWWUSnhmcC8U8NGUV
@kilojoules
Author

Legality concern (self-disclosed)

This PR builds on PR #1735's pre_quant_adamw_ttt mechanism. On 2026-04-21 I read PR #1735's review thread where @dexhunter flagged this mechanism as potentially violating Issue #1017 Condition 3 (score-before-update) / Track B (no TTT on val tokens that will be scored), and the PR #1735 author @AjAnubolu indicated he would revise to score-first per-chunk if the mechanism is deemed illegal. Our PR inherits the exact same mechanism — we only retuned two hyperparameters (PREQUANT_TTT_LR 5e-4 → 1e-3, PREQUANT_TTT_FREEZE_BLOCKS 2 → 0).

I did not catch this before opening the PR; flagging proactively now.

If OpenAI staff rules pre_quant_adamw_ttt (as implemented in PR #1735 / #1738) illegal under Track B, we acknowledge this PR is voided by the same ruling. In that case please reclassify under track_non_record_16mb, or close and we will resubmit either as a non-record submission or with a score-first legal rewrite.

The submission as currently written claims a record only if the underlying PR #1735 / #1738 mechanism is accepted as legal; otherwise it is not a valid record submission. Sorry for the noise and thanks for the review.
