
Add 2026-04-23_SP8192_PerLayerClip_UnfrozenTTT_1.0849#1794

Open
Programmerryoki wants to merge 1 commit into openai:main from Programmerryoki:add-2026-04-23-pgolf-ultimate

Conversation

@Programmerryoki

Summary

New submission at records/track_10min_16mb/2026-04-23_SP8192_PerLayerClip_UnfrozenTTT_1.0849/.

3-seed mean: 1.0849 BPB (std 0.00022) on seeds 4 / 30 / 2026. 8xH100 SXM. 600s train + 600s eval. All artifacts <= 16,000,000 bytes. Legal under Issue #1017 Conditions 1-4.

| Seed | Sliding BPB | TTT BPB | Artifact bytes |
| --- | --- | --- | --- |
| 4 | 1.08671 | 1.08463 | 15,961,223 |
| 30 | 1.08721 | 1.08504 | 15,963,763 |
| 2026 | 1.08699 | 1.08498 | 15,963,896 |
| Mean | 1.08697 | 1.08488 | |
| Std | 0.00025 | 0.00022 | |

Merged SOTA (PR #1493 @bigbag): 1.0810 BPB. This submission lands at +0.0039 BPB vs. merged SOTA. It is not a new record; it is contributed for its two new primitives and as a recipe the community can sweep from.

Contributions vs PR #1735 base

  • Per-layer GPTQ clip sigmas: MLP=12.0, attn=13.5, emb=15.0 (vs. the prior uniform 12.85), tuned to each parameter category's outlier distribution.
  • Unfrozen score-first TTT: TTT_FREEZE_BLOCKS=0, TTT_LR=0.010, TTT_EPOCHS=5. All non-embedding blocks adapt, with the LR tuned higher than the usual 0.005.
  • Eval wall-clock budget guard: times score-only and adaptation costs separately (after a 5-chunk warmup) and truncates TTT adaptation at the cosine-LR tail when the projected total would exceed MAX_EVAL_SECONDS=600. Scoring continues for every remaining chunk, preserving legality. The truncation decision is rank-synced via dist.all_reduce(MAX) to keep NCCL collectives in lockstep.
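The budget guard's decision logic can be sketched roughly as below. This is a minimal sketch, not the submission's code: the function name, the linear cost projection, and the chunk bookkeeping are assumptions; only the 600s cap and the `dist.all_reduce(MAX)` rank-sync come from the PR.

```python
import torch
import torch.distributed as dist

MAX_EVAL_SECONDS = 600.0  # hard eval wall-clock cap from the track rules


def should_truncate_ttt(elapsed_s, chunks_done, chunks_total,
                        score_s_per_chunk, adapt_s_per_chunk,
                        warmup_chunks=5):
    """Decide whether to stop TTT adaptation (scoring always continues).

    Per-chunk timing estimates are only trusted after `warmup_chunks`
    chunks, mirroring the 5-chunk warmup described above.
    """
    if chunks_done < warmup_chunks:
        return False  # too few samples to project reliably
    remaining = chunks_total - chunks_done
    # Project total eval time assuming every remaining chunk is both
    # scored and adapted on.
    projected = elapsed_s + remaining * (score_s_per_chunk + adapt_s_per_chunk)
    stop = projected > MAX_EVAL_SECONDS
    # Rank-sync: if ANY rank projects an overrun, every rank stops
    # adapting, keeping NCCL collectives in lockstep.
    if dist.is_available() and dist.is_initialized():
        flag = torch.tensor([int(stop)], dtype=torch.int64)
        dist.all_reduce(flag, op=dist.ReduceOp.MAX)
        stop = bool(flag.item())
    return stop
```

After truncation the loop would keep calling the score path for every remaining chunk and simply skip the adaptation steps.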

Inline citations

Base architecture from PR #1735 (@Grad62304977). SpinQuant (PR #1695 @dexhunter). GPTQ + SDClip (PR #1394 @clarkkev). Attention output gate (PR #1667 @Grad62304977). Score-first TTT pattern (PR #549 @abaybektursun). Full citation list in README.md.
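The per-layer clip sigmas from the contributions above amount to a category-to-sigma lookup applied before GPTQ. The sketch below is one plausible reading under stated assumptions: the name patterns, the 12.85 fallback's role, and clipping at mean ± sigma·std are mine, not lifted from the submission.

```python
import torch

# Per-category clip sigmas from this PR; the substring patterns used to
# classify parameters are hypothetical.
CLIP_SIGMA = {"mlp": 12.0, "attn": 13.5, "emb": 15.0}


def clip_sigma_for(param_name: str) -> float:
    """Pick a clip sigma by substring match on the parameter name."""
    for key, sigma in CLIP_SIGMA.items():
        if key in param_name:
            return sigma
    return 12.85  # the prior uniform value as a fallback


def sigma_clip(w: torch.Tensor, sigma: float) -> torch.Tensor:
    """Clamp weights to mean +/- sigma * std before quantization."""
    m, s = w.mean(), w.std()
    return w.clamp(m - sigma * s, m + sigma * s)
```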

Test plan

  • All 3 seeds completed 600s train + 600s eval without exceeding wall-clock caps (eval:total reported in each log, ranges 553.2s–557.4s)
  • All 3 artifacts under 16,000,000 bytes (max 15,963,896)
  • Score-before-update ordering verified in eval_val_ttt: each val chunk scored under torch.no_grad() before any SGD step touches the weights
  • Hessian calibration + GPTQ quantization run inside the 600s training budget, not eval
  • Per-seed logs included at train_seed{4,30,2026}.log; the quantized_ttt val_bpb: line in each log is the reported BPB
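The score-before-update ordering checked above can be sketched as the loop below. This is a hedged sketch, not eval_val_ttt itself: the (x, y) chunk format, the plain SGD optimizer, and the mean-loss return are assumptions; only the score-under-no_grad-before-any-step ordering and the TTT hyperparameters come from the PR.

```python
import torch


def score_first_ttt(model, val_chunks, loss_fn, ttt_lr=0.010, ttt_epochs=5):
    """Score-first TTT loop sketch.

    Each chunk is scored under torch.no_grad() BEFORE any SGD step
    touches the weights, so the reported loss never sees weights that
    were updated on that same chunk.
    """
    opt = torch.optim.SGD(model.parameters(), lr=ttt_lr)
    losses = []
    for x, y in val_chunks:
        # 1) score first: frozen weights, no gradients
        with torch.no_grad():
            losses.append(loss_fn(model(x), y).item())
        # 2) only then adapt on the chunk just scored
        for _ in range(ttt_epochs):
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return sum(losses) / len(losses)  # mean loss; BPB conversion downstream
```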

3-seed mean 1.0849 BPB (std 0.00022) on seeds 4/30/2026. 8xH100 SXM.
Training 600s + eval 600s, all artifacts <= 16,000,000 bytes.
Legal under Issue openai#1017 Conditions 1-4.

Contributions vs PR openai#1735 base:
- Per-layer GPTQ clip sigmas (MLP=12.0, attn=13.5, emb=15.0)
- Unfrozen score-first TTT (TTT_FREEZE_BLOCKS=0, TTT_LR=0.010, TTT_EPOCHS=5)
- Eval wall-clock budget guard that truncates TTT adaptation at the
  cosine-LR tail when approaching 600s cap; scoring continues for every
  remaining chunk (legality preserved).

Other techniques (cited inline in README): SpinQuant (PR openai#1695),
int7 token embedding, attention output gate (PR openai#1667), score-first
TTT pattern (PR openai#549).
