
Add 2026-04-23_SP8192_PerLayerClip_UnfrozenTTT_1.0849#1794

Open
Programmerryoki wants to merge 1 commit into openai:main from Programmerryoki:add-2026-04-23-pgolf-ultimate

Conversation

@Programmerryoki

Summary

New submission at records/track_10min_16mb/2026-04-23_SP8192_PerLayerClip_UnfrozenTTT_1.0849/.

3-seed mean: 1.0849 BPB (std 0.00022) on seeds 4 / 30 / 2026. 8xH100 SXM. 600s train + 600s eval. All artifacts <= 16,000,000 bytes. Legal under Issue #1017 Conditions 1-4.

| Seed | Sliding BPB | TTT BPB | Artifact bytes |
| --- | --- | --- | --- |
| 4 | 1.08671 | 1.08463 | 15,961,223 |
| 30 | 1.08721 | 1.08504 | 15,963,763 |
| 2026 | 1.08699 | 1.08498 | 15,963,896 |
| Mean | 1.08697 | 1.08488 | |
| Std | 0.00025 | 0.00022 | |

Merged SOTA (PR #1493 @bigbag): 1.0810 BPB. This submission lands at +0.0039 BPB vs. merged SOTA. It is not a new record; it is contributed for its two new primitives and as a recipe the community can sweep from.

Contributions vs PR #1735 base

  • Per-layer GPTQ clip sigmas: MLP=12.0, attn=13.5, emb=15.0 (vs. the prior uniform 12.85), tuned to each parameter category's outlier distribution.
  • Unfrozen score-first TTT: TTT_FREEZE_BLOCKS=0, TTT_LR=0.010, TTT_EPOCHS=5. All non-embedding blocks adapt, with the LR tuned higher than the usual 0.005.
  • Eval wall-clock budget guard: times score-only and adaptation costs separately (after a 5-chunk warmup) and truncates TTT adaptation at the cosine-LR tail when the projected total would exceed MAX_EVAL_SECONDS=600. Scoring continues for every remaining chunk, preserving legality. The truncation decision is rank-synced via dist.all_reduce(MAX) to keep NCCL collectives in lockstep.
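The budget guard's decision logic can be sketched roughly as below. This is a minimal sketch, not the submission's code: the function name, the linear cost projection, and the chunk bookkeeping are assumptions; only the 600s cap and the `dist.all_reduce(MAX)` rank-sync come from the PR.

```python
import torch
import torch.distributed as dist

MAX_EVAL_SECONDS = 600.0  # hard eval wall-clock cap from the track rules


def should_truncate_ttt(elapsed_s, chunks_done, chunks_total,
                        score_s_per_chunk, adapt_s_per_chunk,
                        warmup_chunks=5):
    """Decide whether to stop TTT adaptation (scoring always continues).

    Per-chunk timing estimates are only trusted after `warmup_chunks`
    chunks, mirroring the 5-chunk warmup described above.
    """
    if chunks_done < warmup_chunks:
        return False  # too few samples to project reliably
    remaining = chunks_total - chunks_done
    # Project total eval time assuming every remaining chunk is both
    # scored and adapted on.
    projected = elapsed_s + remaining * (score_s_per_chunk + adapt_s_per_chunk)
    stop = projected > MAX_EVAL_SECONDS
    # Rank-sync: if ANY rank projects an overrun, every rank stops
    # adapting, keeping NCCL collectives in lockstep.
    if dist.is_available() and dist.is_initialized():
        flag = torch.tensor([int(stop)], dtype=torch.int64)
        dist.all_reduce(flag, op=dist.ReduceOp.MAX)
        stop = bool(flag.item())
    return stop
```

After truncation the loop would keep calling the score path for every remaining chunk and simply skip the adaptation steps.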

Inline citations

Base architecture from PR #1735 (@Grad62304977). SpinQuant (PR #1695 @dexhunter). GPTQ + SDClip (PR #1394 @clarkkev). Attention output gate (PR #1667 @Grad62304977). Score-first TTT pattern (PR #549 @abaybektursun). Full citation list in README.md.
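The per-layer clip sigmas from the contributions above amount to a category-to-sigma lookup applied before GPTQ. The sketch below is one plausible reading under stated assumptions: the name patterns, the 12.85 fallback's role, and clipping at mean ± sigma·std are mine, not lifted from the submission.

```python
import torch

# Per-category clip sigmas from this PR; the substring patterns used to
# classify parameters are hypothetical.
CLIP_SIGMA = {"mlp": 12.0, "attn": 13.5, "emb": 15.0}


def clip_sigma_for(param_name: str) -> float:
    """Pick a clip sigma by substring match on the parameter name."""
    for key, sigma in CLIP_SIGMA.items():
        if key in param_name:
            return sigma
    return 12.85  # the prior uniform value as a fallback


def sigma_clip(w: torch.Tensor, sigma: float) -> torch.Tensor:
    """Clamp weights to mean +/- sigma * std before quantization."""
    m, s = w.mean(), w.std()
    return w.clamp(m - sigma * s, m + sigma * s)
```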

Test plan

  • All 3 seeds completed 600s train + 600s eval without exceeding wall-clock caps (eval:total reported in each log, ranges 553.2s–557.4s)
  • All 3 artifacts under 16,000,000 bytes (max 15,963,896)
  • Score-before-update ordering verified in eval_val_ttt: each val chunk scored under torch.no_grad() before any SGD step touches the weights
  • Hessian calibration + GPTQ quantization run inside the 600s training budget, not eval
  • Per-seed logs included at train_seed{4,30,2026}.log; the quantized_ttt val_bpb: line in each log is the reported BPB
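The score-before-update ordering checked above can be sketched as the loop below. This is a hedged sketch, not eval_val_ttt itself: the (x, y) chunk format, the plain SGD optimizer, and the mean-loss return are assumptions; only the score-under-no_grad-before-any-step ordering and the TTT hyperparameters come from the PR.

```python
import torch


def score_first_ttt(model, val_chunks, loss_fn, ttt_lr=0.010, ttt_epochs=5):
    """Score-first TTT loop sketch.

    Each chunk is scored under torch.no_grad() BEFORE any SGD step
    touches the weights, so the reported loss never sees weights that
    were updated on that same chunk.
    """
    opt = torch.optim.SGD(model.parameters(), lr=ttt_lr)
    losses = []
    for x, y in val_chunks:
        # 1) score first: frozen weights, no gradients
        with torch.no_grad():
            losses.append(loss_fn(model(x), y).item())
        # 2) only then adapt on the chunk just scored
        for _ in range(ttt_epochs):
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return sum(losses) / len(losses)  # mean loss; BPB conversion downstream
```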

3-seed mean 1.0849 BPB (std 0.00022) on seeds 4/30/2026. 8xH100 SXM.
Training 600s + eval 600s, all artifacts <= 16,000,000 bytes.
Legal under Issue openai#1017 Conditions 1-4.

Contributions vs PR openai#1735 base:
- Per-layer GPTQ clip sigmas (MLP=12.0, attn=13.5, emb=15.0)
- Unfrozen score-first TTT (TTT_FREEZE_BLOCKS=0, TTT_LR=0.010, TTT_EPOCHS=5)
- Eval wall-clock budget guard that truncates TTT adaptation at the
  cosine-LR tail when approaching 600s cap; scoring continues for every
  remaining chunk (legality preserved).

Other techniques (cited inline in README): SpinQuant (PR openai#1695),
int7 token embedding, attention output gate (PR openai#1667), score-first
TTT pattern (PR openai#549).
