Record: SP8192 PR #1874 + TTT_CHUNK_SIZE=32 — val_bpb 1.06990 (3-seed mean)#1920

Closed
bigbag wants to merge 2 commits into openai:main from bigbag:submission/pr1874-chunk32

Conversation


@bigbag bigbag commented Apr 29, 2026

Summary

val_bpb = 1.06990 (3-seed mean, std 0.00025) | val_loss = 2.76367 nats/token | ~15.95 MB | 8×H100 SXM, 600s train / 600s eval

Runs PR #1874 (AjAnubolu) verbatim with TTT_CHUNK_SIZE=32 (default 48). Smaller TTT chunks give more gradient updates per document during phased LoRA-TTT evaluation, yielding a −0.0003 BPB improvement over the default chunk size.

No CaseOps, no casefold, no PPM — standard SP8192 UTF-8 byte counting throughout.
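To make the chunk-size effect concrete, here is a hedged sketch of why a smaller TTT chunk yields more gradient updates per document. The helper name `num_ttt_updates` is illustrative only and does not appear in `train_gpt.py`; it just counts the chunks a document is split into, assuming one LoRA-TTT update per chunk.

```python
def num_ttt_updates(doc_len_tokens: int, chunk_size: int) -> int:
    """Illustrative: one TTT gradient update per chunk (ceiling division)."""
    return (doc_len_tokens + chunk_size - 1) // chunk_size

# A 960-token document gets 20 updates at the default chunk 48,
# but 30 updates at chunk 32 — ~50% more adaptation steps.
print(num_ttt_updates(960, 48))  # 20
print(num_ttt_updates(960, 32))  # 30
```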

3-Seed Results (8×H100 80GB SXM, phased TTT, 10-min train / 10-min eval budgets)

| Seed | Steps | Pre-quant post-EMA | Quantized | Post-TTT | Artifact (bytes) | Train time | Eval time |
|------|-------|--------------------|-----------|----------|------------------|------------|-----------|
| 1337 | 4842  | 1.07132 | 1.08122 | 1.06985 | 15,943,571 | 596.06s | 595.5s |
| 42   | 4842  | 1.07169 | 1.08151 | 1.07017 | 15,950,196 | 596.12s | 560.8s |
| 2025 | 4839  | 1.07134 | 1.08107 | 1.06968 | 15,946,736 | 596.11s | 556.1s |
| Mean |       | 1.07145 | 1.08127 | 1.06990 | 15,946,834 | 596.10s | 570.8s |
| Std  |       | 0.00021 | 0.00022 | 0.00025 | 3,314      | 0.03s   | 21.5s  |

All three seeds clear the size, train-time, and eval-time budgets.
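The summary row can be re-derived from the per-seed post-TTT values; a quick check (sample standard deviation, ddof=1, matching the table):

```python
import statistics

# Per-seed post-TTT val_bpb from the table above.
post_ttt = [1.06985, 1.07017, 1.06968]

mean = statistics.mean(post_ttt)
std = statistics.stdev(post_ttt)  # sample std (n-1 denominator)
print(f"{mean:.5f} {std:.5f}")    # 1.06990 0.00025
```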

Key Techniques (all from PR #1874)

  1. LQER Asymmetric Rank-4 (PR Record: PR #1787 base + Smear Gate + LQER Asym — val_bpb 1.06157 #1797) — SVD-based quantization error reduction
  2. SmearGate + Attention Output Gate (width 24) (PR Record: SP8192 + SmearGate + AttnOutGate(w24) + LoRA-TTT Improvements + Phased TTT — val_bpb 1.06991 (3-seed mean) #1790) — per-layer smoothing + attention gating
  3. Polar Express Newton-Schulz (PR Record: SP4096 + Polar Express + MuonEq-R + Depth Recurrence — 1.0923 BPB (3-seed) #1344) — minimax-tuned Muon coefficients
  4. MIN_LR=0.10 (PR Record: PR #1736 + Polar Express NS + MIN_LR + Sparse Attn Gate + Fused CE + PR #1767 TTT — val_bpb 1.06335 #1787) — warmdown LR floor
  5. Phased Score-First LoRA-TTT — 3-phase AdamW (rank 128)
  6. TTT_CHUNK_SIZE=32 — finer-grained TTT adaptation (our addition)
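For item 1, the core LQER idea is correcting quantization error with a truncated SVD. The following is a hedged sketch under simplified assumptions — a symmetric int4-style quantizer and a plain (not asymmetric) rank-4 correction, both stand-ins for the PR's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64)).astype(np.float32)

# Toy symmetric int4-style quantizer (range [-7, 7]); illustrative only.
scale = np.abs(W).max() / 7.0
Q = np.round(W / scale).clip(-7, 7) * scale

# LQER: approximate the quantization error with a rank-4 SVD factor.
E = W - Q
U, S, Vt = np.linalg.svd(E, full_matrices=False)
r = 4
E_lowrank = (U[:, :r] * S[:r]) @ Vt[:r]  # best rank-4 approximation of E

err_plain = np.linalg.norm(W - Q)
err_lqer = np.linalg.norm(W - (Q + E_lowrank))
assert err_lqer < err_plain  # the stored low-rank factors shrink the error
```

Storing `Q` plus the small rank-4 factors costs little artifact space relative to the reconstruction-error reduction, which is the trade the technique exploits.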

How to reproduce

SEED=1337 TTT_CHUNK_SIZE=32 torchrun --standalone --nproc_per_node=8 train_gpt.py

Rule Compliance

  • Score-first phased TTT (no re-scoring)
  • No pre-quant TTT on validation data
  • No n-gram cache, no PPM
  • No CaseOps, no casefold
  • Artifact ≤ 16,000,000 bytes (max 15,950,196)
  • Train ≤ 600s, eval ≤ 600s

Attribution

Built on PR #1874 (AjAnubolu), PR #1790 (miaoyuxun), PR #1344 (Polar Express), PR #1787 (nprime06), PR #1797 (dexhunter).

Test plan

  • Reproduce any seed with SEED=<N> TTT_CHUNK_SIZE=32 torchrun --standalone --nproc_per_node=8 train_gpt.py
  • Verify artifact < 16,000,000 bytes in each seed log
  • Verify score-first TTT ordering in code
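The artifact-size check in the test plan can be scripted. A minimal sketch — the artifact path is hypothetical, and here a stand-in temp file is used for the demo:

```python
import os
import tempfile

LIMIT = 16_000_000  # bytes, per the rule-compliance budget

def check_artifact(path: str) -> int:
    """Assert the serialized artifact fits the size budget; return its size."""
    size = os.path.getsize(path)
    assert size <= LIMIT, f"artifact {size} B exceeds {LIMIT} B budget"
    return size

# Demo on a stand-in file; a real check would point at the submission artifact.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\0" * 1024)
print(check_artifact(f.name))  # 1024
```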

🤖 Generated with Claude Code

…3-seed mean)

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@bigbag
Author

bigbag commented Apr 29, 2026

Thanks to OpenAI's Advanced Competitor grant ($500 compute credit via RunPod) for making this work possible.

…1.06990)

Added MIN_LR=0.10, QK_GAIN_INIT=5.25, GATE_ATTN_WIDTH=24,
GPTQ_RESERVE_SECONDS=0.5, VAL_LOSS_EVERY=0. Same code, env vars only.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@pliashkov-spm

UPDATE: val_bpb improved from 1.06990 → 1.06844 (3-seed mean)

Fixed hyperparameter defaults to match PR #1874's intended configuration:

| Env var | Old (default) | New | Impact |
|---------|---------------|-----|--------|
| MIN_LR | 0.0 | 0.10 | Prevents LR collapse to zero in warmdown |
| QK_GAIN_INIT | 5.0 | 5.25 | PR #1874's attention scaling |
| GATE_ATTN_WIDTH | 12 | 24 | PR #1874's gate capacity |
| GPTQ_RESERVE_SECONDS | 4.0 | 0.5 | +28 training steps |
| VAL_LOSS_EVERY | 4000 | 0 | +112 training steps (no mid-train eval) |

Same code (PR #1874 verbatim), env vars only. All three seeds are under the 600s train and eval budgets.
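The env-var-only override mechanism can be sketched as follows. The variable names match the table; how `train_gpt.py` actually parses them is an assumption — the built-in defaults here are the "Old" column, and the run command supplies the "New" values:

```python
import os

# Built-in defaults ("Old" column); the launch command overrides them
# via environment variables ("New" column). Parsing style is assumed.
MIN_LR = float(os.environ.get("MIN_LR", "0.0"))
QK_GAIN_INIT = float(os.environ.get("QK_GAIN_INIT", "5.0"))
GATE_ATTN_WIDTH = int(os.environ.get("GATE_ATTN_WIDTH", "12"))
GPTQ_RESERVE_SECONDS = float(os.environ.get("GPTQ_RESERVE_SECONDS", "4.0"))
VAL_LOSS_EVERY = int(os.environ.get("VAL_LOSS_EVERY", "4000"))
```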

| Seed | Post-TTT BPB | Eval time | Artifact |
|------|--------------|-----------|----------|
| 1337 | 1.06798 | 409.6s | 15,950,405 B |
| 42   | 1.06824 | 421.0s | 15,952,215 B |
| 2025 | 1.06909 | 381.2s | 15,948,755 B |
| Mean | 1.06844 (std 0.00058) | 403.9s | |

Reproduce: SEED=1337 MIN_LR=0.10 QK_GAIN_INIT=5.25 GATE_ATTN_WIDTH=24 GPTQ_RESERVE_SECONDS=0.5 VAL_LOSS_EVERY=0 torchrun --standalone --nproc_per_node=8 train_gpt.py

@bigbag
Author

bigbag commented Apr 29, 2026

Superseded by new PR with updated results (1.06844 3-seed mean, improved from 1.06990).

@bigbag bigbag closed this Apr 29, 2026