
Record: SP8192 PR #1874 + Optimized Hyperparameters — val_bpb 1.06844 (3-seed mean)#1926

Open
bigbag wants to merge 2 commits into openai:main from bigbag:submission/pr1874-chunk32

Conversation


bigbag commented Apr 29, 2026

Summary

val_bpb = 1.06844 (3-seed mean, std 0.00058) | val_loss = 2.75955 nats/token | artifact ~15.95 MB | 8×H100 SXM, 600s train / 600s eval

Runs the code of PR #1874 (AjAnubolu) verbatim, with optimized hyperparameters supplied via environment variables. No code modifications: all improvements come from properly activating PR #1874's intended settings.

Key env vars: MIN_LR=0.10 (prevents the LR from collapsing to zero during warmdown), QK_GAIN_INIT=5.25, GATE_ATTN_WIDTH=24, GPTQ_RESERVE_SECONDS=0.5 (maximizes training steps), and VAL_LOSS_EVERY=0 (eliminates mid-train eval overhead); a sketch of how such overrides are read is shown below.
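For orientation, here is a minimal sketch of how environment-variable overrides like these are typically read in a train_gpt.py-style script. The variable names match this PR; among the fallback defaults, only GPTQ_RESERVE_SECONDS=4.0, GATE_ATTN_WIDTH=12 (half the doubled width), and MIN_LR=0.0 follow from the text above, and the rest are assumptions:

```python
import os

# Hyperparameter overrides read from the environment (sketch, not the PR's code).
MIN_LR = float(os.environ.get("MIN_LR", "0.0"))                # 0.0 = no floor (implied default)
QK_GAIN_INIT = float(os.environ.get("QK_GAIN_INIT", "1.0"))    # default assumed
GATE_ATTN_WIDTH = int(os.environ.get("GATE_ATTN_WIDTH", "12")) # 24 doubles this
GPTQ_RESERVE_SECONDS = float(os.environ.get("GPTQ_RESERVE_SECONDS", "4.0"))
VAL_LOSS_EVERY = int(os.environ.get("VAL_LOSS_EVERY", "125"))  # default assumed; 0 disables mid-train eval
```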

No CaseOps, no casefold, no PPM — standard SP8192 UTF-8 byte counting throughout.
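Bits-per-byte here is the validation loss in nats converted to bits and normalized by UTF-8 byte counts. A quick self-consistency check on the headline numbers (the bytes-per-token figure is implied by the two reported values, not reported itself):

```python
import math

# val_bpb = val_loss [nats/token] / ln(2) / (UTF-8 bytes per token)
val_loss_nats = 2.75955
val_bpb = 1.06844
bits_per_token = val_loss_nats / math.log(2)   # ~3.98 bits/token
bytes_per_token = bits_per_token / val_bpb     # ~3.73 UTF-8 bytes/token (implied)
print(f"implied bytes per token: {bytes_per_token:.3f}")
```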

3-Seed Results (8×H100 80GB SXM, phased TTT, 10-min train / 10-min eval budgets)

| Seed | Steps | Pre-quant (post-EMA) | Quantized | Post-TTT | Artifact (bytes) | Train time | Eval time |
|------|-------|----------------------|-----------|----------|------------------|------------|-----------|
| 1337 | 4834  | 1.06960              | 1.07925   | 1.06798  | 15,950,405       | 599.64s    | 409.6s    |
| 42   | 4844  | 1.06984              | 1.07948   | 1.06824  | 15,952,215       | 599.65s    | 421.0s    |
| 2025 | 4837  | 1.07060              | 1.08028   | 1.06909  | 15,948,755       | 599.56s    | 381.2s    |
| Mean |       | 1.07001              | 1.07967   | 1.06844  | 15,950,458       | 599.62s    | 403.9s    |
| Std  |       | 0.00053              | 0.00053   | 0.00058  | 1,730            | 0.05s      | 20.2s     |

All three seeds clear the size (< 16,000,000 bytes), train-time (< 600s), and eval-time (< 600s) budgets.

Key Techniques (all from PR #1874, activated via env vars)

  1. LQER Asymmetric Rank-4 (PR Record: PR #1787 base + Smear Gate + LQER Asym — val_bpb 1.06157 #1797) — SVD-based quantization error reduction (sketched after this list)
  2. SmearGate + Attention Output Gate (width 24) (PR Record: SP8192 + SmearGate + AttnOutGate(w24) + LoRA-TTT Improvements + Phased TTT — val_bpb 1.06991 (3-seed mean) #1790) — per-layer smoothing + attention gating
  3. Polar Express Newton-Schulz (PR Record: SP4096 + Polar Express + MuonEq-R + Depth Recurrence — 1.0923 BPB (3-seed) #1344) — minimax-tuned Muon coefficients (sketched after this list)
  4. MIN_LR=0.10 (PR Record: PR #1736 + Polar Express NS + MIN_LR + Sparse Attn Gate + Fused CE + PR #1767 TTT — val_bpb 1.06335 #1787) — warmdown LR floor preventing collapse to zero (sketched after this list)
  5. QK_GAIN_INIT=5.25 — per-head query-key attention scaling (sketched after this list)
  6. GATE_ATTN_WIDTH=24 — doubled attention gate capacity
  7. GPTQ_RESERVE_SECONDS=0.5 — maximizes training steps (+28 steps vs default 4.0)
  8. VAL_LOSS_EVERY=0 — eliminates mid-training eval overhead (+112 steps)
  9. Phased Score-First LoRA-TTT — 3-phase AdamW (rank 128)
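
For item 1, a minimal sketch of the LQER idea at rank 4: quantize the weight, then repair the quantization residual with a truncated-SVD low-rank correction. The quantize callable is a placeholder, and the asymmetric weighting used in the actual PR is omitted:

```python
import torch

def lqer_low_rank_correction(W: torch.Tensor, quantize, rank: int = 4):
    """Quantize W, then fit a rank-`rank` correction to the residual via SVD."""
    Q = quantize(W)                        # placeholder for the PR's quantizer
    E = (W - Q).float()                    # quantization error to be repaired
    U, S, Vh = torch.linalg.svd(E, full_matrices=False)
    A = U[:, :rank] * S[:rank]             # (out, rank), singular values folded in
    B = Vh[:rank, :]                       # (rank, in)
    # Deployed forward pass: x @ Q.T + (x @ B.T) @ A.T  ~  x @ W.T
    return Q, A, B
```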
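For item 3, the shape of the Newton-Schulz orthogonalization Muon applies to gradient updates. The fixed quintic coefficients below are the standard Muon triple; Polar Express replaces them with minimax-tuned per-iteration coefficients, which are not reproduced here:

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximate the nearest (semi-)orthogonal matrix to G, Muon-style."""
    a, b, c = 3.4445, -4.7750, 2.0315      # standard Muon quintic coefficients
    X = G.bfloat16()
    transpose = X.size(-2) > X.size(-1)
    if transpose:                           # iterate on the wide orientation
        X = X.mT
    X = X / (X.norm() + 1e-7)               # bring the spectral norm near <= 1
    for _ in range(steps):
        A = X @ X.mT
        B = b * A + c * (A @ A)
        X = a * X + B @ X                   # quintic polynomial iteration in X
    if transpose:
        X = X.mT
    return X
```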
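For item 4, a sketch of what the warmdown floor changes. The constant-then-linear-warmdown schedule shape is an assumption based on typical speedrun schedules, not taken from the PR code:

```python
def lr_multiplier(step: int, total_steps: int,
                  warmdown_frac: float = 0.4, min_lr: float = 0.10) -> float:
    """Constant LR, then a linear warmdown that ends at `min_lr` instead of 0."""
    warmdown_start = int(total_steps * (1.0 - warmdown_frac))
    if step < warmdown_start:
        return 1.0
    remaining = (total_steps - step) / max(total_steps - warmdown_start, 1)
    return min_lr + (1.0 - min_lr) * remaining   # floor at min_lr, not zero
```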
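For item 5, one plausible reading of per-head query-key scaling: a learnable per-head gain, initialized to QK_GAIN_INIT, multiplied into the pre-softmax attention logits. Whether the PR applies the gain to the logits or to the normalized q/k vectors themselves is not specified here, so treat this as illustrative:

```python
import torch
from torch import nn

class QKGain(nn.Module):
    """Learnable per-head multiplicative gain on attention logits (sketch)."""
    def __init__(self, n_heads: int, init: float = 5.25):
        super().__init__()
        self.gain = nn.Parameter(torch.full((n_heads, 1, 1), init))

    def forward(self, logits: torch.Tensor) -> torch.Tensor:
        # logits: (batch, n_heads, seq_q, seq_k) pre-softmax attention scores
        return logits * self.gain
```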

How to reproduce

SEED=1337 MIN_LR=0.10 QK_GAIN_INIT=5.25 GATE_ATTN_WIDTH=24 \
  GPTQ_RESERVE_SECONDS=0.5 VAL_LOSS_EVERY=0 \
  torchrun --standalone --nproc_per_node=8 train_gpt.py

Rule Compliance

  • Score-first phased TTT (no re-scoring)
  • No pre-quant TTT on validation data
  • No n-gram cache, no PPM
  • No CaseOps, no casefold
  • Artifact ≤ 16,000,000 bytes (max 15,952,215)
  • Train ≤ 600s (max 599.65s), eval ≤ 600s (max 421.0s)

Attribution

Built on PR #1874 (AjAnubolu), PR #1790 (miaoyuxun), PR #1344 (Polar Express), PR #1787 (nprime06), PR #1797 (dexhunter).

Test plan

  • Reproduce any seed with env vars from "How to reproduce" section
  • Verify artifact < 16,000,000 bytes in each seed log
  • Verify eval_time < 600s in each seed log
  • Verify score-first TTT ordering in code

🤖 Generated with Claude Code

Pavel Liashkov and others added 2 commits April 29, 2026 15:39
  • …3-seed mean)
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
  • …1.06990)
    Added MIN_LR=0.10, QK_GAIN_INIT=5.25, GATE_ATTN_WIDTH=24,
    GPTQ_RESERVE_SECONDS=0.5, VAL_LOSS_EVERY=0. Same code, env vars only.
    Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

bigbag commented Apr 29, 2026

Thanks to OpenAI's Advanced Competitor grant ($500 compute credit via RunPod) for making this work possible.

