
Record: Fused softcap CE + WD=2.0 (warm-start stability fix) — val_bpb 1.06957 (3-seed mean) #1886

Open
renqianluo wants to merge 1 commit into openai:main from renqianluo:record/fused-ce-wd2-1.06957

Conversation

@renqianluo

Summary

Stacks @nprime06's Fused softcap CE Triton kernel (PR #1787) on top of our PR #1768. Discovered a divergence interaction — Fused CE's fp32-accumulation differences combined with warm-start A push seeds 314/1337 into a TTT attractor that collapses to ~1.121 BPB at WD=1.0, while seed 42 trains cleanly.
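For readers unfamiliar with the operation being fused: "softcap CE" squashes the logits with a tanh soft cap before the standard cross-entropy. A minimal unfused reference sketch (the `cap=30.0` value and function name are illustrative, not taken from PR #1787, whose Triton kernel computes the same quantity in a single pass with its own fp32 accumulation order):

```python
import numpy as np

def softcap_cross_entropy(logits, target, cap=30.0):
    """Reference (unfused) softcap CE for a single token.

    cap * tanh(logits / cap) bounds every logit to (-cap, cap);
    standard log-softmax cross-entropy follows. The fused kernel's
    fp32 accumulation differences relative to this eager path are
    what interact with the warm-started A matrices described above.
    """
    capped = cap * np.tanh(np.asarray(logits, dtype=np.float64) / cap)
    capped = capped - capped.max()                      # stabilize log-sum-exp
    log_probs = capped - np.log(np.exp(capped).sum())   # log-softmax
    return -log_probs[target]
```

With uniform logits over a vocabulary of size V, this returns log(V), as expected for an uninformative prediction.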

Novel: WD=2.0 unlocks the combination

Raising `TTT_WEIGHT_DECAY` from 1.0 to 2.0 regularizes the cross-batch A drift enough that all three seeds stay stable while keeping the warm-start gain.

| Config | Seed 42 | Seed 314 | Seed 1337 |
| --- | --- | --- | --- |
| Fused CE + warm + WD=1.0 | 1.06923 | 1.12144 (collapse) | 1.12145 (collapse) |
| Fused CE + no-warm + WD=1.0 | 1.07078 | 1.07078 | 1.07162 |
| Fused CE + warm + WD=2.0 (this) | 1.06920 | 1.06942 | 1.07010 |
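Mechanically, the fix is just the decay coefficient on the TTT fast-weight update. A hedged sketch of one decoupled-weight-decay step (the names `A`, `ttt_step`, and `ttt_lr` are illustrative, not the actual repo code, which may scale or schedule the decay differently):

```python
import numpy as np

TTT_WEIGHT_DECAY = 2.0  # was 1.0; the only hyperparameter change in this fix

def ttt_step(A, grad, ttt_lr=1e-3, wd=TTT_WEIGHT_DECAY):
    """One decoupled (AdamW-style) weight-decay update on fast weights A.

    A larger wd shrinks the warm-started A toward zero every step,
    damping the cross-batch drift that collapsed seeds 314/1337
    while leaving the gradient term untouched.
    """
    A = A * (1.0 - ttt_lr * wd)  # decay term, decoupled from the gradient
    A = A - ttt_lr * grad        # gradient term
    return A
```

With zero gradient, doubling `wd` simply doubles the per-step shrinkage rate of A, which is the intuition for why WD=2.0 suppresses the attractor without erasing the warm-start advantage.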

Results

| Seed | PR #1768 | This | Delta |
| --- | --- | --- | --- |
| 1337 | 1.07146 | 1.07010 | -0.00136 |
| 42 | 1.07014 | 1.06920 | -0.00094 |
| 314 | 1.07082 | 1.06942 | -0.00140 |
| Mean | 1.07081 | 1.06957 | -0.00124 |
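The reported means can be checked directly from the per-seed numbers:

```python
baseline = {1337: 1.07146, 42: 1.07014, 314: 1.07082}  # PR #1768
this_pr  = {1337: 1.07010, 42: 1.06920, 314: 1.06942}  # this record

mean_base = sum(baseline.values()) / 3  # 1.07081 at 5 dp
mean_this = sum(this_pr.values()) / 3   # 1.06957 at 5 dp

# The table's mean delta (-0.00124) is the difference of the rounded means.
delta = round(mean_base, 5) - round(mean_this, 5)
```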

Compliance: train ≤599.7s, eval 393–478s, artifact 15.98MB. Issue #1017 conditions 1–4 verified.

Hardware

Trained on RunPod 8xH100 80GB SXM (PyTorch 2.9.1+cu128, FA3, Triton 3.5.1) rather than the Zoom cluster. Identical SP8192 tokenizer and FineWeb selection as upstream willdepueoai/parameter-golf.

Attribution

@nprime06 (PR #1787 — Fused CE), PR #1344 (Polar Express NS), @dexhunter (PR #1736 — GatedAttn), @samacqua (PR #1530 — VarLen+FusedMLP+doc-LoRA), @bigbag (PR #1493), @renqianluo (PR #1767, PR #1768).
