
Record: Fused softcap CE + WD=2.0 (warm-start stability fix) — val_bpb 1.06957 (3-seed mean) #1886

Open
renqianluo wants to merge 1 commit into openai:main from renqianluo:record/fused-ce-wd2-1.06957

Conversation

@renqianluo

Summary

Stacks @nprime06's Fused softcap CE Triton kernel (PR #1787) on top of our PR #1768. Discovered a divergence interaction — Fused CE's fp32-accumulation differences combined with warm-start A push seeds 314/1337 into a TTT attractor that collapses to ~1.121 BPB at WD=1.0, while seed 42 trains cleanly.
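For readers unfamiliar with the operation being fused: "softcap CE" squashes the logits with a tanh soft cap before the standard cross-entropy. A minimal unfused reference sketch (the `cap=30.0` value and function name are illustrative, not taken from PR #1787, whose Triton kernel computes the same quantity in a single pass with its own fp32 accumulation order):

```python
import numpy as np

def softcap_cross_entropy(logits, target, cap=30.0):
    """Reference (unfused) softcap CE for a single token.

    cap * tanh(logits / cap) bounds every logit to (-cap, cap);
    standard log-softmax cross-entropy follows. The fused kernel's
    fp32 accumulation differences relative to this eager path are
    what interact with the warm-started A matrices described above.
    """
    capped = cap * np.tanh(np.asarray(logits, dtype=np.float64) / cap)
    capped = capped - capped.max()                      # stabilize log-sum-exp
    log_probs = capped - np.log(np.exp(capped).sum())   # log-softmax
    return -log_probs[target]
```

With uniform logits over a vocabulary of size V, this returns log(V), as expected for an uninformative prediction.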

Novel: WD=2.0 unlocks the combination

Raising `TTT_WEIGHT_DECAY` from 1.0 to 2.0 regularizes the cross-batch A drift enough that all three seeds stay stable while keeping the warm-start gain.

| Config | Seed 42 | Seed 314 | Seed 1337 |
| --- | --- | --- | --- |
| Fused CE + warm + WD=1.0 | 1.06923 | 1.12144 (collapse) | 1.12145 (collapse) |
| Fused CE + no-warm + WD=1.0 | 1.07078 | 1.07078 | 1.07162 |
| Fused CE + warm + WD=2.0 (this) | 1.06920 | 1.06942 | 1.07010 |
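Mechanically, the fix is just the decay coefficient on the TTT fast-weight update. A hedged sketch of one decoupled-weight-decay step (the names `A`, `ttt_step`, and `ttt_lr` are illustrative, not the actual repo code, which may scale or schedule the decay differently):

```python
import numpy as np

TTT_WEIGHT_DECAY = 2.0  # was 1.0; the only hyperparameter change in this fix

def ttt_step(A, grad, ttt_lr=1e-3, wd=TTT_WEIGHT_DECAY):
    """One decoupled (AdamW-style) weight-decay update on fast weights A.

    A larger wd shrinks the warm-started A toward zero every step,
    damping the cross-batch drift that collapsed seeds 314/1337
    while leaving the gradient term untouched.
    """
    A = A * (1.0 - ttt_lr * wd)  # decay term, decoupled from the gradient
    A = A - ttt_lr * grad        # gradient term
    return A
```

With zero gradient, doubling `wd` simply doubles the per-step shrinkage rate of A, which is the intuition for why WD=2.0 suppresses the attractor without erasing the warm-start advantage.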

Results

| Seed | PR #1768 | This | Delta |
| --- | --- | --- | --- |
| 1337 | 1.07146 | 1.07010 | -0.00136 |
| 42 | 1.07014 | 1.06920 | -0.00094 |
| 314 | 1.07082 | 1.06942 | -0.00140 |
| Mean | 1.07081 | 1.06957 | -0.00124 |
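The reported means can be checked directly from the per-seed numbers:

```python
baseline = {1337: 1.07146, 42: 1.07014, 314: 1.07082}  # PR #1768
this_pr  = {1337: 1.07010, 42: 1.06920, 314: 1.06942}  # this record

mean_base = sum(baseline.values()) / 3  # 1.07081 at 5 dp
mean_this = sum(this_pr.values()) / 3   # 1.06957 at 5 dp

# The table's mean delta (-0.00124) is the difference of the rounded means.
delta = round(mean_base, 5) - round(mean_this, 5)
```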

Compliance: train ≤599.7s, eval 393–478s, artifact 15.98MB. Issue #1017 conditions 1–4 verified.

Hardware

Trained on RunPod 8xH100 80GB SXM (PyTorch 2.9.1+cu128, FA3, Triton 3.5.1) rather than the Zoom cluster. Identical SP8192 tokenizer and FineWeb selection as upstream willdepueoai/parameter-golf.

Attribution

@nprime06 (PR #1787 — Fused CE), PR #1344 (Polar Express NS), @dexhunter (PR #1736 — GatedAttn), @samacqua (PR #1530 — VarLen+FusedMLP+doc-LoRA), @bigbag (PR #1493), @renqianluo (PR #1767, PR #1768).
