Record: Polar Express NS + MIN_LR + GatedAttn + Alpha LoRA — val_bpb 1.07006 (3-seed mean) #1792

Open

renqianluo wants to merge 1 commit into openai:main from renqianluo:record/polar-minlr-1.07006

Conversation

@renqianluo

Summary

Stacks three independently-validated improvements from other authors onto our PR #1768:

  1. Polar Express NS coefficients (ported from PR #1344, Record: SP4096 + Polar Express + MuonEq-R + Depth Recurrence — 1.0923 BPB, 3-seed) — 5 per-iteration minimax-optimal (a, b, c) Newton-Schulz tuples instead of the single fixed tuple (3.4445, -4.775, 2.0315) applied 5 times. Higher-quality polar factor at unchanged MUON_BACKEND_STEPS=5; see the iteration sketch after this list.
  2. MIN_LR=0.10 warmdown floor (from @nprime06's PR #1787, Record: PR #1736 + Polar Express NS + MIN_LR + Sparse Attn Gate + Fused CE + PR #1767 TTT — val_bpb 1.06335) — the warmdown no longer decays all the way to zero, so the final ~25% of training keeps delivering useful gradient updates; see the schedule sketch after this list.
  3. Tight budget polish (also from @nprime06's PR #1787): GPTQ_RESERVE_SECONDS=0.5 (down from 4.0) and VAL_LOSS_EVERY=0 together reclaim ~15s for extra training steps.
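
Below is a minimal sketch of how per-iteration coefficients slot into a Muon-style quintic Newton-Schulz polar routine. The function name, the `BASELINE_COEFFS` / `POLAR_EXPRESS_COEFFS` names, and the overall structure are illustrative assumptions; the actual Polar Express tuples are defined in PR #1344 and are not reproduced here.

```python
import torch

# Baseline: one fixed tuple repeated for all MUON_BACKEND_STEPS=5 iterations.
BASELINE_COEFFS = [(3.4445, -4.775, 2.0315)] * 5

# Polar Express instead supplies 5 distinct minimax-optimal (a, b, c) tuples,
# one per iteration (values live in PR #1344; placeholder name only):
# POLAR_EXPRESS_COEFFS = [(a1, b1, c1), ..., (a5, b5, c5)]

def newton_schulz_polar(G: torch.Tensor, coeffs=BASELINE_COEFFS) -> torch.Tensor:
    """Approximate the polar factor of G with quintic Newton-Schulz steps.

    Each step applies X <- a*X + b*(X X^T) X + c*(X X^T)^2 X using that
    step's (a, b, c). Swapping in per-iteration tuples changes only the
    coefficients; the step count stays at len(coeffs) == 5.
    """
    assert G.ndim >= 2
    X = G.bfloat16()
    transposed = X.size(-2) > X.size(-1)
    if transposed:                      # iterate on the short side
        X = X.mT
    # Frobenius-normalize so the spectral norm is <= 1 before iterating.
    X = X / (X.norm(dim=(-2, -1), keepdim=True) + 1e-7)
    for a, b, c in coeffs:
        A = X @ X.mT
        X = a * X + (b * A + c * (A @ A)) @ X
    if transposed:
        X = X.mT
    return X
```

The claim being ported is that a per-step minimax schedule yields a closer orthogonal approximation within the same 5 matmul-heavy iterations, which is why MUON_BACKEND_STEPS is left untouched.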
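
And a similarly hedged sketch of what the MIN_LR floor does during warmdown, assuming a hold-then-linear-decay schedule over the last ~25% of steps (the `lr_multiplier` name, the `warmdown_frac` parameter, and the exact schedule shape are assumptions; the real schedule is whatever PR #1787 implements):

```python
def lr_multiplier(step: int, num_steps: int,
                  warmdown_frac: float = 0.25, min_lr: float = 0.10) -> float:
    """Illustrative LR multiplier: hold at 1.0, then warm down linearly.

    Without a floor the multiplier reaches ~0 on the last step, so the
    tail of training barely moves the weights. With MIN_LR=0.10 the decay
    is clamped, and late steps still deliver useful gradient updates.
    """
    warmdown_steps = int(warmdown_frac * num_steps)
    if step < num_steps - warmdown_steps:
        return 1.0                                   # constant phase
    remaining = num_steps - step
    return max(remaining / warmdown_steps, min_lr)   # clamp at the MIN_LR floor
```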

Trajectory

| Seed | PR #1767 | PR #1768 | This PR |
|------|----------|----------|---------|
| 1337 | 1.07189 | 1.07146 | 1.07027 |
| 42 | 1.07248 | 1.07014 | 1.06964 |
| 314 | 1.07189 | 1.07082 | 1.07026 |
| Mean | 1.07209 | 1.07081 | 1.07006 |

Every seed improves monotonically across each change.

Compliance

Train 599.6s (all 3), eval 474–481s, artifact 15.98MB. Issue #1017 conditions 1–4 verified.

Attribution

Polar Express NS coefficients: PR #1344. MIN_LR warmdown floor and tight budget polish: @nprime06's PR #1787. Base recipe: our PR #1768.