Newton-Muon × PR #1874's document-packed loader: a controlled negative result #1907
Open
GodlyDonuts wants to merge 2 commits into openai:main from
Conversation
… nat) Grafting hook-based Newton-Schulz residual orthogonalization onto PR openai#1874's full stack regresses val_bpb by +0.0378 nat in a controlled same-seed A/B. Root cause is dynamo recompile fragmentation: the per-module integer counter `_nm_K_count` is mutated inside a forward-pre-hook, dynamo treats integer attributes on nn.Module as static, every transformer block hits config.recompile_limit=16 within ~10 steps, fullgraph compilation silently breaks, FA3 fused kernels stop emitting cleanly, and step time inflates ~2.4x. Filed with full diagnostic logs (train_nm_default.log, train_nm_smoke.log) and a same-seed PR openai#1874 baseline (train_baseline_seed42.log) for direct comparison, so other competitors don't burn the same compute. Author: Saicharan Ramineni <[email protected]>
…he README
- models/nm_default.int6.ptz — Newton-Muon enabled, full 600s, seed=42 (val_bpb 1.10705)
- models/nm_smoke.int6.ptz — Newton-Muon enabled, 180s smoke run
- models/baseline_pr1874_seed42.int6.ptz — PR openai#1874 baseline, NM disabled, seed=42 (val_bpb 1.06928)

This lets a reviewer inspect the actual trained quantized weights without having to retrain. Total ~46 MB of binary artifacts. README updated with a verified CPU-only inspection snippet (brotli decompress -> _byte_unshuffle -> torch.load) demonstrating the artifacts are well-formed int6 GPTQ state dicts. Author: Saicharan Ramineni <[email protected]>
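For readers without the README in front of them, here is a hedged reconstruction of that CPU-only inspection pipeline. It assumes only what the commit message states (brotli decompress -> `_byte_unshuffle` -> `torch.load`); the function name `inspect_artifact` and the identity placeholder for the unshuffle step are illustrative, and the PR's actual `_byte_unshuffle` should be passed in.

```python
# Hedged sketch of the artifact inspection flow described in the commit
# message above. The real _byte_unshuffle ships with this PR; the identity
# default below is only so the sketch is self-contained.
import io

import brotli  # pip install brotli
import torch


def inspect_artifact(path: str, byte_unshuffle=lambda b: b) -> dict:
    """Load a .int6.ptz artifact on CPU and print a shape/dtype summary."""
    with open(path, "rb") as f:
        raw = brotli.decompress(f.read())
    state = torch.load(io.BytesIO(byte_unshuffle(raw)), map_location="cpu")
    for key, value in state.items():
        desc = tuple(value.shape) if hasattr(value, "shape") else type(value).__name__
        print(f"{key}: {desc}")
    return state


# Usage with the PR's real helper (import location is an assumption):
#   from train_gpt import _byte_unshuffle
#   inspect_artifact("models/baseline_pr1874_seed42.int6.ptz", _byte_unshuffle)
```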
Newton-Muon × PR #1874's Document-Packed Loader: A Controlled Negative Result
When PR #1874 landed and started showing `val_bpb` in the 1.067 band, follow-up leaderboard discussions (including threads under PR #1900) pointed at Newton–Schulz-style residual orthogonalization in the optimizer — "Newton-Muon" — as a plausible next bump on top of the Polar Express NS already used in #1874. We spent a weekend trying it. This PR documents what we found, with full logs and trained artifacts, in case it saves others the same compute.

Headline
In a strict same-seed (42), same-data, same-batch-size, same-step-count A/B against the PR #1874 stack, enabling Newton-Muon regressed quantized + TTT-phased `val_bpb` from 1.06928 → 1.10705 (a +0.0378 nat hit). This is not noise or a lack of hyperparameter sweeping; it is a flat and reproducible regression. Both runs used the same `train_gpt.py` included here; the only delta was the environment variable `NEWTON_MUON_ENABLED`.

Why it Regresses (The Root Cause)
The Newton-Muon implementation we evaluated installs forward-pre-hooks on every Linear module to track a per-module integer counter `_nm_K_count`, triggering a Newton–Schulz preconditioning step every $K$-th forward pass. While this works on a vanilla static-shape loop, it fails on the PR #1874 stack for the following reasons:

- `torch._dynamo` treats integer attributes on `nn.Module` as static values. Every step, the hook increments `module._nm_K_count`. Dynamo sees this as a new value and emits a new specialization for every single step.
- The document-packed loader already specializes on the `cu_seqlens` shape. These two specialization axes together saturate the recompile budget (`torch._dynamo.config.recompile_limit=16`) almost immediately.
- Once the limit is hit, `fullgraph=True` is violated, and FlashAttention 3's fused paths stop emitting cleanly. Step time inflates roughly 2.4×, meaning fewer training steps fit inside the 600s budget.

The PyTorch hint (`allow_unspec_int_on_nn_module = True`) suppresses the log warnings but does not fix the underlying fragmentation. Dynamo still struggles to reconcile the dynamic integer against variable `cu_seqlens`, and FA3 kernels do not survive the transition.
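To make the failure mode concrete, here is a minimal, self-contained sketch of the pattern (illustrative only, not the PR's code): an integer attribute mutated inside a forward-pre-hook on a compiled module. `K` and `nm_pre_hook` are hypothetical names; running with `TORCH_LOGS=recompiles` should surface the per-step re-specialization described above.

```python
# Minimal sketch of the failure pattern: an int attribute mutated in a
# forward-pre-hook is treated by dynamo as a compile-time constant, so each
# increment guards a new graph specialization.
import torch
import torch.nn as nn

K = 4  # hypothetical preconditioning interval


def nm_pre_hook(module, args):
    module._nm_K_count += 1       # dynamo specializes on this int value
    if module._nm_K_count % K == 0:
        pass                      # Newton-Schulz step would fire here


lin = nn.Linear(64, 64)
lin._nm_K_count = 0
lin.register_forward_pre_hook(nm_pre_hook)

# The #1874 stack compiles with fullgraph=True; plain torch.compile is enough
# to observe the recompile churn (run with TORCH_LOGS=recompiles).
compiled = torch.compile(lin)
x = torch.randn(8, 64)
for _ in range(20):
    compiled(x)
```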
Structural Fixes

To resolve this, the $K$-counter and preconditioning trigger must be moved out of the compiled region:

- Drive the Newton–Schulz step from `optimizer.step()` rather than a forward hook (sketched below).
- Move `_nm_K_count` to a `torch.Tensor` scalar held outside `nn.Module.__dict__`.
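Here is a minimal sketch of the first option, assuming the trigger can live entirely in eager-mode optimizer code. `NewtonMuonWrapper` and `newton_schulz_precondition` are illustrative names, and the preconditioner shown is the plain cubic Newton–Schulz iteration, not #1874's Polar Express coefficients.

```python
# Hedged sketch: keep the counter out of the compiled region and drive the
# Newton-Schulz step from optimizer.step(), which runs in eager mode.
import torch


class NewtonMuonWrapper:
    """Wraps an existing optimizer; the compiled forward never sees the counter."""

    def __init__(self, inner: torch.optim.Optimizer, K: int = 4):
        self.inner = inner
        self.K = K
        # A plain Python int is fine here: step() executes outside the
        # torch.compile'd forward/backward graph, so no dynamo guard is created.
        self._step_count = 0

    def step(self):
        self._step_count += 1
        if self._step_count % self.K == 0:
            for group in self.inner.param_groups:
                for p in group["params"]:
                    if p.grad is not None and p.ndim == 2:
                        p.grad.copy_(newton_schulz_precondition(p.grad))
        self.inner.step()

    def zero_grad(self, set_to_none: bool = True):
        self.inner.zero_grad(set_to_none=set_to_none)


def newton_schulz_precondition(g: torch.Tensor, iters: int = 5) -> torch.Tensor:
    # Cubic Newton-Schulz iteration toward the orthogonal polar factor of g;
    # Frobenius normalization keeps the singular values in the convergent range.
    X = g / (g.norm() + 1e-7)
    for _ in range(iters):
        A = X @ X.T
        X = 1.5 * X - 0.5 * A @ X
    return X


# Usage: opt = NewtonMuonWrapper(torch.optim.SGD(model.parameters(), lr=0.02))
```

Either way, the key property is that nothing inside the compiled forward mutates or branches on per-step Python state.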
What is in the Folder

- `train_gpt.py`: PR #1874 (Record: SP8192 + Polar Express NS + MIN_LR + LQER Asym Rank-4 — val_bpb 1.06766, 3-seed mean) source with the Newton-Muon graft. Toggle via `NEWTON_MUON_ENABLED={0,1}`.
- `train_nm_default.log`: Full Newton-Muon run (val_bpb `1.10705`).
- `train_nm_smoke.log`: Short run surfacing the dynamo recompile diagnostics.
- `train_baseline_seed42.log`: PR #1874 baseline with NM disabled (val_bpb `1.06928`).
- `models/`: Three quantized artifacts (~16 MB each) for CPU inspection.
- `submission.json`, `requirements.txt`, `README.md`: Full metadata and results write-up.

Conclusion
This is not a bug in any individual component, but rather a predictable interaction between hook-based integer state, document-packed variable shapes, and fullgraph compilation. We are filing this as a non-record submission to ensure the community doesn't "burn" compute on this specific implementation path.
Compute spent: ~$12 on 8×H100 SXM (RunPod).
Submitted by: [Saicharan Ramineni (GodlyDonuts)](https://github.com/GodlyDonuts)