# Newton-Muon × Document-Packing — Negative Result

**Result: technique strongly REGRESSES (+0.0378 nat) under PR #1874's document-packed loader.** Filed as a non-record submission with full diagnostic logs so other competitors don't burn compute on the same dead end.

> **What this submission is, in one paragraph.**
> An honest negative result. We tried to graft hook-based Newton–Schulz residual orthogonalization ("Newton-Muon") onto PR #1874's full stack during our weekend of exploration. It regressed val_bpb by +0.0378 nat in a controlled same-seed A/B. Rather than bury the result, we are filing it here with the full diagnostic logs and a root-cause analysis. The same-seed reproduction baseline — produced by the same `train_gpt.py` source with `NEWTON_MUON_ENABLED=0` — is included for direct comparison and is the same configuration as the reproduction in our companion record submission.

---

## TL;DR

| Config (seed=42, identical data and hyperparameters otherwise) | quantized + TTT-phased val_bpb |
|--------|-------------------------------:|
| PR #1874 baseline | **1.06928** |
| PR #1874 + `NEWTON_MUON_ENABLED=1` | **1.10705** |
| **Δ** | **+0.0378 nat (worse)** |

Both runs use identical seed, dataset, batch size, step count, and hardware. The only delta is the `NEWTON_MUON_ENABLED=1` env var.

---

## Root Cause

The Newton-Muon implementation we tested uses a **forward-pre-hook on every Linear module** to accumulate a per-module integer counter `_nm_K_count` and trigger Newton–Schulz preconditioning every K-th forward pass. This is incompatible with PR #1874's training pipeline because:

1. **Document packing produces variable `cu_seqlens` per step.** PR #1874's loader concatenates documents into mixed-length sequences and passes per-step `cu_seqlens` into FlashAttention 3. Each unique `cu_seqlens` shape is already one source of dynamo specialization.

2. **`_nm_K_count` is a Python int attribute.** `torch._dynamo` treats integer attributes on `nn.Module` as **static** — it specializes on their value. Every step the hook does `module._nm_K_count += n`, which dynamo sees as a new value, triggering a new graph specialization (a minimal repro sketch follows this list).

3. **The recompile limit is hit within ~10 steps.** On 8 ranks, each transformer block hits `config.recompile_limit=16` almost immediately. From [`train_nm_default.log`](train_nm_default.log):

```
torch._dynamo hit config.recompile_limit (16)
function: 'forward'
last reason: 0/15: self._modules['blocks']._modules['0']._modules['attn']
._modules['attn_gate_proj']._nm_K_count == 1474560
```

4. **Cascade failure.** Dynamo falls back to eager for the affected blocks → `fullgraph=True` is silently violated → FlashAttention 3 fused kernels stop emitting cleanly → step time inflates ~2.4× → fewer steps fit in the 600s budget → final val_bpb regresses by +0.0378 nat.
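
To make the mechanism concrete, here is a minimal, self-contained sketch (hypothetical file name and helper names, ours rather than PR #1874's) of a Python-int counter mutated inside a forward-pre-hook under `torch.compile`:

```python
# repro_int_attr_recompile.py (hypothetical sketch, not shipped)
# Demonstrates dynamo specializing on a Python int attribute mutated in a hook.
import torch
import torch.nn as nn

lin = nn.Linear(64, 64)
lin._nm_K_count = 0  # plain Python int attribute: dynamo guards on its VALUE

def _nm_pre_hook(module, args):
    # Mutates the int on every call; each new value fails the previous guard,
    # so dynamo builds a fresh graph specialization per step.
    module._nm_K_count += args[0].numel()

lin.register_forward_pre_hook(_nm_pre_hook)
compiled = torch.compile(lin, fullgraph=True)

for step in range(20):
    compiled(torch.randn(8, 64))
# After enough distinct counter values, dynamo emits the
# "hit config.recompile_limit" warning and falls back to eager --
# the same guard failure as `_nm_K_count == 1474560` in the log above.
```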

The PyTorch hint in the warning is on point:

> HINT: torch.compile considers integer attributes of the nn.Module to be static. If you are observing recompilation, you might want to make this integer dynamic using `torch._dynamo.config.allow_unspec_int_on_nn_module = True`, or convert this integer into a tensor.

We tried `allow_unspec_int_on_nn_module = True` early in development. It suppressed the recompile warnings, but the underlying graph fragmentation persisted: dynamo still has to reconcile the variable-length `cu_seqlens` against a now-dynamic counter, and FA3's specialized paths get bypassed regardless.
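
The hint's second suggestion (convert the counter into a tensor) appears in our suggested-fixes list in `submission.json`, but we did not test it in this pipeline. A sketch of what it would look like (hypothetical, our names): a registered buffer is traced as graph data rather than guarded on as a constant, so mutating it does not force respecialization, with the caveat noted in the comments.

```python
# Hypothetical sketch (untested here): the counter as a tensor buffer.
import torch
import torch.nn as nn

lin = nn.Linear(64, 64)
# A zero-dim int64 buffer is graph data, not a guard constant, so in-place
# updates to it do not create new guard values and do not force recompiles.
lin.register_buffer("_nm_K_count", torch.zeros((), dtype=torch.int64))

def _nm_pre_hook(module, args):
    module._nm_K_count += args[0].numel()  # traced as a tensor op, no new guard
    # Caveat: branching on this tensor's value inside the compiled region
    # (e.g. "if module._nm_K_count % K == 0:") is data-dependent control flow
    # and would reintroduce graph breaks, so the every-K trigger needs care.

lin.register_forward_pre_hook(_nm_pre_hook)
compiled = torch.compile(lin, fullgraph=True)
```

Even then, this only removes the counter as a specialization source; per the analysis above, the variable-length `cu_seqlens` specialization from the document-packed loader remains.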

---

## Why This Is Worth Filing

PR #1874's stack is currently the best public score on the SP8192 track in absolute terms (1.06766 single-seed; we reproduced it independently to 1.06907 single-seed and 1.06996 3-seed mean). Newton-Muon-style optimizers are a frequently suggested next step in leaderboard discussions and in PR #1900's threads. Anyone porting Newton-Muon onto PR #1874 — or onto any base that uses a document-packing loader with variable `cu_seqlens` — will hit this same wall.

The fix is non-trivial. It probably requires moving the K-counter and the preconditioning trigger out of the compiled region entirely (e.g. into the `optimizer.step()` boundary, not a forward hook), which is enough of a redesign that we did not attempt it within our compute budget.

---

## What's Compatible

- **Static-shape forward** (single fixed seq_len, no document packing): the hook approach likely works there because `cu_seqlens` doesn't change and `_nm_K_count` becomes the only specialization source. We did not test this, but `allow_unspec_int_on_nn_module = True` should cover it.
- **Optimizer-step-based preconditioning** (not hook-based): triggered from `optimizer.step()`, runs outside any compiled region. This is the suggested fix from above; we did not implement it, but a sketch follows below.
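
For concreteness, a minimal sketch of that optimizer-step design (untested; our naming, with the classical cubic Newton–Schulz iteration standing in for a tuned Muon-style variant). The key property is that the K-counter is a plain Python int living in an optimizer wrapper, outside any compiled region, so dynamo never sees or guards on it:

```python
# k_step_preconditioner.py (hypothetical sketch, not shipped or tested)
import torch

class KStepNewtonSchulz:
    """Wraps an optimizer; orthogonalizes matrix grads every K-th step."""

    def __init__(self, optimizer, params, K=4, ns_iters=5):
        # K=4 mirrors NEWTON_MUON_CAPTURE_EVERY=4 from the repro command.
        self.optimizer = optimizer
        self.params = [p for p in params if p.ndim == 2]  # matrices only
        self.K = K
        self.ns_iters = ns_iters
        self.step_count = 0  # plain int, never traced: mutating it is free

    @torch.no_grad()
    def _orthogonalize(self, G):
        # Classical cubic Newton-Schulz iteration toward the orthogonal
        # (polar) factor of G; converges once singular values are < sqrt(3),
        # which the Frobenius-norm rescale guarantees.
        X = G / (G.norm() + 1e-7)
        for _ in range(self.ns_iters):
            X = 1.5 * X - 0.5 * (X @ X.mT @ X)
        return X

    @torch.no_grad()
    def step(self):
        self.step_count += 1
        if self.step_count % self.K == 0:
            for p in self.params:
                if p.grad is not None:
                    p.grad.copy_(self._orthogonalize(p.grad))
        self.optimizer.step()
```

Called in place of `optimizer.step()` in the training loop; everything here runs eagerly, so the compiled forward graph never sees the counter.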

---

## Reproduction

Identical to the record submission's reproduction steps; the only delta is the env var:

```bash
# Negative-result run (Newton-Muon enabled)
NEWTON_MUON_ENABLED=1 NEWTON_MUON_CAPTURE_EVERY=4 \
SEED=42 \
torchrun --standalone --nproc_per_node=8 train_gpt.py 2>&1 | tee train_nm_default.log

# Baseline (PR #1874 stack, NM disabled — identical seed, identical everything else)
SEED=42 torchrun --standalone --nproc_per_node=8 train_gpt.py 2>&1 | tee train_baseline_seed42.log
```

The diff between the two runs is exactly one env var.

---

## Files

- `README.md` (this file)
- `submission.json` — machine-readable metadata, including the diagnostic log lines
- `train_gpt.py` — PR #1874 source with the Newton-Muon graft (134 KB unwrapped, kept readable for the root-cause analysis)
- `train_nm_default.log` — full Newton-Muon run, val_bpb 1.10705
- `train_nm_smoke.log` — short capture run that surfaces the dynamo recompile diagnostics in detail
- `train_baseline_seed42.log` — PR #1874 baseline (NM disabled), val_bpb 1.06928. Identical seed and data — direct A/B comparison.
- `models/` — pre-trained `.int6.ptz` artifacts so a reviewer can eval-only without retraining (see "Eval-only verification" below).

### `models/` directory

| File | What it is | Size | Reported val_bpb |
|------|-----------|-----:|-----------------:|
| `models/nm_default.int6.ptz` | Newton-Muon enabled, full 600s training, seed=42 | 15,928,150 B | 1.10705 |
| `models/nm_smoke.int6.ptz` | Newton-Muon enabled, short 180s smoke run | 15,943,987 B | (smoke; not the headline number) |
| `models/baseline_pr1874_seed42.int6.ptz` | PR #1874 baseline, NM disabled, seed=42 (the A/B comparison artifact) | 15,921,161 B | 1.06928 |

These are checked into the submission so any reviewer can inspect the trained artifacts directly without having to retrain. Including model artifacts is not standard practice on this leaderboard; we're including them here because the value of a negative-result submission is "anyone can verify the failure mode," and shipping the artifacts gives the reviewer two independent ways to do that (re-run the script, or eval the shipped weights).

### How to use the shipped artifacts

The included `train_gpt.py` does **not** ship with an explicit `EVAL_ONLY` flag — its pipeline is `train → quantize → eval` end-to-end (the int6 artifact is written to `final_model.int6.ptz` at the end of training, then loaded and eval'd by the same process). To eval one of the shipped `.int6.ptz` artifacts without retraining, point the script at it via the `final_model.int6.ptz` filename it expects, and use the existing `deserialize(h, device)` helper at `train_gpt.py:2139`. A minimal harness looks like:

```python
# eval_shipped_artifact.py — sketch, not shipped
import shutil, train_gpt as TG
shutil.copy("models/nm_default.int6.ptz", "final_model.int6.ptz")
h = TG.Hyperparameters() # same env-var-driven config as training
TG.set_logging_hparams(h)
device = TG.setup_distributed()
eval_model = TG.deserialize(h, device)
TG.run_sliding_eval(eval_model, h, device) # sliding-window eval
TG.run_phased_ttt_eval(eval_model, h, device) # phased TTT eval
```

(Function names follow PR #1874's structure; exact entrypoints may need adjusting against the source.) For most reviewers, **re-running `train_gpt.py` from scratch on a fresh seed=42 is the simpler verification path**, since the script is already wired end-to-end and the regression is large (+0.0378 nat) and stable. The artifacts in `models/` are primarily archival evidence of the runs that produced the reported logs.

### Direct inspection without GPUs (verified)

The `.int6.ptz` files are produced by PR #1874's `serialize()` (in `train_gpt.py`): a torch-saved `{"w": <quant_result>, "m": <quant_meta>}` dict, byte-shuffled with stride 2, then brotli-compressed. To read one on CPU, undo those steps in reverse order (brotli-decompress, then byte-unshuffle, then `torch.load`):

```python
# verified on 2026-04-28 against models/nm_default.int6.ptz
import brotli, io, torch, numpy as np

_BSHF_MAGIC = b"BSHF"

def _byte_unshuffle(data):  # mirrors train_gpt.py:_byte_unshuffle
    if len(data) < 5 or data[:4] != _BSHF_MAGIC:
        return data
    stride = data[4]
    if stride < 2:
        return data[5:]
    payload = np.frombuffer(data, dtype=np.uint8, offset=5)
    n = len(payload)
    out = np.empty(n, dtype=np.uint8)
    src_off = 0
    for pos in range(stride):
        chunk_len = (n - pos + stride - 1) // stride  # ceil((n - pos) / stride)
        out[pos::stride] = payload[src_off:src_off + chunk_len]
        src_off += chunk_len
    return out.tobytes()

with open("models/nm_default.int6.ptz", "rb") as f:
    raw = brotli.decompress(f.read())
state = torch.load(io.BytesIO(_byte_unshuffle(raw)), map_location="cpu", weights_only=False)

print(list(state.keys())) # ['w', 'm']
print(len(state["w"]), "quantized tensor entries") # 207
print(list(state["w"].keys())[:4])
# ['blocks.0.attn.c_q.weight.q',
# 'blocks.0.attn.c_q.weight.scale',
# 'blocks.0.attn.proj.weight.q',
# 'blocks.0.attn.proj.weight.scale']
print(list(state["m"].items())[:1])
# [('blocks.0.attn.c_q.weight', 'gptq (int6)')]
```

This is enough to confirm the artifacts are well-formed int6 GPTQ-quantized models with the expected layer structure on a laptop, no GPU required. The byte-shuffle step (`_byte_shuffle`/`_byte_unshuffle` at `train_gpt.py:1976-2002`) is part of PR #1874's compression pipeline, not something we added.

---

## Compute Cost of This Negative Result

~$12 of 8×H100 SXM time on RunPod (one full 600s training run + one short 180s smoke run + diagnostic capture). Posted publicly so the next person doesn't repeat it.

---

## Hardware

8 × H100 80 GB SXM (RunPod), PyTorch 2.9.1 + CUDA 12.8, FlashAttention 3 (`cu128_torch291` wheel).

---

## Acknowledgements

Thanks to **@AjAnubolu** for [PR #1874](https://github.com/openai/parameter-golf/pull/1874) (the base stack we grafted onto). Newton-Muon idea credit goes to the broader Newton–Schulz orthogonalization literature in the leaderboard discussion.

Submitted by:
- **Saicharan Ramineni** ([@GodlyDonuts](https://github.com/GodlyDonuts))
- csramineni@gmail.com
---

*(Three binary `models/*.int6.ptz` artifacts are included in the submission but not shown in the diff.)*

## `requirements.txt`

```
numpy
tqdm
torch
huggingface-hub
kernels
setuptools
typing-extensions==4.15.0
datasets
tiktoken
sentencepiece
brotli
zstandard
```

## `submission.json`

```json
{
  "author": "Saicharan Ramineni",
  "github_id": "GodlyDonuts",
  "name": "Newton-Muon × Document-Packing — Negative Result",
  "blurb": "Grafting hook-based Newton–Schulz residual orthogonalization (Newton-Muon) onto PR #1874's full stack regresses val_bpb by +0.0378 nat in a controlled same-seed A/B. Root cause is dynamo recompile fragmentation: the per-module integer counter `_nm_K_count` is mutated inside a forward-pre-hook, dynamo treats it as a static attribute, every transformer block hits config.recompile_limit=16 within ~10 steps, fullgraph compilation silently breaks, FA3 fused kernels stop emitting cleanly, and step time inflates ~2.4×. Filed as a non-record submission with full diagnostic logs so other competitors don't repeat the dead end.",
  "date": "2026-04-28",
  "track": "non_record_16mb",
  "result": "negative",
  "result_summary": {
    "newton_muon_enabled_val_bpb": 1.10705440,
    "pr_1874_baseline_same_seed_val_bpb": 1.06927777,
    "regression_nats": 0.03777663,
    "regression_direction": "Newton-Muon REGRESSES vs baseline by +0.0378 nat",
    "controlled_for": "identical seed=42, identical SP8192 dataset, identical batch size, identical step count, identical hardware. Only delta is NEWTON_MUON_ENABLED=1."
  },
  "root_cause_summary": "Per-module Python int `_nm_K_count` updated inside a forward-pre-hook is treated as a static attribute by torch._dynamo. It mutates every step → every block recompiles → hits config.recompile_limit=16 within ~10 steps → dynamo falls back to eager → fullgraph=True is silently violated → FA3 fused attention kernels stop emitting cleanly → ~2.4× wall-clock slowdown → fewer training steps fit in the 600s budget → final val_bpb regresses by +0.0378 nat.",
  "diagnostic_evidence": {
    "log_lines": [
      "torch._dynamo hit config.recompile_limit (16)",
      "function: 'forward'",
      "last reason: 0/15: self._modules['blocks']._modules['0']._modules['attn']._modules['attn_gate_proj']._nm_K_count == 1474560"
    ],
    "pytorch_hint": "torch.compile considers integer attributes of the nn.Module to be static. If you are observing recompilation, you might want to make this integer dynamic using torch._dynamo.config.allow_unspec_int_on_nn_module = True, or convert this integer into a tensor.",
    "we_tried_the_hint": "Setting allow_unspec_int_on_nn_module = True suppressed the recompile warnings but the underlying graph fragmentation persisted, because dynamo still has to reconcile the variable-length cu_seqlens (from PR #1874's document-packed loader) against a now-dynamic counter, and FA3's specialized paths get bypassed regardless."
  },
  "compatibility": {
    "compatible_with_static_seq_len_loaders": "likely yes (untested in this submission)",
    "compatible_with_pr_1874_document_packing_loader": false,
    "suggested_fixes": [
      "Convert _nm_K_count to a torch.Tensor scalar.",
      "Move K-accumulation outside the compiled region (e.g. trigger preconditioning from optimizer.step() rather than from a forward hook).",
      "Apply preconditioning post-reduce-scatter on a non-compiled path."
    ]
  },
  "hardware": "8 × H100 80GB SXM (RunPod)",
  "pytorch_version": "2.9.1+cu128",
  "compute_spent_usd": 12,
  "attribution": {
    "newton_schulz_optimizer_family": "broad community / leaderboard discussion",
    "base_stack_we_grafted_onto": "@AjAnubolu (PR #1874)"
  }
}
```