
Predicted val_bpb ~1.054 on PR #2014 base — Gated XSA + Reverse-Chol GPTQ + Leaky 0.3 stack (code complete, asking for compute to verify) #2054

Open
anderamondarainh-stack wants to merge 4 commits into openai:main from anderamondarainh-stack:submission/v6-gated-xsa-reverse-chol-non-record

Conversation

@anderamondarainh-stack

TL;DR

Stack of four small, mechanism-orthogonal deltas on top of merged PR #2014 (1.05759 BPB); a minimal sketch of each follows the list:

  • Gated XSA — a per-head tanh(α) factor on the existing XSA subtraction, with α zero-initialized. The step-0 model is bit-identical to the baseline → strictly additive in expectation. ~16 bytes/layer artifact cost.
  • LeakyReLU² slope tightened 0.5 → 0.3 — patched in BOTH the Python path AND the fused Triton kernel (forward negative branch 0.3·c, backward derivative 2·0.3²·c = 0.18·c). Easy bug to miss: without the kernel patch you get a silent train/eval mismatch.
  • GPTQ all-rank Hessian averaging — dist.all_reduce on the calibration Hessians across world_size before normalizing. Smooths out per-rank noise.
  • Reverse-Cholesky Hinv — H_flip = flip(H, (0,1)); L = cholesky(H_flip); U = flip(L, (0,1)); Hinv = solve_triangular(U, eye, upper=True). Mathematically equivalent to the original cholesky_inverse + re-cholesky path but ~2× faster, freeing a few seconds back into training.
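A minimal sketch of the gated subtraction (module name, tensor shapes, and the separate `xsa_term` argument are illustrative assumptions, not the actual PR code):

```python
import torch
import torch.nn as nn

class GatedXSA(nn.Module):
    """Per-head tanh(alpha) gate on an existing XSA subtraction, zero-initialized."""

    def __init__(self, n_heads: int):
        super().__init__()
        # One scalar per head; zeros => tanh(alpha) = 0, so the step-0 output is
        # bit-identical to the ungated baseline (e.g. 8 heads in bf16 ~ 16 bytes/layer).
        self.alpha = nn.Parameter(torch.zeros(n_heads))

    def forward(self, attn_out: torch.Tensor, xsa_term: torch.Tensor) -> torch.Tensor:
        # attn_out, xsa_term: (batch, n_heads, seq_len, head_dim)
        gate = torch.tanh(self.alpha).view(1, -1, 1, 1)
        return attn_out - gate * xsa_term
```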
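A plain-PyTorch reference for the tightened activation, assuming "LeakyReLU²" here means leaky_relu(x, slope) squared; the function name is mine, and the fused Triton kernel would additionally need the hand-written 0.18·c backward noted above:

```python
import torch
import torch.nn.functional as F

def leaky_relu_squared(x: torch.Tensor, neg_slope: float = 0.3) -> torch.Tensor:
    # Python reference path (autograd derives the backward automatically here).
    # For x < 0 the forward is (neg_slope * x)**2 and the analytic derivative is
    # 2 * neg_slope**2 * x (= 0.18 * x at slope 0.3), which is what a fused Triton
    # backward kernel has to reproduce by hand.
    y = F.leaky_relu(x, negative_slope=neg_slope)
    return y * y
```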
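A sketch of the all-rank Hessian averaging, assuming each rank accumulates an unnormalized calibration Hessian `H` and a local sample count before GPTQ runs (function and argument names are illustrative):

```python
import torch
import torch.distributed as dist

def all_reduce_hessian(H: torch.Tensor, n_samples: int) -> torch.Tensor:
    """Average GPTQ calibration Hessians across all ranks before normalizing."""
    if dist.is_available() and dist.is_initialized():
        # Sum the unnormalized Hessians and the sample counts world-wide, so every
        # rank normalizes by the same global count and quantizes against identical stats.
        dist.all_reduce(H, op=dist.ReduceOp.SUM)
        count = torch.tensor([float(n_samples)], device=H.device, dtype=H.dtype)
        dist.all_reduce(count, op=dist.ReduceOp.SUM)
        return H / count
    return H / n_samples
```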
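And a sketch of the reverse-Cholesky factor computation; the function name is mine, but the flip/cholesky/flip/solve sequence is the one listed above:

```python
import torch

def reverse_cholesky_factor(H: torch.Tensor) -> torch.Tensor:
    # Flipping H along both axes, taking a lower Cholesky factor, and flipping back
    # gives an upper-triangular U with H = U @ U.T. Its inverse U^{-1} is then the
    # upper Cholesky factor of H^{-1}, i.e. the factor standard GPTQ implementations
    # otherwise obtain via cholesky_inverse plus a second cholesky.
    # (Real GPTQ code typically damps H before this step.)
    n = H.shape[0]
    eye = torch.eye(n, device=H.device, dtype=H.dtype)
    H_flip = torch.flip(H, (0, 1))
    L = torch.linalg.cholesky(H_flip)   # lower-triangular factor of the flipped Hessian
    U = torch.flip(L, (0, 1))           # upper-triangular, H = U @ U.T
    return torch.linalg.solve_triangular(U, eye, upper=True)  # = U^{-1}
```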

Five surgical edits, +35 LoC vs PR #2014. The new behaviors are enabled by default via GATED_XSA=1 and GPTQ_ALL_REDUCE=1; everything else inherits from #2014.

Predicted val_bpb: 1.052 (optimistic) — 1.054 (best guess) — 1.056 (conservative)

Base PR #2014 = 1.05759 (3-seed mean, std 0.00034, confirmed in their logs).

| Delta | Expected ΔBPB | Justification |
| --- | --- | --- |
| Gated XSA (zero-init) | -0.001 to -0.003 | Modded-nanogpt PR #264, p=0.0014. Zero-init guarantees no regression at α=0; finite-step training can only improve from there. |
| Leaky 0.5 → 0.3 | -0.0007 | Sweep result from PR #1948, isolated. Tighter negative slope improves post-quant behavior. |
| GPTQ all-rank Hessian | -0.0001 to -0.0005 | Calibration-noise reduction across 8 ranks. Cheap insurance, free. |
| Reverse-Chol Hinv | 0 BPB directly; returns 3-5 s of training budget | Algorithmic identity, just faster. The freed budget translates to ~-0.0003 via extra training steps. |

Summed expectation: -0.002 to -0.005 BPB → predicted val_bpb ≈ 1.052 — 1.056, best guess 1.054.

Each delta operates on a different subsystem (model, activation, calibration, optimizer pipeline). No overlap → should compose linearly within the base's noise floor (std 0.00034).

If the prediction lands, this would sit competitively in the legit top cluster.

Why I think it's worth a single seed

Reproducing

```bash
bash launch.sh # SEED=42 default; SEED=314 / SEED=0 also reasonable for 3-seed
```

Needs: 8×H100 80 GB SXM, PyTorch 2.9.1+cu128, Triton 3.5+, FA3, system `lrzip`, the standard CaseOps SP8192 FineWeb shards, and the included tokenizer.

The honest part — why no log

I'm a solo participant working from Spain on personal credit. I applied multiple times for the compute credits the competition was offering and didn't get a response. With the deadline today I pooled what I could and rented an 8×H100 pod for the final stretch — uploaded the code, kicked off seed 42, watched it pass warmup and start training… and then the pod's TCP gateway went down before the run completed. By the time I was writing this I had no remaining budget and no way to retry before deadline.

So I'm submitting what I have: clean code, an honest prediction with the math behind it, and a request. If anyone with infra can spare a single 8×H100 seed, I would be incredibly grateful. Happy to coordinate over email / discord / wherever, simplify anything for a reviewer, or just hear back that it didn't reproduce so I know.

Either way — thank you for putting this competition together. It's been a phenomenal learning experience even without making the scoreboard.

— Ander

Commits

  • stacking known improvements from recent PRs; still need to test on actual hardware and will probably switch to int6 later.
  • big update: switched from int8+zlib to int6+lzma for better compression; added sliding-window eval (stride=64), XSA on the last 4 layers, partial RoPE (16/64 dims), per-layer LN scaling, and late QAT (kicks in at 15% LR). Should squeeze a lot more out of the 16 MB budget.
