
Predicted val_bpb ~1.054 on PR #2014 base — Gated XSA + Reverse-Chol GPTQ + Leaky 0.3 stack (code complete, asking for compute to verify) #2054

Open
anderamondarainh-stack wants to merge 4 commits into openai:main from anderamondarainh-stack:submission/v6-gated-xsa-reverse-chol-non-record

Conversation

@anderamondarainh-stack

TL;DR

Stack of four small, mechanism-orthogonal deltas on top of merged PR #2014 (1.05759 BPB); a minimal sketch of each follows the list:

  • Gated XSA — a per-head tanh(α) factor on the existing XSA subtraction, with α zero-initialized. The step-0 model is bit-identical to the baseline → strictly additive in expectation. ~16 bytes/layer artifact cost.
  • LeakyReLU² slope tightened 0.5 → 0.3 — patched in BOTH the Python path AND the fused Triton kernel (forward negative branch 0.3·c, backward derivative 2·0.3²·c = 0.18·c). Easy bug to miss: without the kernel patch you get a silent train/eval mismatch.
  • GPTQ all-rank Hessian averaging — dist.all_reduce on the calibration Hessians across world_size before normalizing. Smooths out per-rank noise.
  • Reverse-Cholesky Hinv — H_flip = flip(H, (0,1)); L = cholesky(H_flip); U = flip(L, (0,1)); Hinv = solve_triangular(U, eye, upper=True). Mathematically equivalent to the original cholesky_inverse + re-cholesky path but ~2× faster, freeing a few seconds back into training.
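A minimal sketch of the gated subtraction (module name, tensor shapes, and the separate `xsa_term` argument are illustrative assumptions, not the actual PR code):

```python
import torch
import torch.nn as nn

class GatedXSA(nn.Module):
    """Per-head tanh(alpha) gate on an existing XSA subtraction, zero-initialized."""

    def __init__(self, n_heads: int):
        super().__init__()
        # One scalar per head; zeros => tanh(alpha) = 0, so the step-0 output is
        # bit-identical to the ungated baseline (e.g. 8 heads in bf16 ~ 16 bytes/layer).
        self.alpha = nn.Parameter(torch.zeros(n_heads))

    def forward(self, attn_out: torch.Tensor, xsa_term: torch.Tensor) -> torch.Tensor:
        # attn_out, xsa_term: (batch, n_heads, seq_len, head_dim)
        gate = torch.tanh(self.alpha).view(1, -1, 1, 1)
        return attn_out - gate * xsa_term
```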
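A plain-PyTorch reference for the tightened activation, assuming "LeakyReLU²" here means leaky_relu(x, slope) squared; the function name is mine, and the fused Triton kernel would additionally need the hand-written 0.18·c backward noted above:

```python
import torch
import torch.nn.functional as F

def leaky_relu_squared(x: torch.Tensor, neg_slope: float = 0.3) -> torch.Tensor:
    # Python reference path (autograd derives the backward automatically here).
    # For x < 0 the forward is (neg_slope * x)**2 and the analytic derivative is
    # 2 * neg_slope**2 * x (= 0.18 * x at slope 0.3), which is what a fused Triton
    # backward kernel has to reproduce by hand.
    y = F.leaky_relu(x, negative_slope=neg_slope)
    return y * y
```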
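A sketch of the all-rank Hessian averaging, assuming each rank accumulates an unnormalized calibration Hessian `H` and a local sample count before GPTQ runs (function and argument names are illustrative):

```python
import torch
import torch.distributed as dist

def all_reduce_hessian(H: torch.Tensor, n_samples: int) -> torch.Tensor:
    """Average GPTQ calibration Hessians across all ranks before normalizing."""
    if dist.is_available() and dist.is_initialized():
        # Sum the unnormalized Hessians and the sample counts world-wide, so every
        # rank normalizes by the same global count and quantizes against identical stats.
        dist.all_reduce(H, op=dist.ReduceOp.SUM)
        count = torch.tensor([float(n_samples)], device=H.device, dtype=H.dtype)
        dist.all_reduce(count, op=dist.ReduceOp.SUM)
        return H / count
    return H / n_samples
```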
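And a sketch of the reverse-Cholesky factor computation; the function name is mine, but the flip/cholesky/flip/solve sequence is the one listed above:

```python
import torch

def reverse_cholesky_factor(H: torch.Tensor) -> torch.Tensor:
    # Flipping H along both axes, taking a lower Cholesky factor, and flipping back
    # gives an upper-triangular U with H = U @ U.T. Its inverse U^{-1} is then the
    # upper Cholesky factor of H^{-1}, i.e. the factor standard GPTQ implementations
    # otherwise obtain via cholesky_inverse plus a second cholesky.
    # (Real GPTQ code typically damps H before this step.)
    n = H.shape[0]
    eye = torch.eye(n, device=H.device, dtype=H.dtype)
    H_flip = torch.flip(H, (0, 1))
    L = torch.linalg.cholesky(H_flip)   # lower-triangular factor of the flipped Hessian
    U = torch.flip(L, (0, 1))           # upper-triangular, H = U @ U.T
    return torch.linalg.solve_triangular(U, eye, upper=True)  # = U^{-1}
```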

Five surgical edits, +35 LoC vs PR #2014. The new behaviors are enabled by default via GATED_XSA=1 and GPTQ_ALL_REDUCE=1; everything else inherits from #2014.

Predicted val_bpb: 1.052 (optimistic) — 1.054 (best guess) — 1.056 (conservative)

Base PR #2014 = 1.05759 (3-seed mean, std 0.00034, confirmed in their logs).

| Delta | Expected ΔBPB | Justification |
| --- | --- | --- |
| Gated XSA (zero-init) | -0.001 to -0.003 | Modded-nanogpt PR #264, p=0.0014. Zero-init guarantees no regression at α=0; finite-step training can only improve from there. |
| Leaky 0.5 → 0.3 | -0.0007 | Sweep result from PR #1948, isolated. Tighter negative slope improves post-quant behavior. |
| GPTQ all-rank Hessian | -0.0001 to -0.0005 | Calibration-noise reduction across 8 ranks. Cheap insurance, free. |
| Reverse-Chol Hinv | 0 BPB directly; returns 3-5 s of training budget | Algorithmic identity, just faster. The freed budget translates to ~-0.0003 via extra training steps. |

Summed expectation: -0.002 to -0.005 BPB → predicted val_bpb ≈ 1.052 — 1.056, best guess 1.054.

Each delta operates on a different subsystem (model, activation, calibration, optimizer pipeline). No overlap → should compose linearly within the base's noise floor (std 0.00034).

If the prediction lands, this would sit competitively in the legit top cluster.

Why I think it's worth a single seed

Reproducing

```bash
bash launch.sh # SEED=42 default; SEED=314 / SEED=0 also reasonable for 3-seed
```

Needs: 8×H100 80 GB SXM, PyTorch 2.9.1+cu128, Triton 3.5+, FA3, system `lrzip`, the standard CaseOps SP8192 FineWeb shards, and the included tokenizer.

The honest part — why no log

I'm a solo participant working from Spain on personal credit. I applied multiple times for the compute credits the competition was offering and didn't get a response. With the deadline today I pooled what I could and rented an 8×H100 pod for the final stretch — uploaded the code, kicked off seed 42, watched it pass warmup and start training… and then the pod's TCP gateway went down before the run completed. By the time I was writing this I had no remaining budget and no way to retry before deadline.

So I'm submitting what I have: clean code, an honest prediction with the math behind it, and a request. If anyone with infra can spare a single 8×H100 seed, I would be incredibly grateful. Happy to coordinate over email / discord / wherever, simplify anything for a reviewer, or just hear back that it didn't reproduce so I know.

Either way — thank you for putting this competition together. It's been a phenomenal learning experience even without making the scoreboard.

— Ander

Commits

  • stacking known improvements from recent PRs; still need to test on actual hardware and will probably switch to int6 later.
  • big update: switched from int8+zlib to int6+lzma for better compression; added sliding-window eval (stride=64), XSA on the last 4 layers, partial RoPE (16/64 dims), per-layer LN scaling, and late QAT (kicks in at 15% LR). Should squeeze a lot more out of the 16 MB budget.
