Predicted val_bpb ~1.054 on PR #2014 base — Gated XSA + Reverse-Chol GPTQ + Leaky 0.3 stack (code complete, asking for compute to verify) #2054
Open
anderamondarainh-stack wants to merge 4 commits into openai:main from
Stacking known improvements from recent PRs. Still need to test on actual hardware, and will probably switch to int6 later.
Big update: switched from int8+zlib to int6+lzma for better compression, added sliding-window eval (stride=64; see the sketch below), XSA on the last 4 layers, partial RoPE (16/64 dims), per-layer LN scaling, and late QAT (kicks in at 15% LR). Should squeeze a lot more out of the 16 MB budget.
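For reference, a minimal sketch of what a stride-64 sliding-window eval looks like. This is my illustrative harness, not the repo's eval code; `model(x)` returning `(1, T, vocab)` logits and the bits-per-byte normalization by `n_bytes` are assumptions:

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_bpb(model, ids: torch.Tensor, n_bytes: int,
                       window: int = 1024, stride: int = 64) -> float:
    """Score each new chunk of `stride` tokens with up to `window` tokens of
    left context, so tokens near chunk starts are not evaluated cold."""
    total_nll = 0.0
    for end in range(stride, ids.numel() + 1, stride):
        start = max(0, end - window)
        x = ids[start:end].unsqueeze(0)            # (1, T) window of token ids
        logits = model(x)                          # (1, T, vocab) assumed
        targets = ids[start + 1:end]
        nll = F.cross_entropy(logits[0, :-1], targets, reduction="none")
        total_nll += nll[-stride:].sum().item()    # count only the new tokens
    return total_nll / (math.log(2) * n_bytes)     # nats -> bits, per byte
```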
TL;DR
Stack of four small, mechanism-orthogonal deltas on top of merged PR #2014 (1.05759 BPB):
1. **Gated XSA**: a `tanh(α)` factor on the existing XSA subtraction, α zero-init. The step-0 model is bit-identical to baseline → strictly additive in expectation. ~16 bytes/layer artifact cost. (Sketch after this list.)
2. **Leaky 0.3**: leaky negative branch on the activation (forward `(0.3·c)²`, backward derivative `2·0.3²·c = 0.18·c`). Easy bug to miss: without the kernel patch you get a silent train/eval mismatch. (Sketch after this list.)
3. **GPTQ Hessian all-reduce**: `dist.all_reduce` on calibration Hessians across world_size before normalizing. Smooths per-rank noise. (Folded into the GPTQ sketch below.)
4. **Reverse-Chol GPTQ**: computes `Hinv` via `H_flip = flip(H,(0,1)); L = cholesky(H_flip); U = flip(L,(0,1)); Hinv = solve_triangular(U, eye, upper=True)`. Mathematically equivalent to the original `cholesky_inverse + re-cholesky` but ~2× faster. Frees a few seconds back into training. (Sketch below.)

5 surgical edits, +35 LoC vs PR #2014. Defaults flipped on via `GATED_XSA=1` and `GPTQ_ALL_REDUCE=1`. Everything else inherits from #2014.
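A minimal sketch of delta 1, assuming the XSA term from #2014 arrives as a precomputed tensor `xsa`; module and argument names here are illustrative, not the PR's actual code:

```python
import torch
import torch.nn as nn

class GatedXSA(nn.Module):
    """tanh(alpha)-gated version of the existing XSA subtraction.

    alpha is zero-initialized, so tanh(alpha) = 0 and the step-0 model is
    bit-identical to the ungated baseline; the gate only moves away from
    zero if the gradient pushes it there.
    """
    def __init__(self) -> None:
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(()))  # one scalar per layer

    def forward(self, x: torch.Tensor, xsa: torch.Tensor) -> torch.Tensor:
        # Baseline computes `x - xsa`; the gate just scales that subtraction.
        return x - torch.tanh(self.alpha) * xsa
```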
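Delta 2 as an eager-mode sketch, assuming the base activation is the squared ReLU (which is what makes the `(0.3·c)²` forward and `0.18·c` derivative consistent):

```python
import torch

class LeakySquaredReLU(torch.autograd.Function):
    """Squared ReLU with a leaky-0.3 negative branch (illustrative sketch).

    forward:  c >= 0 -> c^2         backward: 2*c
              c <  0 -> (0.3*c)^2             2*(0.3^2)*c = 0.18*c
    """

    @staticmethod
    def forward(ctx, c: torch.Tensor) -> torch.Tensor:
        ctx.save_for_backward(c)
        return torch.where(c >= 0, c * c, (0.3 * c) ** 2)

    @staticmethod
    def backward(ctx, grad_out: torch.Tensor) -> torch.Tensor:
        (c,) = ctx.saved_tensors
        return grad_out * torch.where(c >= 0, 2.0 * c, 0.18 * c)
```

The kernel-patch warning presumably means the fused training kernel must be updated to match this definition; patch only one path and train and eval silently compute different functions.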
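Deltas 3 and 4 in one sketch, assuming `H` is the damped, symmetric positive-definite calibration Hessian and the stock GPTQ path is the usual `cholesky → cholesky_inverse → cholesky(upper=True)` pipeline; the function name and `dist` handling are illustrative:

```python
import torch
import torch.distributed as dist

def gptq_inverse_factor(H: torch.Tensor) -> torch.Tensor:
    """Reverse-Cholesky route to the upper-triangular factor Hinv with
    H^{-1} = Hinv.T @ Hinv: one factorization plus one triangular solve
    instead of the stock cholesky -> cholesky_inverse -> cholesky chain."""
    if dist.is_available() and dist.is_initialized():
        # Delta 3: average per-rank calibration Hessians before inverting.
        dist.all_reduce(H)              # default reduce op is SUM
        H /= dist.get_world_size()
    # Delta 4: flip(H) = P·H·P with P the index-reversal permutation, so the
    # lower Cholesky factor of the flipped matrix, flipped back, is an
    # upper-triangular U with H = U @ U.T.
    L = torch.linalg.cholesky(torch.flip(H, (0, 1)))
    U = torch.flip(L, (0, 1))
    eye = torch.eye(H.shape[-1], dtype=H.dtype, device=H.device)
    # U^{-1} is upper triangular and H^{-1} = U^{-T} @ U^{-1} = Hinv.T @ Hinv,
    # the same factor the stock pipeline produces (Cholesky is unique).
    return torch.linalg.solve_triangular(U, eye, upper=True)

# Quick equivalence check on a random SPD matrix:
# A = torch.randn(64, 64, dtype=torch.float64)
# H = A @ A.T + 64 * torch.eye(64, dtype=torch.float64)
# ref = torch.linalg.cholesky(torch.cholesky_inverse(torch.linalg.cholesky(H)), upper=True)
# torch.testing.assert_close(gptq_inverse_factor(H.clone()), ref)
```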
Predicted val_bpb: 1.052 (optimistic) / 1.054 (best guess) / 1.056 (conservative)
Base PR #2014 = 1.05759 (3-seed mean, std 0.00034, confirmed in their logs).
Summed expectation: -0.002 to -0.005 BPB → predicted val_bpb ≈ 1.052–1.056, best guess 1.054.
Each delta operates on a different subsystem (model, activation, calibration, optimizer pipeline). No overlap → should compose linearly within the base's noise floor (std 0.00034).
If the prediction lands, this would sit competitively in the legit top cluster.
Why I think it's worth a single seed
`launch.sh` has the complete env block, identical to PR #2014's reproducer ("Record: PR1855/PR1953 base + Progressive context growth (val_bpb: 1.05759, 3-seed)") plus the two new flags.

Reproducing
```bash
bash launch.sh # SEED=42 default; SEED=314 / SEED=0 also reasonable for 3-seed
```
Needs: 8×H100 80GB SXM, PyTorch 2.9.1+cu128, Triton 3.5+, FA3, system `lrzip`, the standard CaseOps SP8192 FineWeb shards + the included tokenizer.
The honest part — why no log
I'm a solo participant working from Spain on personal credit. I applied multiple times for the compute credits the competition was offering and didn't get a response. With the deadline today I pooled what I could and rented an 8×H100 pod for the final stretch — uploaded the code, kicked off seed 42, watched it pass warmup and start training… and then the pod's TCP gateway went down before the run completed. By the time I was writing this I had no remaining budget and no way to retry before deadline.
So I'm submitting what I have: clean code, an honest prediction with the math behind it, and a request. If anyone with infra can spare a single 8×H100 seed, I would be incredibly grateful. Happy to coordinate over email / discord / wherever, simplify anything for a reviewer, or just hear back that it didn't reproduce so I know.
Either way — thank you for putting this competition together. It's been a phenomenal learning experience even without making the scoreboard.
— Ander