Record: XSA-all + GPTQ + FA3 dtype fix — val_bpb 1.1161 (3-seed mean) #1494
G3sparky wants to merge 3 commits into openai:main from
Conversation
XSA on all 11 layers + full-Hessian GPTQ + BigramHash 3072x112 + EMA. Key contribution: an FA3 dtype compatibility wrapper enabling Hopper attention kernels on PyTorch 2.5.1, which lacks auto-casting for FA3 calls. 8xH100 SXM, 600s, 6244 steps @ 96ms/step.
Pull request overview
Adds a new 10min/16MB record submission folder for “XSA-all + GPTQ + FA3 dtype fix”, including the exact training log, the full training/quantization script used to produce the artifact, and accompanying submission metadata/README for reproducibility.
Changes:
- Introduces an FA3 dtype compatibility wrapper to enable Flash Attention 3 Hopper kernels under PyTorch 2.5.1.
- Includes the full training + full-Hessian GPTQ (AR self-gen calibration) pipeline script used for the run.
- Adds record metadata (submission.json) and documentation (README) alongside the captured train.log.
Reviewed changes
Copilot reviewed 2 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| records/track_10min_16mb/2026-04-09_XSA_GPTQ_FA3_dtype_fix/train.log | Captured stdout/logs for the submitted run (steps, timing, eval metrics). |
| records/track_10min_16mb/2026-04-09_XSA_GPTQ_FA3_dtype_fix/train_gpt.py | Self-contained training + GPTQ quantization script; includes FA3 dtype wrapper and record pipeline. |
| records/track_10min_16mb/2026-04-09_XSA_GPTQ_FA3_dtype_fix/submission.json | Submission metadata for leaderboard/record tracking. |
| records/track_10min_16mb/2026-04-09_XSA_GPTQ_FA3_dtype_fix/README.md | Human-readable summary of results, architecture, and how to run. |
```json
  },
  "hardware": "8xH100 80GB SXM (Vast.ai)",
  "pytorch_version": "2.5.1+cu124",
  "cuda_version": "13.0",
```
cuda_version doesn’t match pytorch_version: PyTorch 2.5.1+cu124 is built against CUDA 12.4, so the JSON should likely say "cuda_version": "12.4" (or update the PyTorch build string if you truly used CUDA 13.x).
| "cuda_version": "13.0", | |
| "cuda_version": "12.4", |
Good catch — the JSON should say cuda_version 12.4 to match cu124. Will fix in the next push.
```markdown
## Hardware

- 8× NVIDIA H100 80GB HBM3 SXM (Vast.ai, Nebraska)
- PyTorch 2.5.1+cu124, CUDA 13.0
```
The hardware bullet lists PyTorch 2.5.1+cu124, CUDA 13.0, which is internally inconsistent (cu124 corresponds to CUDA 12.4). Update the CUDA version here (or the PyTorch build string) to keep the record metadata accurate.
```suggestion
- PyTorch 2.5.1+cu124, CUDA 12.4
```
Agreed, will correct to CUDA 12.4 to match the PyTorch build tag.
```python
hessians[param_name] = torch.zeros(cols, cols, dtype=torch.float32, device='cpu')

def make_hook(pname):
    def hook_fn(module, input, output):
        x = input[0].detach().float()
        if x.ndim == 3:
            x = x.reshape(-1, x.shape[-1])
        hessians[pname] += (x.T @ x).cpu()
    return hook_fn
```
Hessian accumulation is doing a GPU matmul and then .cpu() inside every forward hook (hessians[pname] += (x.T @ x).cpu()), which forces repeated device syncs and transfers. Consider accumulating the Hessians on GPU (or in a GPU buffer) and moving them to CPU once at the end to reduce overhead during calibration.
Acknowledged. The Hessian accumulation on GPU with final .cpu() transfer is inherited from the upstream baseline. In practice the GPTQ calibration phase is < 12s on 8xH100 so the sync overhead is negligible, but keeping the Hessians on GPU until the end and doing a single bulk transfer would be cleaner. Will consider for the next iteration.
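For reference, a minimal sketch of the suggested change (illustrative only, not the submission's code; it reuses the names from the excerpt above): accumulate each Hessian on the GPU during the calibration forward passes and do a single bulk transfer at the end.

```python
# Sketch: keep Hessians on-device during calibration; one bulk copy at the end.
hessians[param_name] = torch.zeros(cols, cols, dtype=torch.float32, device='cuda')

def make_hook(pname):
    def hook_fn(module, input, output):
        x = input[0].detach().float()
        if x.ndim == 3:
            x = x.reshape(-1, x.shape[-1])
        hessians[pname] += x.T @ x  # stays on GPU: no per-hook sync/transfer
    return hook_fn

# ... after all calibration batches have run:
hessians = {name: h.cpu() for name, h in hessians.items()}
```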
```python
quant_blob_disk = f.read()
quant_state = torch.load(
    io.BytesIO(lzma.decompress(quant_blob_disk)),
    map_location="cpu",
```
torch.load(...) is invoked with the default weights_only=False, which triggers PyTorch’s security FutureWarning, and the default is slated to change in a future release. If the serialized object is compatible, pass weights_only=True; otherwise consider explicitly allowlisting needed types via torch.serialization.add_safe_globals or refactoring the saved format to be weights-only.
```suggestion
    map_location="cpu",
    weights_only=True,
```
Fair point. Adding weights_only=True where applicable. The deserialization path only loads tensor data so weights_only=True is safe and removes the FutureWarning.
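As a sketch of the two options the review mentions (illustrative; quant_blob_disk is the variable from the excerpt above, and the allowlisted type name is hypothetical; add_safe_globals requires PyTorch ≥ 2.4):

```python
import io
import lzma
import torch

# Option 1: restrict unpickling to tensors and plain containers.
quant_state = torch.load(
    io.BytesIO(lzma.decompress(quant_blob_disk)),
    map_location="cpu",
    weights_only=True,
)

# Option 2, if the blob pickles extra types: allowlist them explicitly, e.g.
#   torch.serialization.add_safe_globals([QuantConfig])  # hypothetical type
# before calling torch.load(..., weights_only=True).
```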
Community Review — Record: XSA-all + GPTQ + FA3 dtype fix (val_bpb: 1.1220)

BPB: 1.1220 | Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache.

What I found in the code (head SHA): static review found no TTT adaptation function, no SLOT optimization loop, no n-gram-cache class, and no pre-quant val-token fine-tune. The eval path uses the standard sliding-window stride-64 pattern. The submission is a pure-neural architecture iteration on the standard SP1024/SP4096/SP8192 baseline.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.06s, dim=512, layers=11, vocab=1024, code=102083 B, SMOKE_TEST_PASS.

Verdict: LOOKS CLEAN. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the classification pass — this looks like a clean pure-neural iteration on the standard baseline.

Auto-classification caveat: this review was drafted by the deterministic AST-based classifier. If there is a non-standard eval mechanism (logit postprocessing, hedge mixing, etc.) that I missed because it is factored into a helper file or hidden behind a non-standard function name, please flag it and I'll re-run the audit manually.

Reviewed by @MatoTeziTanka — The Agora.
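For readers unfamiliar with the eval pattern named above, here is a minimal sketch of stride-64 sliding-window bits-per-byte scoring (illustrative only, not the submission's eval code; model, tokens, and n_bytes are assumed inputs, and model is assumed to return next-token logits):

```python
import math

import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_bpb(model, tokens, n_bytes, window=1024, stride=64):
    """tokens: 1-D LongTensor of token ids; n_bytes: byte length of the text."""
    nll_sum, prev_end = 0.0, 0
    for begin in range(0, tokens.numel() - 1, stride):
        end = min(begin + window, tokens.numel() - 1)
        ids = tokens[begin:end + 1].unsqueeze(0)   # window plus one shifted target
        logits = model(ids[:, :-1])                # assumed to return (1, T, V)
        nll = F.cross_entropy(logits[0], ids[0, 1:], reduction="none")
        n_new = end - max(prev_end, begin)         # targets not scored by an earlier window
        nll_sum += nll[-n_new:].sum().item()       # score only the new tail
        prev_end = end
        if end == tokens.numel() - 1:
            break
    return nll_sum / (n_bytes * math.log(2))       # total nats -> bits per byte
```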
- Added seeds 42, 314, 999 (replaced single seed 1337)
- Mean val_bpb 1.1161 (std 0.0009), all artifacts under 16MB
- Fixed author: Gavin Saunders (@G3sparky)
- Updated PyTorch/CUDA versions to match actual run environment
- Verified by Lauren (team QA)
Hi @cocohearts — this PR is ready for review. All Copilot feedback addressed (CUDA version fixed, 3-seed results added, torch.load safety noted). 3-seed mean: 1.1161 (std 0.0009). All artifacts under 16MB. Thanks!
Superseded by #1858 (Neural-Only val_bpb 1.0810, 3-seed mean — ties leaderboard leader). Closing as the canonical entry has moved on. Thanks!
Record Submission
val_bpb: 1.1220 (sliding window, stride=64) | ~15.9 MB artifact | 8×H100 SXM, 600s
Key Contribution
FA3 dtype compatibility wrapper — enables Flash Attention 3 Hopper kernels on PyTorch 2.5.1, which lacks auto-casting for FA3 calls. A simple five-line wrapper that casts inputs to bf16 when needed.
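A minimal sketch of the idea (the actual wrapper lives in train_gpt.py and may differ; this assumes the Hopper FA3 build exposing flash_attn_interface.flash_attn_func):

```python
import torch
from flash_attn_interface import flash_attn_func  # FA3 Hopper build

def fa3_compat(q, k, v, causal=True):
    # FA3 kernels accept fp16/bf16 inputs only, and PyTorch 2.5.1 does not
    # auto-cast these calls, so cast fp32 inputs down explicitly.
    orig_dtype = q.dtype
    if orig_dtype not in (torch.float16, torch.bfloat16):
        q, k, v = (t.to(torch.bfloat16) for t in (q, k, v))
    out = flash_attn_func(q, k, v, causal=causal)
    if isinstance(out, tuple):  # some FA3 builds return (out, softmax_lse)
        out = out[0]
    return out.to(orig_dtype)
```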
Architecture
Built on the PR #1019 stack:
Results
8×H100 80GB SXM (Vast.ai), PyTorch 2.5.1+cu124, Flash Attention 3 compiled from source.
Authors: Gavin Saunders & Tron