
Record: XSA-all + GPTQ + FA3 dtype fix — val_bpb 1.1161 (3-seed mean) #1494

Closed

G3sparky wants to merge 3 commits into openai:main from G3sparky:submission/xsa-gptq-fa3-fix

Conversation


@G3sparky G3sparky commented Apr 9, 2026

Record Submission

val_bpb: 1.1220 (sliding window, stride=64) | ~15.9 MB artifact | 8×H100 SXM, 600s

Key Contribution

FA3 dtype compatibility wrapper — enables Flash Attention 3 Hopper kernels on PyTorch 2.5.1, which lacks auto-casting for FA3 calls. A simple 5-line wrapper casts inputs to bf16 when needed.
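A hypothetical sketch of such a wrapper, assuming the Hopper FA3 build that exposes flash_attn_func via flash_attn_interface (the PR's actual code may differ):

```python
import torch
from flash_attn_interface import flash_attn_func  # FA3 Hopper build

def fa3_attention(q, k, v, causal=True):
    # FA3 kernels accept only fp16/bf16; PyTorch 2.5.1 won't auto-cast here.
    orig_dtype = q.dtype
    if orig_dtype not in (torch.float16, torch.bfloat16):
        q, k, v = (t.to(torch.bfloat16) for t in (q, k, v))
    out = flash_attn_func(q, k, v, causal=causal)
    if isinstance(out, tuple):  # some FA3 builds also return softmax_lse
        out = out[0]
    return out.to(orig_dtype)  # restore the caller's dtype
```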

Architecture

Built on the PR #1019 stack:

- 11L XSA on all layers
- Full Hessian GPTQ with AR self-gen calibration (a sketch of the calibration step follows this list)
- BigramHash 3072×112
- EMA (decay=0.997)
- LeakyReLU(0.5)²
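
A minimal sketch of the AR self-gen calibration idea, assuming the trained model samples its own calibration sequences that are then replayed through the Hessian hooks discussed in the review below (names and sampling settings are illustrative, not the record script's code):

```python
import torch

@torch.no_grad()
def self_generate_calibration(model, n_seqs=8, seq_len=256, bos_id=0, device="cuda"):
    # The trained model autoregressively samples its own calibration data,
    # so GPTQ needs no held-out corpus at quantization time.
    seqs = torch.full((n_seqs, 1), bos_id, dtype=torch.long, device=device)
    for _ in range(seq_len - 1):
        logits = model(seqs)[:, -1, :]                 # next-token logits
        probs = torch.softmax(logits.float(), dim=-1)
        seqs = torch.cat([seqs, torch.multinomial(probs, 1)], dim=1)
    return seqs  # replayed through the model with Hessian hooks attached
```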

Results

| Steps | ms/step | Sliding BPB (s64) |
|------:|--------:|------------------:|
| 6,244 | 96 | 1.1220 |

8×H100 80GB SXM (Vast.ai), PyTorch 2.5.1+cu124, Flash Attention 3 compiled from source.

Authors: Gavin Saunders & Tron

11L XSA on all layers + Full Hessian GPTQ + BigramHash 3072x112 + EMA.
Key contribution: FA3 dtype compatibility wrapper enabling Hopper
attention on PyTorch 2.5.1 without auto-casting.

8xH100 SXM, 600s, 6244 steps @ 96ms/step.
Copilot AI review requested due to automatic review settings April 9, 2026 07:52

Copilot AI left a comment


Pull request overview

Adds a new 10min/16MB record submission folder for “XSA-all + GPTQ + FA3 dtype fix”, including the exact training log, the full training/quantization script used to produce the artifact, and accompanying submission metadata/README for reproducibility.

Changes:

- Introduces an FA3 dtype compatibility wrapper to enable Flash Attention 3 Hopper kernels under PyTorch 2.5.1.
- Includes the full training + full-Hessian GPTQ (AR self-gen calibration) pipeline script used for the run.
- Adds record metadata (submission.json) and documentation (README) alongside the captured train.log.

Reviewed changes

Copilot reviewed 2 out of 4 changed files in this pull request and generated 4 comments.

| File | Description |
|------|-------------|
| records/track_10min_16mb/2026-04-09_XSA_GPTQ_FA3_dtype_fix/train.log | Captured stdout/logs for the submitted run (steps, timing, eval metrics). |
| records/track_10min_16mb/2026-04-09_XSA_GPTQ_FA3_dtype_fix/train_gpt.py | Self-contained training + GPTQ quantization script; includes FA3 dtype wrapper and record pipeline. |
| records/track_10min_16mb/2026-04-09_XSA_GPTQ_FA3_dtype_fix/submission.json | Submission metadata for leaderboard/record tracking. |
| records/track_10min_16mb/2026-04-09_XSA_GPTQ_FA3_dtype_fix/README.md | Human-readable summary of results, architecture, and how to run. |


```json
},
"hardware": "8xH100 80GB SXM (Vast.ai)",
"pytorch_version": "2.5.1+cu124",
"cuda_version": "13.0",
```

Copilot AI Apr 9, 2026


cuda_version doesn’t match pytorch_version: PyTorch 2.5.1+cu124 is built against CUDA 12.4, so the JSON should likely say "cuda_version": "12.4" (or update the PyTorch build string if you truly used CUDA 13.x).

Suggested change

```diff
- "cuda_version": "13.0",
+ "cuda_version": "12.4",
```

Author

Good catch — the JSON should say cuda_version 12.4 to match cu124. Will fix in the next push.

```markdown
## Hardware

- 8× NVIDIA H100 80GB HBM3 SXM (Vast.ai, Nebraska)
- PyTorch 2.5.1+cu124, CUDA 13.0
```

Copilot AI Apr 9, 2026


The hardware bullet lists PyTorch 2.5.1+cu124, CUDA 13.0, which is internally inconsistent (cu124 corresponds to CUDA 12.4). Update the CUDA version here (or the PyTorch build string) to keep the record metadata accurate.

Suggested change

```diff
- - PyTorch 2.5.1+cu124, CUDA 13.0
+ - PyTorch 2.5.1+cu124, CUDA 12.4
```

Author

Agreed, will correct to CUDA 12.4 to match the PyTorch build tag.

Comment on lines +1116 to +1123
```python
hessians[param_name] = torch.zeros(cols, cols, dtype=torch.float32, device='cpu')

def make_hook(pname):
    def hook_fn(module, input, output):
        x = input[0].detach().float()
        if x.ndim == 3:
            x = x.reshape(-1, x.shape[-1])
        hessians[pname] += (x.T @ x).cpu()
    return hook_fn
```

Copilot AI Apr 9, 2026


Hessian accumulation is doing a GPU matmul and then .cpu() inside every forward hook (hessians[pname] += (x.T @ x).cpu()), which forces repeated device syncs and transfers. Consider accumulating the Hessians on GPU (or in a GPU buffer) and moving them to CPU once at the end to reduce overhead during calibration.

Author

Acknowledged. The Hessian accumulation on GPU with final .cpu() transfer is inherited from the upstream baseline. In practice the GPTQ calibration phase is < 12s on 8xH100 so the sync overhead is negligible, but keeping the Hessians on GPU until the end and doing a single bulk transfer would be cleaner. Will consider for the next iteration.
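
For reference, a minimal sketch of that single-transfer pattern (mirroring the hooked snippet above; the lazy buffer allocation is an illustrative choice, not the record script's code):

```python
import torch

hessians = {}  # param_name -> fp32 buffer kept on the GPU

def make_hook(pname):
    def hook_fn(module, input, output):
        x = input[0].detach().float()
        if x.ndim == 3:
            x = x.reshape(-1, x.shape[-1])
        if pname not in hessians:  # allocate on first call, on-device
            cols = x.shape[-1]
            hessians[pname] = torch.zeros(cols, cols, dtype=torch.float32,
                                          device=x.device)
        hessians[pname] += x.T @ x  # stays on GPU; no per-call sync
    return hook_fn

# ... run the calibration forward passes ...
# Single bulk device-to-host transfer at the end:
hessians = {k: v.cpu() for k, v in hessians.items()}
```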

```python
quant_blob_disk = f.read()
quant_state = torch.load(
    io.BytesIO(lzma.decompress(quant_blob_disk)),
    map_location="cpu",
```

Copilot AI Apr 9, 2026


torch.load(...) is invoked with the default weights_only=False, which triggers PyTorch’s security FutureWarning and may become a behavioral change in future versions. If the serialized object is compatible, pass weights_only=True; otherwise consider explicitly allowlisting needed types via torch.serialization.add_safe_globals or refactoring the saved format to be weights-only.

Suggested change

```diff
  map_location="cpu",
+ weights_only=True,
```

Author

Fair point. Adding weights_only=True where applicable. The deserialization path only loads tensor data, so weights_only=True is safe and removes the FutureWarning.
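
For completeness, a sketch of the corrected load path (assuming the blob really is tensor-only; quant_path and the surrounding with-block are illustrative):

```python
import io
import lzma
import torch

with open(quant_path, "rb") as f:  # quant_path is illustrative
    quant_blob_disk = f.read()
quant_state = torch.load(
    io.BytesIO(lzma.decompress(quant_blob_disk)),
    map_location="cpu",
    weights_only=True,  # safe here: the payload contains only tensors
)
```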

@MatoTeziTanka

Community Review — Record: XSA-all + GPTQ + FA3 dtype fix (val_bpb: 1.1220)

BPB: 1.1220 | Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

What I found in the code (head SHA 7a94e720c629, file records/track_10min_16mb/2026-04-09_XSA_GPTQ_FA3_dtype_fix/train_gpt.py):

Static code review found no TTT adaptation function, no SLOT optimization loop, no n-gram-cache class, and no pre-quant val-token fine-tune. The eval path uses the standard sliding-window stride-64 pattern. The submission is a pure-neural architecture iteration on the standard SP1024/SP4096/SP8192 baseline.
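
For context, a minimal sketch of the stride-64 sliding-window scoring pattern referenced above (illustrative only, not the PR's code; assumes the model returns next-token logits and that total nats are normalized by the byte count of the eval text):

```python
import math
import torch

@torch.no_grad()
def sliding_window_bpb(model, tokens, n_bytes, window=1024, stride=64):
    # Each token is scored exactly once, with up to `window` tokens of left
    # context, advancing `stride` tokens per forward pass.
    total_nll = 0.0
    n = tokens.numel()
    for start in range(0, n - 1, stride):
        begin = max(0, start + stride - window)
        chunk = tokens[begin : start + stride + 1].unsqueeze(0)
        logits = model(chunk[:, :-1])                  # (1, T, vocab)
        logp = torch.log_softmax(logits.float(), dim=-1)
        nll = -logp.gather(-1, chunk[:, 1:, None]).squeeze(-1)
        fresh = min(stride, n - 1 - start)             # unscored tail only
        total_nll += nll[0, -fresh:].sum().item()
    return total_nll / (n_bytes * math.log(2))         # nats -> bits per byte
```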

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.06s, dim=512, layers=11, vocab=1024, code=102083 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the classification pass — this looks like a clean pure-neural iteration on the standard baseline.

Auto-classification caveat: this review was drafted by the AST-based classifier. If there's a non-standard eval mechanism (logit postprocessing, hedge mixing, etc.) that I missed because it's factored into a helper file or a non-standard function name, please flag it and I'll re-run the audit manually.


Reviewed by @MatoTeziTanka (The Agora). Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting — if the template misread your code, please call it out so I can iterate the classifier.

- Added seeds 42, 314, 999 (replaced single seed 1337)
- Mean val_bpb 1.1161 (std 0.0009), all artifacts under 16MB
- Fixed author: Gavin Saunders (@G3sparky)
- Updated PyTorch/CUDA versions to match actual run environment
- Verified by Lauren (team QA)
@G3sparky G3sparky changed the title Record: XSA-all + GPTQ + FA3 dtype fix (val_bpb: 1.1220) Record: XSA-all + GPTQ + FA3 dtype fix — val_bpb 1.1161 (3-seed mean) Apr 18, 2026

@G3sparky G3sparky left a comment


Reviewed

@G3sparky
Author

Hi @cocohearts — this PR is ready for review. All Copilot feedback addressed (CUDA version fixed, 3-seed results added, torch.load safety noted). 3-seed mean: 1.1161 (std 0.0009). All artifacts under 16MB. Thanks!

@G3sparky
Author

Superseded by #1858 (Neural-Only val_bpb 1.0810, 3-seed mean — ties leaderboard leader). Closing as the canonical entry has moved on. Thanks!

@G3sparky G3sparky closed this Apr 29, 2026