Record: XSA-all + GPTQ + FA3 dtype fix — val_bpb 1.1161 (3-seed mean) #1494
G3sparky wants to merge 3 commits into openai:main from
Conversation
XSA on all 11 layers + full-Hessian GPTQ + BigramHash 3072x112 + EMA. Key contribution: an FA3 dtype compatibility wrapper enabling Hopper attention kernels on PyTorch 2.5.1, which lacks auto-casting for FA3 calls. 8xH100 SXM, 600s, 6244 steps @ 96ms/step.
Pull request overview
Adds a new 10min/16MB record submission folder for “XSA-all + GPTQ + FA3 dtype fix”, including the exact training log, the full training/quantization script used to produce the artifact, and accompanying submission metadata/README for reproducibility.
Changes:
- Introduces an FA3 dtype compatibility wrapper to enable Flash Attention 3 Hopper kernels under PyTorch 2.5.1.
- Includes the full training + full-Hessian GPTQ (AR self-gen calibration) pipeline script used for the run.
- Adds record metadata (submission.json) and documentation (README) alongside the captured train.log.
Reviewed changes
Copilot reviewed 2 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| records/track_10min_16mb/2026-04-09_XSA_GPTQ_FA3_dtype_fix/train.log | Captured stdout/logs for the submitted run (steps, timing, eval metrics). |
| records/track_10min_16mb/2026-04-09_XSA_GPTQ_FA3_dtype_fix/train_gpt.py | Self-contained training + GPTQ quantization script; includes FA3 dtype wrapper and record pipeline. |
| records/track_10min_16mb/2026-04-09_XSA_GPTQ_FA3_dtype_fix/submission.json | Submission metadata for leaderboard/record tracking. |
| records/track_10min_16mb/2026-04-09_XSA_GPTQ_FA3_dtype_fix/README.md | Human-readable summary of results, architecture, and how to run. |
```json
  },
  "hardware": "8xH100 80GB SXM (Vast.ai)",
  "pytorch_version": "2.5.1+cu124",
  "cuda_version": "13.0",
```
cuda_version doesn’t match pytorch_version: PyTorch 2.5.1+cu124 is built against CUDA 12.4, so the JSON should likely say "cuda_version": "12.4" (or update the PyTorch build string if you truly used CUDA 13.x).
| "cuda_version": "13.0", | |
| "cuda_version": "12.4", |
Good catch — the JSON should say cuda_version 12.4 to match cu124. Will fix in the next push.
```markdown
## Hardware

- 8× NVIDIA H100 80GB HBM3 SXM (Vast.ai, Nebraska)
- PyTorch 2.5.1+cu124, CUDA 13.0
```
The hardware bullet lists PyTorch 2.5.1+cu124, CUDA 13.0, which is internally inconsistent (cu124 corresponds to CUDA 12.4). Update the CUDA version here (or the PyTorch build string) to keep the record metadata accurate.
```suggestion
- PyTorch 2.5.1+cu124, CUDA 12.4
```
Agreed, will correct to CUDA 12.4 to match the PyTorch build tag.
```python
hessians[param_name] = torch.zeros(cols, cols, dtype=torch.float32, device='cpu')

def make_hook(pname):
    def hook_fn(module, input, output):
        x = input[0].detach().float()
        if x.ndim == 3:
            x = x.reshape(-1, x.shape[-1])
        hessians[pname] += (x.T @ x).cpu()
    return hook_fn
```
Hessian accumulation is doing a GPU matmul and then .cpu() inside every forward hook (hessians[pname] += (x.T @ x).cpu()), which forces repeated device syncs and transfers. Consider accumulating the Hessians on GPU (or in a GPU buffer) and moving them to CPU once at the end to reduce overhead during calibration.
Acknowledged. The Hessian accumulation on GPU with final .cpu() transfer is inherited from the upstream baseline. In practice the GPTQ calibration phase is < 12s on 8xH100 so the sync overhead is negligible, but keeping the Hessians on GPU until the end and doing a single bulk transfer would be cleaner. Will consider for the next iteration.
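For reference, a minimal sketch of the suggested change (illustrative only, not the submission's code; it reuses the names from the excerpt above): accumulate each Hessian on the GPU during the calibration forward passes and do a single bulk transfer at the end.

```python
# Sketch: keep Hessians on-device during calibration; one bulk copy at the end.
hessians[param_name] = torch.zeros(cols, cols, dtype=torch.float32, device='cuda')

def make_hook(pname):
    def hook_fn(module, input, output):
        x = input[0].detach().float()
        if x.ndim == 3:
            x = x.reshape(-1, x.shape[-1])
        hessians[pname] += x.T @ x  # stays on GPU: no per-hook sync/transfer
    return hook_fn

# ... after all calibration batches have run:
hessians = {name: h.cpu() for name, h in hessians.items()}
```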
```python
quant_blob_disk = f.read()
quant_state = torch.load(
    io.BytesIO(lzma.decompress(quant_blob_disk)),
    map_location="cpu",
```
torch.load(...) is invoked with the default weights_only=False, which triggers PyTorch’s security FutureWarning, and the default is slated to change in a future release. If the serialized object is compatible, pass weights_only=True; otherwise consider explicitly allowlisting needed types via torch.serialization.add_safe_globals or refactoring the saved format to be weights-only.
```suggestion
    map_location="cpu",
    weights_only=True,
```
Fair point. Adding weights_only=True where applicable. The deserialization path only loads tensor data so weights_only=True is safe and removes the FutureWarning.
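As a sketch of the two options the review mentions (illustrative; quant_blob_disk is the variable from the excerpt above, and the allowlisted type name is hypothetical; add_safe_globals requires PyTorch ≥ 2.4):

```python
import io
import lzma
import torch

# Option 1: restrict unpickling to tensors and plain containers.
quant_state = torch.load(
    io.BytesIO(lzma.decompress(quant_blob_disk)),
    map_location="cpu",
    weights_only=True,
)

# Option 2, if the blob pickles extra types: allowlist them explicitly, e.g.
#   torch.serialization.add_safe_globals([QuantConfig])  # hypothetical type
# before calling torch.load(..., weights_only=True).
```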
Community Review — Record: XSA-all + GPTQ + FA3 dtype fix (val_bpb: 1.1220)

BPB: 1.1220 | Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache.

What I found in the code (head SHA): static review found no TTT adaptation function, no SLOT optimization loop, no n-gram-cache class, and no pre-quant val-token fine-tune. The eval path uses the standard sliding-window stride-64 pattern. The submission is a pure-neural architecture iteration on the standard SP1024/SP4096/SP8192 baseline.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.06s, dim=512, layers=11, vocab=1024, code=102083 B, SMOKE_TEST_PASS.

Verdict: LOOKS CLEAN. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the classification pass — this looks like a clean pure-neural iteration on the standard baseline.

Auto-classification caveat: this review was drafted by the deterministic AST-based classifier. If there is a non-standard eval mechanism (logit postprocessing, hedge mixing, etc.) that I missed because it is factored into a helper file or hidden behind a non-standard function name, please flag it and I'll re-run the audit manually.

Reviewed by @MatoTeziTanka — The Agora.
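For readers unfamiliar with the eval pattern named above, here is a minimal sketch of stride-64 sliding-window bits-per-byte scoring (illustrative only, not the submission's eval code; model, tokens, and n_bytes are assumed inputs, and model is assumed to return next-token logits):

```python
import math

import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_bpb(model, tokens, n_bytes, window=1024, stride=64):
    """tokens: 1-D LongTensor of token ids; n_bytes: byte length of the text."""
    nll_sum, prev_end = 0.0, 0
    for begin in range(0, tokens.numel() - 1, stride):
        end = min(begin + window, tokens.numel() - 1)
        ids = tokens[begin:end + 1].unsqueeze(0)   # window plus one shifted target
        logits = model(ids[:, :-1])                # assumed to return (1, T, V)
        nll = F.cross_entropy(logits[0], ids[0, 1:], reduction="none")
        n_new = end - max(prev_end, begin)         # targets not scored by an earlier window
        nll_sum += nll[-n_new:].sum().item()       # score only the new tail
        prev_end = end
        if end == tokens.numel() - 1:
            break
    return nll_sum / (n_bytes * math.log(2))       # total nats -> bits per byte
```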
- Added seeds 42, 314, 999 (replaced single seed 1337)
- Mean val_bpb 1.1161 (std 0.0009), all artifacts under 16MB
- Fixed author: Gavin Saunders (@G3sparky)
- Updated PyTorch/CUDA versions to match actual run environment
- Verified by Lauren (team QA)
Hi @cocohearts — this PR is ready for review. All Copilot feedback addressed (CUDA version fixed, 3-seed results added, torch.load safety noted). 3-seed mean: 1.1161 (std 0.0009). All artifacts under 16MB. Thanks!
Superseded by #1858 (Neural-Only val_bpb 1.0810, 3-seed mean — ties leaderboard leader). Closing as the canonical entry has moved on. Thanks!
Record Submission
val_bpb: 1.1220 (sliding window, stride=64) | ~15.9 MB artifact | 8×H100 SXM, 600s
Key Contribution
FA3 dtype compatibility wrapper — enables Flash Attention 3 Hopper kernels on PyTorch 2.5.1, which lacks auto-casting for FA3 calls. A simple five-line wrapper that casts inputs to bf16 when needed.
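A minimal sketch of the idea (the actual wrapper lives in train_gpt.py and may differ; this assumes the Hopper FA3 build exposing flash_attn_interface.flash_attn_func):

```python
import torch
from flash_attn_interface import flash_attn_func  # FA3 Hopper build

def fa3_compat(q, k, v, causal=True):
    # FA3 kernels accept fp16/bf16 inputs only, and PyTorch 2.5.1 does not
    # auto-cast these calls, so cast fp32 inputs down explicitly.
    orig_dtype = q.dtype
    if orig_dtype not in (torch.float16, torch.bfloat16):
        q, k, v = (t.to(torch.bfloat16) for t in (q, k, v))
    out = flash_attn_func(q, k, v, causal=causal)
    if isinstance(out, tuple):  # some FA3 builds return (out, softmax_lse)
        out = out[0]
    return out.to(orig_dtype)
```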
Architecture
Built on the PR #1019 stack:
Results
8×H100 80GB SXM (Vast.ai), PyTorch 2.5.1+cu124, Flash Attention 3 compiled from source.
Authors: Gavin Saunders & Tron