
Record candidate: PR #1855 + Adaptive Hessian-Sensitivity GPTQ Clip — val_bpb 1.06310 (3-seed mean)#1962

Open
chris-colinsky wants to merge 1 commit into openai:main from chris-colinsky:submission/2026-04-30_yolo_pr1855_adaptive_clip

Conversation

@chris-colinsky

Summary

Follow-up to PR #1689 by the same author. Ports adaptive Hessian-sensitivity GPTQ clipping (introduced in #1689 at 1.0822 BPB on the PR #1394 base) onto PR #1855's stack to test composability with LQER asymmetric quantization, the 9-hparam greedy stack, CaseOps, and the rest of the modern pipeline.

val_bpb: 1.06310 (3-seed mean, std 0.00102) | ~15.9 MB | 8×H100 SXM, 600 s wallclock | seeds 42, 1337, 999.

vs current SOTA (PR #1855, 1.06108): +0.00203 BPB / +0.00444 nats.

What's new

The technique replaces three hand-tuned per-group GPTQ clip sigmas (MLP_CLIP_SIGMAS=11.5, ATTN_CLIP_SIGMAS=13.0, MATRIX_CLIP_SIGMAS=12.85) with one automated per-tensor selection:

σ(name) = clamp(exp(-0.15 · log(H_diag(name).mean() · row_var(W(name))) + offset), 6.0, 24.0)

offset is determined by binary search such that the numel-weighted log-average of σ across all matrix tensors equals PR #1855's hand-tuned log-average — i.e. the overall compression budget is preserved exactly, only the per-tensor distribution shifts according to Hessian sensitivity. See README §"Adaptive Hessian-Sensitivity GPTQ Clipping" for the derivation, intuition, and the full per-tensor σ allocation table from seed 42.
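A minimal sketch of this selection rule, assuming per-tensor Hessian diagonals and weight row variances have already been collected (function and variable names here are illustrative, not taken from the PR's codebase):

```python
import numpy as np

def adaptive_clip_sigmas(h_diag_means, row_vars, numels, target_log_avg,
                         slope=-0.15, lo=6.0, hi=24.0):
    """Per-tensor GPTQ clip sigmas from Hessian sensitivity (sketch).

    h_diag_means / row_vars / numels: dicts keyed by tensor name.
    target_log_avg: the numel-weighted log-average of the hand-tuned sigmas,
    so the overall compression budget is preserved.
    """
    names = list(h_diag_means)
    sens = np.array([np.log(h_diag_means[n] * row_vars[n]) for n in names])
    w = np.array([numels[n] for n in names], dtype=float)
    w /= w.sum()

    def sigmas(offset):
        return np.clip(np.exp(slope * sens + offset), lo, hi)

    # Binary-search the offset: the numel-weighted log-average of sigma is
    # monotone nondecreasing in offset, so bisection converges.
    a, b = -10.0, 10.0
    for _ in range(60):
        mid = 0.5 * (a + b)
        if (w * np.log(sigmas(mid))).sum() < target_log_avg:
            a = mid
        else:
            b = mid
    off = 0.5 * (a + b)
    return dict(zip(names, sigmas(off))), off
```

With the budget preserved by construction, only the per-tensor distribution of clip widths shifts: more Hessian-sensitive tensors get tighter clips via the negative slope.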

Reproduces the hand-tuned result within ~2σ at +0.00203 BPB and eliminates 3 hyperparameters from the search space. TTT recovery dynamics match PR #1855's exactly (-0.01272 mean across 3 seeds in both submissions, identical to four decimals), so adaptive clip composes cleanly with phased TTT.

Mixed-precision ablation (gated, negative result)

A second technique — Hessian-sensitivity-driven mixed-precision GPTQ (bottom 25 % of tensors → int5, middle 50 % → int6, top 25 % → int7, 6.0 avg bits preserved) — was implemented in the same codebase under MIXED_PRECISION_HESSIAN=1. Single-seed test on PR #1855's stack: post-quant penalty +0.01310 vs PR #1855's +0.00858 (i.e. +0.0045 worse). The 16 int5 tensors lose more precision than the 16 int7 tensors gain on this heavily-tuned base. Disabled in this submission; included in the codebase as a reproducible negative result.
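The quartile bucketing behind the ablation can be sketched as follows (names are illustrative; in the PR's codebase the technique is gated behind MIXED_PRECISION_HESSIAN=1):

```python
def assign_mixed_precision(sensitivity):
    """Quartile bucketing by Hessian sensitivity (sketch of the gated ablation).

    sensitivity: dict of tensor name -> scalar score
    (e.g. H_diag.mean() * row_var, as in the adaptive clip rule).
    Least-sensitive quartile -> int5, middle half -> int6, most-sensitive
    quartile -> int7, so the average bit-width over tensors stays at 6.0.
    """
    names = sorted(sensitivity, key=sensitivity.get)  # ascending sensitivity
    n = len(names)
    q1, q3 = n // 4, n - n // 4
    bits = {}
    for i, name in enumerate(names):
        bits[name] = 5 if i < q1 else (7 if i >= q3 else 6)
    return bits
```

For the 64 matrix tensors in the stack this yields the 16/32/16 int5/int6/int7 split described above; the negative result is that the 16 int5 tensors lose more than the 16 int7 tensors gain.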

Per-seed results

| Seed | Steps | Pre-quant | Post-quant | Post-TTT | Artifact bytes | Eval time |
|------|-------|-----------|------------|----------|----------------|-----------|
| 42   | 4 835 | 1.06498   | 1.07480    | 1.06214  | 15 905 000     | 592.9 s   |
| 1337 | 4 807 | 1.06711   | 1.07696    | 1.06417  | 15 918 827     | 521.1 s   |
| 999  | 4 805 | 1.06611   | 1.07570    | 1.06300  | 15 901 152     | 456.5 s   |

3-seed std: 0.00102 BPB / 0.00224 nats. All artifacts under 16 MB; all runs hit the 600 s wallclock cap.
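The headline mean and std can be checked directly from the Post-TTT column:

```python
import statistics

# Post-TTT val_bpb per seed, from the per-seed results table.
post_ttt = {42: 1.06214, 1337: 1.06417, 999: 1.06300}
mean = statistics.mean(post_ttt.values())
std = statistics.stdev(post_ttt.values())  # sample std over the 3 seeds
print(round(mean, 5), round(std, 5))  # → 1.0631 0.00102
```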

Compliance (track_10min_16mb)

  • ✅ Training under 600 s (596 s mean, hit wallclock cap)
  • ✅ Artifact under 16 000 000 bytes (15.90–15.92 MB)
  • ✅ Eval under 600 s (456–593 s)
  • ✅ Score-first phased TTT (no pre-quant TTT)
  • ✅ No SLOT / no ETLB / no n-gram cache
  • ✅ 3 seeds (42, 1337, 999)

Reproduction

See records/track_10min_16mb/2026-04-30_AdaptiveHessianClip_PR1855_1.0631/README.md §"Reproduction". Same RunPod container as PR #1855 (runpod/parameter-golf:latest), CaseOps dataset from romeerp/parameter-golf-caseops-v1 on HF Hub, FA3 wheel from windreamer.github.io/flash-attention3-wheels/cu128_torch291/, lrzip via apt. Two new env vars on top of PR #1855's recipe: ADAPTIVE_HESSIAN_CLIP=1 and TTT_LORA_RANK=56.
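A minimal sketch of the delta on top of PR #1855's recipe; the entry-point script name below is a placeholder, not from this PR (the actual command is in the referenced README):

```shell
# Only these two env vars are new relative to PR #1855's recipe.
export ADAPTIVE_HESSIAN_CLIP=1   # enable per-tensor Hessian-sensitivity clip
export TTT_LORA_RANK=56          # rank tweak ported from open PR #1935

# bash run_track_10min_16mb.sh --seed 42   # placeholder invocation
```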

Test plan

  • 3-seed run completed on 8×H100 SXM (logs in train_seed{42,1337,999}.log)
  • All artifacts under 16 000 000 bytes
  • Local smoke test of adaptive clip math on synthetic tensors (math verified independent of CUDA stack)
  • Mixed-precision ablation single-seed test (negative result documented)
  • CaseOps dataset confirmed available on HF Hub
  • Reviewer 3-seed reproduction (pending)

Credits

Direct ancestor stack: PR #1855 → #1851 → #1797 → #1787 → #1736 → #1729 → #1626 → #1530 → #1493 → #1394. Adaptive Hessian-sensitivity GPTQ clipping is from PR #1689 by this author; TTT_LORA_RANK=56 tweak from open PR #1935 by @vimeto. Full lineage with descriptions in README "Credits" section.

Commit message

3-seed mean: 1.06310 BPB (std 0.00102) on 8xH100 SXM, 600s.

Follow-up to PR openai#1689 by the same author. Ports the adaptive
Hessian-sensitivity GPTQ clipping technique introduced in openai#1689
(at 1.0822 BPB) onto PR openai#1855's stack (current SOTA, 1.06108)
to test composability with LQER asymmetric quantization, the
9-hparam greedy stack, and the rest of the modern pipeline.

The technique replaces three hand-tuned per-group clip sigmas
(MLP_CLIP_SIGMAS, ATTN_CLIP_SIGMAS, MATRIX_CLIP_SIGMAS) with one
automated per-tensor selection from H_diag.mean()*row_var, with a
binary-search offset that preserves PR openai#1855's numel-weighted
log-average compression budget. Reproduces hand-tuned result
within ~2sigma at +0.00203 BPB; eliminates 3 hyperparameters
from the search space.

A second technique (Hessian-sensitivity-driven mixed-precision
GPTQ, 25/50/25 int5/int6/int7) was implemented in the same
codebase under MIXED_PRECISION_HESSIAN env var and is documented
as a negative result (+0.0045 quant penalty vs all-int6 + LQER).
