Record candidate: PR #1855 + Adaptive Hessian-Sensitivity GPTQ Clip, val_bpb 1.06310 (3-seed mean) #1962
Open
chris-colinsky wants to merge 1 commit into openai:main
3-seed mean: 1.06310 BPB (std 0.00102) on 8xH100 SXM, 600s. Follow-up to PR openai#1689 by the same author. Ports the adaptive Hessian-sensitivity GPTQ clipping technique introduced in openai#1689 (at 1.0822 BPB) onto PR openai#1855's stack (current SOTA, 1.06108) to test composability with LQER asymmetric quantization, the 9-hparam greedy stack, and the rest of the modern pipeline. The technique replaces three hand-tuned per-group clip sigmas (MLP_CLIP_SIGMAS, ATTN_CLIP_SIGMAS, MATRIX_CLIP_SIGMAS) with one automated per-tensor selection from H_diag.mean()*row_var, with a binary-search offset that preserves PR openai#1855's numel-weighted log-average compression budget. Reproduces hand-tuned result within ~2sigma at +0.00203 BPB; eliminates 3 hyperparameters from the search space. A second technique (Hessian-sensitivity-driven mixed-precision GPTQ, 25/50/25 int5/int6/int7) was implemented in the same codebase under MIXED_PRECISION_HESSIAN env var and is documented as a negative result (+0.0045 quant penalty vs all-int6 + LQER).
Summary
Follow-up to PR #1689 by the same author. Ports adaptive Hessian-sensitivity GPTQ clipping (introduced in #1689 at 1.0822 BPB on the PR #1394 base) onto PR #1855's stack to test composability with LQER asymmetric quantization, the 9-hparam greedy stack, CaseOps, and the rest of the modern pipeline.
val_bpb: 1.06310 (3-seed mean, std 0.00102) | ~15.9 MB | 8×H100 SXM, 600 s wallclock | seeds 42, 1337, 999.
vs current SOTA (PR #1855, 1.06108): +0.00203 BPB / +0.00444 nats.
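As a rough sanity check, the gap to PR #1855 can be compared with the 3-seed standard deviation quoted above. This sketch uses only the rounded figures published in this description, so the gap comes out as 0.00202 rather than the +0.00203 the author reports (presumably from unrounded means).

```python
# Figures copied from this PR description (rounded as published).
mean_bpb = 1.06310   # this submission, 3-seed mean
sota_bpb = 1.06108   # PR #1855
seed_std = 0.00102   # 3-seed std of this submission

gap = mean_bpb - sota_bpb
gap_in_std = gap / seed_std
print(f"gap = {gap:+.5f} BPB, about {gap_in_std:.1f} seed standard deviations")
```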
What's new
The technique replaces three hand-tuned per-group GPTQ clip sigmas (MLP_CLIP_SIGMAS=11.5, ATTN_CLIP_SIGMAS=13.0, MATRIX_CLIP_SIGMAS=12.85) with one automated per-tensor selection driven by the Hessian-sensitivity score H_diag.mean()*row_var. The offset is determined by binary search such that the numel-weighted log-average of σ across all matrix tensors equals PR #1855's hand-tuned log-average. The overall compression budget is therefore preserved exactly; only the per-tensor distribution shifts according to Hessian sensitivity. See README §"Adaptive Hessian-Sensitivity GPTQ Clipping" for the derivation, intuition, and the full per-tensor σ allocation table from seed 42.
Reproduces the hand-tuned result within ~2σ at +0.00203 BPB. Eliminates 3 hyperparameters from the search space. TTT recovery dynamics match PR #1855's exactly (-0.01272 mean across 3 seeds in both submissions, four decimals identical), so adaptive clip composes cleanly with phased TTT.
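The budget-preserving binary search described above can be sketched as follows. This is a minimal illustration, not the PR's implementation: the log-linear mapping from sensitivity to σ, the `slope`, the clamp bounds, the assumption that more sensitive tensors receive a smaller σ, and all tensor sizes are hypothetical. The README remains the authority on the actual functional form.

```python
import math

def log_average(sigmas, numels):
    """Numel-weighted mean of log(sigma)."""
    total = sum(numels)
    return sum(n * math.log(s) for s, n in zip(sigmas, numels)) / total

def adaptive_sigmas(sens, numels, target, slope=0.1, lo_sigma=8.0, hi_sigma=18.0):
    """Pick one sigma per tensor so the weighted log-average hits `target`.

    `sens` holds one positive sensitivity score per tensor, e.g. the
    H_diag.mean()*row_var score. The mapping below is an assumed form.
    """
    if not math.log(lo_sigma) <= target <= math.log(hi_sigma):
        raise ValueError("target log-average is outside the clamp range")
    logs = [math.log(s) for s in sens]
    mean_log = sum(logs) / len(logs)
    z = [l - mean_log for l in logs]  # centred log-sensitivity

    def sigmas_for(offset):
        # Assumed direction: higher sensitivity -> smaller sigma.
        return [min(hi_sigma, max(lo_sigma, math.exp(offset - slope * zi)))
                for zi in z]

    # Bracket chosen so every sigma is clamped low at `lo` and high at `hi`.
    span = slope * max(abs(zi) for zi in z)
    lo, hi = math.log(lo_sigma) - span, math.log(hi_sigma) + span
    for _ in range(100):  # log-average is monotone in offset, so bisect
        mid = 0.5 * (lo + hi)
        if log_average(sigmas_for(mid), numels) < target:
            lo = mid
        else:
            hi = mid
    return sigmas_for(0.5 * (lo + hi))

# Target: log-average of the three hand-tuned sigmas, with hypothetical group sizes.
target = log_average([11.5, 13.0, 12.85], [4000, 2000, 2000])
sens = [4.0, 1.0, 0.25, 1.0]          # hypothetical per-tensor scores
numels = [1000, 2000, 1000, 4000]     # hypothetical tensor sizes
sigmas = adaptive_sigmas(sens, numels, target)
```

Because of the clamp, the offset has no closed form in general, which is one reason a binary search is a natural fit; without clamping, the offset could be solved for directly.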
Mixed-precision ablation (gated, negative result)
A second technique, Hessian-sensitivity-driven mixed-precision GPTQ (bottom 25 % of tensors → int5, middle 50 % → int6, top 25 % → int7, 6.0 avg bits preserved), was implemented in the same codebase under MIXED_PRECISION_HESSIAN=1. Single-seed test on PR #1855's stack: post-quant penalty +0.01310 vs PR #1855's +0.00858 (i.e. +0.0045 worse). The 16 int5 tensors lose more precision than the 16 int7 tensors gain on this heavily-tuned base. Disabled in this submission; included in the codebase as a reproducible negative result.
Per-seed results
3-seed std: 0.00102 BPB / 0.00224 nats. All artifacts under 16 MB; all runs hit the 600 s wallclock cap.
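For reference, the 25/50/25 allocation from the mixed-precision ablation can be sketched as a simple rank-based assignment. This is an illustrative sketch only: the function name and the scores are hypothetical, "bottom" is assumed to mean least sensitive, and the count of 64 tensors is inferred from the 16 int5 and 16 int7 tensors mentioned in the ablation. Whether the author's 6.0-bit average is count-weighted or numel-weighted is not stated in this description; the sketch uses a per-tensor count.

```python
def assign_bits(sensitivities):
    """Bottom 25% by sensitivity -> 5 bits, middle 50% -> 6, top 25% -> 7."""
    n = len(sensitivities)
    order = sorted(range(n), key=lambda i: sensitivities[i])  # least sensitive first
    quarter = n // 4
    bits = [6] * n
    for i in order[:quarter]:
        bits[i] = 5
    for i in order[n - quarter:]:
        bits[i] = 7
    return bits

# Hypothetical, distinct scores for 64 tensors.
scores = [((i * 37) % 64 + 1) / 64 for i in range(64)]
bits = assign_bits(scores)
avg_bits = sum(bits) / len(bits)
```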
Compliance (track_10min_16mb)
Reproduction
See records/track_10min_16mb/2026-04-30_AdaptiveHessianClip_PR1855_1.0631/README.md §"Reproduction". Same RunPod container as PR #1855 (runpod/parameter-golf:latest), CaseOps dataset from romeerp/parameter-golf-caseops-v1 on HF Hub, FA3 wheel from windreamer.github.io/flash-attention3-wheels/cu128_torch291/, lrzip via apt. Two new env vars on top of PR #1855's recipe: ADAPTIVE_HESSIAN_CLIP=1 and TTT_LORA_RANK=56.
Test plan
train_seed{42,1337,999}.log)
Credits
Direct ancestor stack: PR #1855 → #1851 → #1797 → #1787 → #1736 → #1729 → #1626 → #1530 → #1493 → #1394. Adaptive Hessian-sensitivity GPTQ clipping is from PR #1689 by this author; the TTT_LORA_RANK=56 tweak is from open PR #1935 by @vimeto. Full lineage with descriptions in the README "Credits" section.