
Record candidate: PR #1855 + Adaptive Hessian-Sensitivity GPTQ Clip — val_bpb 1.06310 (3-seed mean)#1962

Open
chris-colinsky wants to merge 1 commit into openai:main from chris-colinsky:submission/2026-04-30_yolo_pr1855_adaptive_clip

Conversation

@chris-colinsky

Summary

Follow-up to PR #1689 by the same author. Ports adaptive Hessian-sensitivity GPTQ clipping (introduced in #1689 at 1.0822 BPB on the PR #1394 base) onto PR #1855's stack to test composability with LQER asymmetric quantization, the 9-hparam greedy stack, CaseOps, and the rest of the modern pipeline.

val_bpb: 1.06310 (3-seed mean, std 0.00102) | ~15.9 MB | 8×H100 SXM, 600 s wallclock | seeds 42, 1337, 999.

vs current SOTA (PR #1855, 1.06108): +0.00203 BPB / +0.00444 nats.

What's new

The technique replaces three hand-tuned per-group GPTQ clip sigmas (MLP_CLIP_SIGMAS=11.5, ATTN_CLIP_SIGMAS=13.0, MATRIX_CLIP_SIGMAS=12.85) with one automated per-tensor selection:

σ(name) = clamp(exp(-0.15 · log(H_diag(name).mean() · row_var(W(name))) + offset), 6.0, 24.0)

offset is determined by binary search such that the numel-weighted log-average of σ across all matrix tensors equals PR #1855's hand-tuned log-average — i.e. the overall compression budget is preserved exactly, only the per-tensor distribution shifts according to Hessian sensitivity. See README §"Adaptive Hessian-Sensitivity GPTQ Clipping" for the derivation, intuition, and the full per-tensor σ allocation table from seed 42.
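A minimal sketch of this selection rule, assuming per-tensor Hessian diagonals and weight row variances have already been collected (function and variable names here are illustrative, not taken from the PR's codebase):

```python
import numpy as np

def adaptive_clip_sigmas(h_diag_means, row_vars, numels, target_log_avg,
                         slope=-0.15, lo=6.0, hi=24.0):
    """Per-tensor GPTQ clip sigmas from Hessian sensitivity (sketch).

    h_diag_means / row_vars / numels: dicts keyed by tensor name.
    target_log_avg: the numel-weighted log-average of the hand-tuned sigmas,
    so the overall compression budget is preserved.
    """
    names = list(h_diag_means)
    sens = np.array([np.log(h_diag_means[n] * row_vars[n]) for n in names])
    w = np.array([numels[n] for n in names], dtype=float)
    w /= w.sum()

    def sigmas(offset):
        return np.clip(np.exp(slope * sens + offset), lo, hi)

    # Binary-search the offset: the numel-weighted log-average of sigma is
    # monotone nondecreasing in offset, so bisection converges.
    a, b = -10.0, 10.0
    for _ in range(60):
        mid = 0.5 * (a + b)
        if (w * np.log(sigmas(mid))).sum() < target_log_avg:
            a = mid
        else:
            b = mid
    off = 0.5 * (a + b)
    return dict(zip(names, sigmas(off))), off
```

With the budget preserved by construction, only the per-tensor distribution of clip widths shifts: more Hessian-sensitive tensors get tighter clips via the negative slope.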

Reproduces the hand-tuned result within ~2σ at +0.00203 BPB and eliminates 3 hyperparameters from the search space. TTT recovery dynamics match PR #1855's exactly (-0.01272 mean across 3 seeds in both submissions, identical to four decimals), so adaptive clip composes cleanly with phased TTT.

Mixed-precision ablation (gated, negative result)

A second technique — Hessian-sensitivity-driven mixed-precision GPTQ (bottom 25 % of tensors → int5, middle 50 % → int6, top 25 % → int7, 6.0 avg bits preserved) — was implemented in the same codebase under MIXED_PRECISION_HESSIAN=1. Single-seed test on PR #1855's stack: post-quant penalty +0.01310 vs PR #1855's +0.00858 (i.e. +0.0045 worse). The 16 int5 tensors lose more precision than the 16 int7 tensors gain on this heavily-tuned base. Disabled in this submission; included in the codebase as a reproducible negative result.
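The quartile bucketing behind the ablation can be sketched as follows (names are illustrative; in the PR's codebase the technique is gated behind MIXED_PRECISION_HESSIAN=1):

```python
def assign_mixed_precision(sensitivity):
    """Quartile bucketing by Hessian sensitivity (sketch of the gated ablation).

    sensitivity: dict of tensor name -> scalar score
    (e.g. H_diag.mean() * row_var, as in the adaptive clip rule).
    Least-sensitive quartile -> int5, middle half -> int6, most-sensitive
    quartile -> int7, so the average bit-width over tensors stays at 6.0.
    """
    names = sorted(sensitivity, key=sensitivity.get)  # ascending sensitivity
    n = len(names)
    q1, q3 = n // 4, n - n // 4
    bits = {}
    for i, name in enumerate(names):
        bits[name] = 5 if i < q1 else (7 if i >= q3 else 6)
    return bits
```

For the 64 matrix tensors in the stack this yields the 16/32/16 int5/int6/int7 split described above; the negative result is that the 16 int5 tensors lose more than the 16 int7 tensors gain.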

Per-seed results

| Seed | Steps | Pre-quant | Post-quant | Post-TTT | Artifact bytes | Eval time |
|------|-------|-----------|------------|----------|----------------|-----------|
| 42   | 4 835 | 1.06498   | 1.07480    | 1.06214  | 15 905 000     | 592.9 s   |
| 1337 | 4 807 | 1.06711   | 1.07696    | 1.06417  | 15 918 827     | 521.1 s   |
| 999  | 4 805 | 1.06611   | 1.07570    | 1.06300  | 15 901 152     | 456.5 s   |

3-seed std: 0.00102 BPB / 0.00224 nats. All artifacts under 16 MB; all runs hit the 600 s wallclock cap.
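The headline mean and std can be checked directly from the Post-TTT column:

```python
import statistics

# Post-TTT val_bpb per seed, from the per-seed results table.
post_ttt = {42: 1.06214, 1337: 1.06417, 999: 1.06300}
mean = statistics.mean(post_ttt.values())
std = statistics.stdev(post_ttt.values())  # sample std over the 3 seeds
print(round(mean, 5), round(std, 5))  # → 1.0631 0.00102
```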

Compliance (track_10min_16mb)

  • ✅ Training under 600 s (596 s mean, hit wallclock cap)
  • ✅ Artifact under 16 000 000 bytes (15.90–15.92 MB)
  • ✅ Eval under 600 s (456–593 s)
  • ✅ Score-first phased TTT (no pre-quant TTT)
  • ✅ No SLOT / no ETLB / no n-gram cache
  • ✅ 3 seeds (42, 1337, 999)

Reproduction

See records/track_10min_16mb/2026-04-30_AdaptiveHessianClip_PR1855_1.0631/README.md §"Reproduction". Same RunPod container as PR #1855 (runpod/parameter-golf:latest), CaseOps dataset from romeerp/parameter-golf-caseops-v1 on HF Hub, FA3 wheel from windreamer.github.io/flash-attention3-wheels/cu128_torch291/, lrzip via apt. Two new env vars on top of PR #1855's recipe: ADAPTIVE_HESSIAN_CLIP=1 and TTT_LORA_RANK=56.
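A minimal sketch of the delta on top of PR #1855's recipe; the entry-point script name below is a placeholder, not from this PR (the actual command is in the referenced README):

```shell
# Only these two env vars are new relative to PR #1855's recipe.
export ADAPTIVE_HESSIAN_CLIP=1   # enable per-tensor Hessian-sensitivity clip
export TTT_LORA_RANK=56          # rank tweak ported from open PR #1935

# bash run_track_10min_16mb.sh --seed 42   # placeholder invocation
```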

Test plan

  • 3-seed run completed on 8×H100 SXM (logs in train_seed{42,1337,999}.log)
  • All artifacts under 16 000 000 bytes
  • Local smoke test of adaptive clip math on synthetic tensors (math verified independent of CUDA stack)
  • Mixed-precision ablation single-seed test (negative result documented)
  • CaseOps dataset confirmed available on HF Hub
  • Reviewer 3-seed reproduction (pending)

Credits

Direct ancestor stack: PR #1855 → #1851 → #1797 → #1787 → #1736 → #1729 → #1626 → #1530 → #1493 → #1394. Adaptive Hessian-sensitivity GPTQ clipping is from PR #1689 by this author; TTT_LORA_RANK=56 tweak from open PR #1935 by @vimeto. Full lineage with descriptions in README "Credits" section.

Commit message

3-seed mean: 1.06310 BPB (std 0.00102) on 8xH100 SXM, 600s.

Follow-up to PR openai#1689 by the same author. Ports the adaptive
Hessian-sensitivity GPTQ clipping technique introduced in openai#1689
(at 1.0822 BPB) onto PR openai#1855's stack (current SOTA, 1.06108)
to test composability with LQER asymmetric quantization, the
9-hparam greedy stack, and the rest of the modern pipeline.

The technique replaces three hand-tuned per-group clip sigmas
(MLP_CLIP_SIGMAS, ATTN_CLIP_SIGMAS, MATRIX_CLIP_SIGMAS) with one
automated per-tensor selection from H_diag.mean()*row_var, with a
binary-search offset that preserves PR openai#1855's numel-weighted
log-average compression budget. Reproduces hand-tuned result
within ~2sigma at +0.00203 BPB; eliminates 3 hyperparameters
from the search space.

A second technique (Hessian-sensitivity-driven mixed-precision
GPTQ, 25/50/25 int5/int6/int7) was implemented in the same
codebase under MIXED_PRECISION_HESSIAN env var and is documented
as a negative result (+0.0045 quant penalty vs all-int6 + LQER).
