
WIP: Sequential GPTQ with Groupwise Int6 — improved post-training quantization on SP4096 base #1664

Open
zoharb157 wants to merge 2 commits into openai:main from zoharb157:submission/sp4096-sequential-gptq-groupwise-int6

Conversation


zoharb157 commented Apr 16, 2026

Summary

Improve post-training quantization on PR #1218 base (SP4096, MLP 4×, WD 0.085, XSA-all, brotli). Three algorithmic improvements with zero training-time cost:

  • Sequential cross-layer GPTQ propagation: Quantize layers one at a time, inject quantized weights back into the model, then collect Hessians for later layers. Later layers' Hessians reflect actual quantized activations, capturing cross-layer error accumulation.
  • Groupwise int6 scales (group_size=128): Per-group fp16 scales instead of per-row, giving finer control over heterogeneous weight distributions. ~2% scale storage overhead for significant MSE reduction.
  • Hessian-weighted scale selection: Minimize sum(H_diag * (W-Q)^2) instead of MSE when selecting per-row clip percentiles, directly optimizing output reconstruction quality (a combined sketch of all three follows below).
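
For concreteness, here is a minimal sketch of how the three pieces could compose, assuming a toy nn.Sequential of nn.Linear layers and a diagonal Hessian approximation (the diagonal of A^T A over calibration activations). The names quantize_groupwise_int6, sequential_gptq, and calib_batches are illustrative only and are not the identifiers used in this PR's implementation.

```python
import torch
import torch.nn as nn

def quantize_groupwise_int6(W, H_diag, group_size=128,
                            clip_grid=(1.00, 0.99, 0.98, 0.97, 0.96, 0.95)):
    """Symmetric int6 quantization with per-(row, group) fp16 scales.

    For each group of `group_size` input columns, the clip fraction is chosen
    to minimize the Hessian-diagonal-weighted error sum(H_diag * (W - Q)^2)
    rather than plain MSE. Scale storage is one fp16 value per 128 int6
    weights, i.e. 16 / (128 * 6) ~= 2.1% overhead.
    """
    qmax = 31.0                                    # int6 range is [-32, 31]
    out_features, in_features = W.shape
    Q = torch.empty_like(W)
    scales = []
    for g0 in range(0, in_features, group_size):
        g1 = min(g0 + group_size, in_features)
        Wg, Hg = W[:, g0:g1], H_diag[g0:g1]
        absmax = Wg.abs().amax(dim=1, keepdim=True)            # per row, this group
        best = None
        for clip in clip_grid:
            s = (clip * absmax / qmax).half().float().clamp_min(1e-8)
            q = torch.clamp(torch.round(Wg / s), -qmax - 1, qmax) * s
            err = (Hg * (Wg - q) ** 2).sum()                    # Hessian-weighted, not MSE
            if best is None or err < best[0]:
                best = (err, q, s)
        Q[:, g0:g1] = best[1]
        scales.append(best[2].half())
    return Q, scales

@torch.no_grad()
def sequential_gptq(model: nn.Sequential, calib_batches):
    """Sequential cross-layer propagation: quantize layers front to back and
    write each quantized weight matrix back into the model *before* collecting
    the next layer's Hessian, so later Hessians see already-quantized
    activations and absorb the accumulated error."""
    layers = list(model)
    for i, layer in enumerate(layers):
        if not isinstance(layer, nn.Linear):
            continue
        H_diag = torch.zeros(layer.in_features)
        for x in calib_batches:
            a = x
            for prev in layers[:i]:                 # prefix is already quantized
                a = prev(a)
            H_diag += (a.reshape(-1, a.shape[-1]) ** 2).sum(dim=0)  # diag of A^T A
        Q, _scales = quantize_groupwise_int6(layer.weight.data, H_diag)
        layer.weight.data.copy_(Q)                  # inject before moving on
```

The real pipeline would use the full GPTQ column-by-column update with the complete Hessian; the diagonal-only version keeps the sketch short while still showing the sequential write-back and the Hessian-weighted clip search.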

Implementation is complete (280 lines changed). Requesting compute credits for 3-seed validation on 8×H100.

Expected −0.004 to −0.008 dBPB improvement from recovering quantization damage (pre-quant→post-quant gap is 0.012 BPB in baseline).

Test plan

  • Reproduce the #1218 baseline (Record: 4096-Vocab + 4.0-MLP-mult + 0.085-WD + Simplifications — val_bpb 1.09785, 3-seed mean) at 3 seeds (1337, 42, 2025)
  • Ablate: sequential propagation only (GPTQ_SEQUENTIAL=1 GPTQ_GROUP_SIZE=0)
  • Ablate: groupwise scales only (GPTQ_SEQUENTIAL=0 GPTQ_GROUP_SIZE=128)
  • Ablate: Hessian-weighted selection only (per-row mode)
  • Full stack: all three combined (default config)
  • Paired t-test across 3 seeds, p < 0.01, dBPB > 0.003 (see the check sketched after this list)
  • Verify artifact stays under 16MB cap with groupwise scale overhead
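
For the statistical acceptance criterion above, a check along these lines could be used; the function name acceptance_check and its exact signature are assumptions, only the thresholds come from the plan.

```python
import numpy as np
from scipy.stats import ttest_rel

def acceptance_check(baseline_bpb, candidate_bpb,
                     p_threshold=0.01, dbpb_threshold=0.003):
    """Paired (two-sided) t-test over seed-matched val_bpb values.

    Passes only if the difference is significant at p < 0.01 and the mean
    dBPB (baseline minus candidate) exceeds 0.003.
    """
    baseline = np.asarray(baseline_bpb, dtype=float)
    candidate = np.asarray(candidate_bpb, dtype=float)
    t_stat, p_value = ttest_rel(baseline, candidate)
    dbpb = float((baseline - candidate).mean())
    return (p_value < p_threshold) and (dbpb > dbpb_threshold), dbpb, p_value
```

This would be called once per ablation row with the three seed-matched val_bpb values (seeds 1337, 42, 2025) for baseline and candidate.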

Commits

Improve post-training quantization on PR openai#1218 base (SP4096, MLP 4x, WD 0.085).
Three changes: sequential cross-layer error propagation, groupwise int6 scales
(group_size=128), and Hessian-weighted scale selection. Expected -0.004 to -0.008
dBPB with zero training-time cost.

Made-with: Cursor
Three improvements to the post-training quantization pipeline on PR openai#1218:

1. Sequential cross-layer GPTQ: quantize layers one at a time, injecting
   quantized weights back before collecting later layers' Hessians. This
   propagates quantization error forward so later Hessians are accurate.

2. Groupwise int6 scales (group_size=128): per-group fp16 scales instead
   of per-row, giving finer control over weight variance within rows.

3. Hessian-weighted scale selection: minimize H_diag-weighted error instead
   of MSE when selecting per-row clip percentiles.

Zero training-time cost. Expected -0.004 to -0.008 dBPB.

Made-with: Cursor
