Non-record: Post-Quantization Damage Gap — 11L GPTQ Int6 + Curriculum + Sliding TTT (3-seed, 8xH100) by taka6745 · Pull Request #1818 · openai/parameter-golf

taka6745 · 2026-04-25T16:57:59Z

Track

track_non_record_16mb. Negative result, research contribution.

Headline

11L / 512d GQA transformer in 600 s on 8×H100 reaches pre-quant val_bpb 1.1009 (better than typical) but GPTQ catastrophically damages it: post-quant 3.4620 (+2.36 BPB), TTT recovers to 2.7663. Reproducible across 3 seeds (paired t ≈ 131, p < 0.001). The post-quantization damage gap is the contribution.

Three novel techniques shipped (full algorithms in README)

Entropy-bucket curriculum sampler. Wallclock-driven easy-to-hard crossfade with floor weight, pre-bucketed manifest.
Freeze-dry. 2-neighbor LSQ linear-reconstruction storage filter; drop weights well-predicted by their row neighbors.
2:4 sparsity packing. Storage-only adaptation: 3-bit values + 4-bit position-pair codes = 10 bits / 4-block (vs int6's 24).

3-seed results

	Seed 42	Seed 1337	Seed 2024	Mean	σ
pre-quant	1.1002	1.1022	1.1003	1.1009	0.0011
post-quant	3.4743	3.4422	3.4696	3.4620	0.0173
post-TTT	2.7285	2.7964	2.7741	2.7663	0.0346
artifact bytes	15,720,987	15,652,160	15,715,938	15,696,362	38,324

Eight of our world-novel late-stage follow-ups

All target the post-quant damage gap from one of four angles: softer training minimum, smarter quant grid, post-quant rescue, bigger eval-time predictor. Nothing here is a port from another competitor PR.

A. Progressive Depth-Grown Training. 3 → 6 → 11 layers with identity-init transitions; shorter full-depth window = softer minimum that survives quant. Code-complete, CPU smoke-tested.
B. Post-Quantization Calibration Loop. 5 iterations fitting LayerNorm scales/shifts + biases against the fp32 activation distribution after GPTQ. Designed. Projected -0.005 to -0.020 BPB.
C. Hard-Batch Replay During Training. Every 1000 steps replay the 100 highest-loss batches at 2× LR. Single-run hit pre-quant 1.2536 (below baseline). Post-quant never measured.
D. Vernier-Ladder Quantization. Two offset int6 grids + 1 sign bit per weight = int7-effective precision at int6 storage cost. Designed. Projected -0.003 to -0.015 BPB.
E. Lossy-to-Lossless Correction Cascade. Push tolerant MLP layers to int3/int4 + ship a tiny per-layer fp32 bias-correction vector. JPEG-style. Designed. Projected -0.003 to -0.015 BPB.
F. Kombucha Compressibility Objective. Last 60 s of training switches loss to CE + λ · L_compress to bias weights toward the int6 grid before GPTQ runs. Designed. Projected -0.002 to -0.012 BPB.
G. Bezier Control-Point Weight Factorization. Reconstruct 36M weights from ≤10K Bezier anchors at load time; bypasses GPTQ entirely. Designed. Projected -0.005 to -0.030 BPB.
H. Online N-Gram Cache (moonshot). Causal n-gram cache built from already-scored val tokens; blend with LM softmax. Approved. Projected moonshot -0.05 to -0.15 BPB.

Full descriptions in README §Late-Stage Promising Follow-Ups.

Files

records/track_non_record_16mb/2026-04-26_PostQuantDamageGap_11L_GPTQ_TTT_Curriculum/: README.md, submission.json, train_gpt.py, requirements.txt, figures/ (7 PNGs), train_log_seed{42,1337,2024}.log.

🤖 Generated with Claude Code

Adds records/track_non_record_16mb/2026-04-26_PostQuantDamageGap_11L_GPTQ_TTT_Curriculum/ Negative result documenting a +2.36 BPB gap between pre-quantization (1.10) and post-quantization (3.46) val_bpb on an 11L GQA transformer trained with entropy-bucket curriculum + speed levers in 600 s on 8xH100. Test-time training recovers ~0.70 BPB; final 3-seed mean is 2.7663 (below baseline). Includes diagrams + code excerpts for each unique technique (curriculum, GPTQ int6+int5, 2:4 sparsity, freeze-dry, Lloyd-Max, DualMLP, sliding TTT) and a ready-to-validate progressive depth-grown training mitigation plan.

…ection - Add 6 PNG figures: damage gap, per-seed bars, TTT recovery trajectory, curriculum schedule, freeze-dry residual histogram, 2:4 bit budget. - Rewrite README technique sections with explicit NOVEL / NOVEL ADAPTATION / PORT / STANDARD tags. Three submission-novel contributions called out with full algorithm + why-novel framing: entropy-bucket curriculum sampler (wallclock-driven crossfade + floor weight + pre-bucketed manifest); freeze-dry (2-neighbor LSQ linear-reconstruction storage filter); 2:4 sparsity packing (3-bit values + 4-bit position-pair encoding for storage-only, not compute). - Remove Compute Sponsorship Request section entirely (per user request); drop the corresponding TOC entry.

- Replace U+2014 em dash with ASCII hyphen across README, submission.json, train_gpt.py docstrings (50 sites total). - Rename trailing section header from "Footnote - On Honesty" to "Footnote".

Documents 6 next-step candidates surfaced from auditing phase6 results, phase2_speed ledger, and recent research docs. Three are documented wins not yet shipped (Pre-Quant AdamW TTT -0.014 BPB / COMP openai#1485, Post-Quant Calibration Loop -0.005..-0.020 BPB / IDEA-048, Lloyd-Max codebook re-wire -0.010..-0.030 BPB - artifact already on disk); three are cheap untested ideas (progressive depth-grown training, d07 sleep_replay re-validate, DualMLP-off diagnostic A/B). Each entry explains what the technique is, where it's documented, and the effort to ship/validate. New figure fig7_followups.png shows the projected BPB improvement bands.

Drop competitor ports and already-shipped items from the follow-ups list. Keep only the three novel ideas we developed: A. Progressive depth-grown training (code complete + CPU smoke-tested) B. Post-quantization calibration loop (specced, ~200 LOC, projected -0.005 to -0.020 BPB) C. Hard-batch replay during training (single-run signal: pre-quant val_bpb 1.2536 at 817K tok/s, post-quant never measured) Removed: Pre-Quant AdamW TTT (port from another competitor's PR); Lloyd-Max codebook re-wire (already in this submission's stack); DualMLP-off ablation (diagnostic, not a novel forward path). Each entry now explains what the technique IS in plain English, why it targets the post-quant damage gap, where it stands today, and what the concrete deliverable would be. No internal codenames or doc paths in user-facing text. New chart fig7_followups.png reflects the three.

Earlier draft only listed 3 follow-ups; pulled the full set of our world-novel candidates from docs/ideas/ that target the post-quant gap. Each entry: name + brief overview + status. A. Progressive Depth-Grown Training - softer minimum, code complete B. Post-Quantization Calibration Loop - activation rescue post-GPTQ C. Hard-Batch Replay During Training - measured pre-quant 1.2536 D. Vernier-Ladder Quantization - int7-effective at int6 cost E. Lossy-to-Lossless Correction Cascade - JPEG-style mixed precision F. Kombucha Compressibility Objective - last 60s loss biased to grid G. Bezier Control-Point Weight Factorization - 10K anchors not 36M weights H. Online N-Gram Cache (Moonshot) - approved, -0.05 to -0.15 BPB Chart fig7_followups.png redrawn with all 8 entries, three colors: projected from analysis (green), measured signal (blue), untested (grey).

Takoda Mundy added 6 commits April 26, 2026 02:57

PR: drop em dashes everywhere, rename Footnote heading

d8f1c0d

- Replace U+2014 em dash with ASCII hyphen across README, submission.json, train_gpt.py docstrings (50 sites total). - Rename trailing section header from "Footnote - On Honesty" to "Footnote".

X-Abhishek-X mentioned this pull request Apr 26, 2026

[non-record / wishlist] E2E TTT — full-model SGD per chunk, val_bpb 1.07063, demonstrates "healing property" #1837

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Non-record: Post-Quantization Damage Gap — 11L GPTQ Int6 + Curriculum + Sliding TTT (3-seed, 8xH100)#1818

Non-record: Post-Quantization Damage Gap — 11L GPTQ Int6 + Curriculum + Sliding TTT (3-seed, 8xH100)#1818
taka6745 wants to merge 6 commits intoopenai:mainfrom
taka6745:non-record-post-quant-damage-gap

taka6745 commented Apr 25, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

taka6745 commented Apr 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Track

Headline

Three novel techniques shipped (full algorithms in README)

3-seed results

Eight of our world-novel late-stage follow-ups

Files

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

taka6745 commented Apr 25, 2026 •

edited

Loading