Non-record: Post-Quantization Damage Gap — 11L GPTQ Int6 + Curriculum + Sliding TTT (3-seed, 8xH100) #1818
Open
taka6745 wants to merge 6 commits into openai:main from
Conversation
added 6 commits on April 26, 2026 02:57
Adds records/track_non_record_16mb/2026-04-26_PostQuantDamageGap_11L_GPTQ_TTT_Curriculum/. Negative result documenting a +2.36 BPB gap between pre-quantization (1.10) and post-quantization (3.46) val_bpb on an 11L GQA transformer trained with entropy-bucket curriculum + speed levers in 600 s on 8xH100. Test-time training recovers ~0.70 BPB; the final 3-seed mean is 2.7663 (below baseline). Includes diagrams + code excerpts for each unique technique (curriculum, GPTQ int6+int5, 2:4 sparsity, freeze-dry, Lloyd-Max, DualMLP, sliding TTT) and a ready-to-validate progressive depth-grown training mitigation plan.
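For context on the recovery mechanism named above, here is a minimal sketch of what a sliding-window test-time-training loop can look like. All names, signatures, and hyperparameters (`window`, `stride`, `lr`, the `model(x) -> logits` interface) are illustrative assumptions, not the submission's actual train_gpt.py code:

```python
import torch
import torch.nn.functional as F

def sliding_ttt(model, tokens, window=512, stride=256, lr=1e-4, steps=1):
    """Illustrative sliding-window test-time training.

    Before scoring each evaluation chunk, take a few gradient steps on the
    preceding window of already-seen tokens so the quantized model adapts
    to the local distribution. Hyperparameters are placeholders.
    """
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for start in range(window, tokens.size(0) - stride, stride):
        ctx = tokens[start - window:start].unsqueeze(0)   # adaptation context
        for _ in range(steps):
            opt.zero_grad()
            logits = model(ctx[:, :-1])
            loss = F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), ctx[:, 1:].reshape(-1))
            loss.backward()
            opt.step()
        # ...score the next `stride` tokens with the adapted model here...
```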
…ection
- Add 6 PNG figures: damage gap, per-seed bars, TTT recovery trajectory, curriculum schedule, freeze-dry residual histogram, 2:4 bit budget.
- Rewrite README technique sections with explicit NOVEL / NOVEL ADAPTATION / PORT / STANDARD tags. Three submission-novel contributions called out with full algorithm + why-novel framing: entropy-bucket curriculum sampler (wallclock-driven crossfade + floor weight + pre-bucketed manifest); freeze-dry (2-neighbor LSQ linear-reconstruction storage filter); 2:4 sparsity packing (3-bit values + 4-bit position-pair encoding, storage-only, not compute). Sketches of all three follow below.
- Remove the Compute Sponsorship Request section entirely (per user request); drop the corresponding TOC entry.
- Replace U+2014 em dash with ASCII hyphen across README, submission.json, and train_gpt.py docstrings (50 sites total).
- Rename the trailing section header from "Footnote - On Honesty" to "Footnote".
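A sketch of the entropy-bucket curriculum sampler as described above: a wallclock-driven crossfade over pre-bucketed shards with a floor weight. Assumptions (bucket 0 is the lowest-entropy bucket, the crossfade is linear in elapsed time, all names such as `EntropyBucketSampler` and `manifest` are hypothetical):

```python
import random
import time

class EntropyBucketSampler:
    """Illustrative wallclock-driven curriculum over pre-bucketed shards.

    `manifest` maps bucket id (0 = lowest byte-entropy) to a list of shard
    paths. Early in the time budget, the easiest bucket dominates; weights
    crossfade linearly toward uniform, and `floor` keeps every bucket
    sampled with nonzero probability throughout.
    """
    def __init__(self, manifest, total_seconds=600.0, floor=0.05):
        self.manifest = manifest
        self.t0 = time.time()
        self.total = total_seconds
        self.floor = floor

    def weights(self):
        frac = min((time.time() - self.t0) / self.total, 1.0)
        n = len(self.manifest)
        w = []
        for b in range(n):
            easy = 1.0 if b == 0 else 0.0   # start: mass on the easiest bucket
            uniform = 1.0 / n               # end: uniform over all buckets
            w.append(max(self.floor, (1 - frac) * easy + frac * uniform))
        s = sum(w)
        return [x / s for x in w]

    def next_shard(self):
        bucket = random.choices(range(len(self.manifest)), self.weights())[0]
        return random.choice(self.manifest[bucket])
```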
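The freeze-dry description is terse; under one plausible reading (fit one pair of neighbor coefficients per tensor by least squares, store only the residual outliers, regenerate the rest at load time), a sketch looks like this. `tol` and all names are assumptions:

```python
import numpy as np

def freeze_dry(w, tol=1e-3):
    """Illustrative 2-neighbor least-squares storage filter.

    Fit a single (a, b) per flattened tensor so w[i] ~= a*w[i-1] + b*w[i+1],
    then store only the indices/values where the linear reconstruction
    misses by more than `tol`. One plausible reading of the README's
    description, not the submission's actual code.
    """
    x = np.stack([w[:-2], w[2:]], axis=1)        # the 2 neighbors of w[1:-1]
    y = w[1:-1]
    (a, b), *_ = np.linalg.lstsq(x, y, rcond=None)
    recon = a * w[:-2] + b * w[2:]
    bad = np.abs(recon - y) > tol                # residuals worth storing
    return (a, b), np.flatnonzero(bad) + 1, y[bad]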
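And a hedged sketch of the 2:4 packing arithmetic: the 6 possible position pairs fit in the stated 4-bit field, and each surviving value gets a 3-bit code. The per-tensor scale and rounding policy below are assumptions, not the submission's exact format:

```python
import itertools
import numpy as np

# The 6 ways to choose 2 live positions out of 4; the index fits in 4 bits.
PAIRS = list(itertools.combinations(range(4), 2))

def pack_2of4(w, n_levels=8):
    """Illustrative storage-only 2:4 packing: per group of 4 weights, keep
    the 2 largest-magnitude values (3-bit quantized) plus a 4-bit index of
    their positions. Assumes len(w) is divisible by 4."""
    groups = w.reshape(-1, 4)
    scale = np.abs(groups).max() or 1.0
    packed = []
    for g in groups:
        keep = tuple(sorted(np.argsort(np.abs(g))[-2:]))  # 2 live positions
        pos_idx = PAIRS.index(keep)                       # 4-bit field
        q = np.clip(np.round((g[list(keep)] / scale + 1) / 2 * (n_levels - 1)),
                    0, n_levels - 1).astype(np.uint8)     # two 3-bit values
        packed.append((pos_idx, q[0], q[1]))
    return packed, scale
```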
Documents 6 next-step candidates surfaced from auditing phase6 results, the phase2_speed ledger, and recent research docs. Three are documented wins not yet shipped:
- Pre-Quant AdamW TTT: -0.014 BPB (COMP openai#1485)
- Post-Quant Calibration Loop: -0.005 to -0.020 BPB (IDEA-048)
- Lloyd-Max codebook re-wire: -0.010 to -0.030 BPB (artifact already on disk; see the sketch below)
Three are cheap untested ideas: progressive depth-grown training, a d07 sleep_replay re-validate, and a DualMLP-off diagnostic A/B. Each entry explains what the technique is, where it's documented, and the effort to ship/validate. New figure fig7_followups.png shows the projected BPB improvement bands.
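Of the three documented wins, the Lloyd-Max re-wire is the most standard piece. For readers unfamiliar with it, a textbook Lloyd-Max iteration looks roughly like this (illustrative; the on-disk codebook artifact may have been computed differently):

```python
import numpy as np

def lloyd_max(samples, n_levels=64, iters=50):
    """Standard Lloyd-Max scalar quantizer design: alternate between
    decision boundaries at the midpoints between levels and levels at the
    conditional mean of the samples inside each cell. n_levels=64 matches
    an int6 codebook."""
    levels = np.quantile(samples, np.linspace(0, 1, n_levels))  # init
    for _ in range(iters):
        bounds = (levels[:-1] + levels[1:]) / 2     # midpoint boundaries
        cells = np.digitize(samples, bounds)        # assign each sample
        for k in range(n_levels):
            sel = samples[cells == k]
            if sel.size:
                levels[k] = sel.mean()              # conditional mean update
    return np.sort(levels)
```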
Drop competitor ports and already-shipped items from the follow-ups list.
Keep only the three novel ideas we developed:
A. Progressive depth-grown training (code complete + CPU smoke-tested; see sketch below)
B. Post-quantization calibration loop (specced, ~200 LOC, projected -0.005 to -0.020 BPB; see sketch below)
C. Hard-batch replay during training (single-run signal: pre-quant val_bpb 1.2536 at 817K tok/s, post-quant never measured)
Removed: Pre-Quant AdamW TTT (port from another competitor's PR);
Lloyd-Max codebook re-wire (already in this submission's stack);
DualMLP-off ablation (diagnostic, not a novel forward path).
Each entry now explains what the technique IS in plain English, why it
targets the post-quant damage gap, where it stands today, and what the
concrete deliverable would be. No internal codenames or doc paths in
user-facing text. New chart fig7_followups.png reflects the three.
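Item A can be pictured with a short sketch. The block and projection names (`attn.c_proj`, `mlp.c_proj`) are placeholders, not the PR's smoke-tested code; zero-initializing the residual-branch output projections is the usual trick that makes the inserted block start as an identity map, so the loss surface changes smoothly:

```python
import copy
import torch.nn as nn

def grow_depth(model_blocks: nn.ModuleList, insert_at: int):
    """Illustrative depth-growth step: insert a near-identity copy of an
    existing transformer block mid-training."""
    new_block = copy.deepcopy(model_blocks[insert_at])
    for proj in (new_block.attn.c_proj, new_block.mlp.c_proj):
        nn.init.zeros_(proj.weight)        # residual branch starts at ~0,
        if proj.bias is not None:          # so the new block is an identity map
            nn.init.zeros_(proj.bias)
    model_blocks.insert(insert_at + 1, new_block)
```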
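Item B, similarly hedged: a post-quantization calibration loop typically freezes the quantized weight matrices and tunes only the parameters GPTQ leaves in float (norms, biases) on a small calibration set. The parameter-name filter below is an assumption, not the ~200-LOC spec:

```python
import torch
import torch.nn.functional as F

def calibrate_post_quant(model, calib_batches, lr=1e-3, epochs=1):
    """Illustrative post-quantization activation rescue: fine-tune only the
    float leftovers against a handful of calibration batches."""
    for name, p in model.named_parameters():
        p.requires_grad = ("norm" in name) or name.endswith("bias")
    trainable = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.AdamW(trainable, lr=lr)
    for _ in range(epochs):
        for x, y in calib_batches:
            opt.zero_grad()
            logits = model(x)
            F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                            y.reshape(-1)).backward()
            opt.step()
```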
Earlier draft only listed 3 follow-ups; pulled the full set of our world-novel candidates from docs/ideas/ that target the post-quant gap. Each entry: name + brief overview + status.
A. Progressive Depth-Grown Training - softer minimum, code complete
B. Post-Quantization Calibration Loop - activation rescue post-GPTQ
C. Hard-Batch Replay During Training - measured pre-quant 1.2536 (sketched below)
D. Vernier-Ladder Quantization - int7-effective at int6 cost
E. Lossy-to-Lossless Correction Cascade - JPEG-style mixed precision
F. Kombucha Compressibility Objective - last 60s loss biased to grid
G. Bezier Control-Point Weight Factorization - 10K anchors, not 36M weights
H. Online N-Gram Cache (Moonshot) - approved, -0.05 to -0.15 BPB
Chart fig7_followups.png redrawn with all 8 entries, three colors: projected from analysis (green), measured signal (blue), untested (grey).
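Entry C is the only one with a measured pre-quant signal. A minimal replay buffer of the kind it implies could look like the following; the eviction policy and constants are placeholders, not the measured configuration:

```python
import heapq
import itertools

class HardBatchReplay:
    """Illustrative hard-batch replay: keep the `capacity` training batches
    with the highest loss seen so far and mix one back in every `every`
    steps."""
    def __init__(self, capacity=32, every=10):
        self.heap, self.capacity, self.every = [], capacity, every
        self._tie = itertools.count()          # tie-break equal losses

    def observe(self, loss, batch):
        item = (loss, next(self._tie), batch)
        if len(self.heap) < self.capacity:
            heapq.heappush(self.heap, item)    # min-heap root = easiest kept
        elif loss > self.heap[0][0]:
            heapq.heapreplace(self.heap, item)

    def maybe_replay(self, step):
        if self.heap and step % self.every == 0:
            return max(self.heap)[2]           # replay the hardest batch
        return None
```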
Track
track_non_record_16mb. Negative result, research contribution.
Headline
11L / 512d GQA transformer in 600 s on 8×H100 reaches pre-quant val_bpb 1.1009 (better than typical) but GPTQ catastrophically damages it: post-quant 3.4620 (+2.36 BPB), TTT recovers to 2.7663. Reproducible across 3 seeds (paired t ≈ 131, p < 0.001). The post-quantization damage gap is the contribution.
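The headline statistic is a paired t-test over per-seed (pre, post) pairs. A sketch of the computation, with hypothetical per-seed values chosen only for illustration (the real numbers live in train_log_seed{42,1337,2024}.log):

```python
from scipy.stats import ttest_rel

# Hypothetical per-seed val_bpb values, for illustration only.
pre_quant  = [1.096, 1.101, 1.104]
post_quant = [3.430, 3.462, 3.494]

t, p = ttest_rel(post_quant, pre_quant)   # paired across seeds, df = 2
print(f"paired t = {t:.1f}, p = {p:.2g}")
```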
Three novel techniques shipped (full algorithms in README)
3-seed results
Eight of our world-novel late-stage follow-ups
All target the post-quant damage gap from one of four angles: softer training minimum, smarter quant grid, post-quant rescue, bigger eval-time predictor. Nothing here is a port from another competitor PR.
Kombucha compressibility objective: loss becomes CE + λ · L_compress to bias weights toward the int6 grid before GPTQ runs. Designed. Projected -0.002 to -0.012 BPB. Full descriptions in README §Late-Stage Promising Follow-Ups.
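A hedged sketch of what that regularizer could look like, assuming a symmetric per-tensor int6 grid (the PR only states the combined-loss form):

```python
import torch

def compress_reg(model, bits=6):
    """Illustrative L_compress: mean squared distance of each weight to its
    nearest point on a symmetric per-tensor int6 grid. The grid definition
    is an assumption, not the PR's spec."""
    half = 2 ** (bits - 1) - 1                     # 31 levels per sign for int6
    reg = 0.0
    for w in model.parameters():
        scale = w.detach().abs().max() / half + 1e-12
        grid = (torch.clamp(torch.round(w / scale), -half, half) * scale).detach()
        reg = reg + ((w - grid) ** 2).mean()       # pulls weights toward the grid
    return reg
```

The combined objective would then be something like `loss = ce + lam * compress_reg(model)`, applied only in the final ~60 s per the commit's entry F.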
Files
records/track_non_record_16mb/2026-04-26_PostQuantDamageGap_11L_GPTQ_TTT_Curriculum/: README.md, submission.json, train_gpt.py, requirements.txt, figures/ (7 PNGs), train_log_seed{42,1337,2024}.log.

🤖 Generated with Claude Code