Skip to content

Non-record: Post-Quantization Damage Gap — 11L GPTQ Int6 + Curriculum + Sliding TTT (3-seed, 8xH100)#1818

Open
taka6745 wants to merge 6 commits intoopenai:mainfrom
taka6745:non-record-post-quant-damage-gap
Open

Non-record: Post-Quantization Damage Gap — 11L GPTQ Int6 + Curriculum + Sliding TTT (3-seed, 8xH100)#1818
taka6745 wants to merge 6 commits intoopenai:mainfrom
taka6745:non-record-post-quant-damage-gap

Conversation

@taka6745
Copy link
Copy Markdown

@taka6745 taka6745 commented Apr 25, 2026

Track

track_non_record_16mb. Negative result, research contribution.

Headline

11L / 512d GQA transformer in 600 s on 8×H100 reaches pre-quant val_bpb 1.1009 (better than typical) but GPTQ catastrophically damages it: post-quant 3.4620 (+2.36 BPB), TTT recovers to 2.7663. Reproducible across 3 seeds (paired t ≈ 131, p < 0.001). The post-quantization damage gap is the contribution.

Three novel techniques shipped (full algorithms in README)

  1. Entropy-bucket curriculum sampler. Wallclock-driven easy-to-hard crossfade with floor weight, pre-bucketed manifest.
  2. Freeze-dry. 2-neighbor LSQ linear-reconstruction storage filter; drop weights well-predicted by their row neighbors.
  3. 2:4 sparsity packing. Storage-only adaptation: 3-bit values + 4-bit position-pair codes = 10 bits / 4-block (vs int6's 24).

3-seed results

Seed 42 Seed 1337 Seed 2024 Mean σ
pre-quant 1.1002 1.1022 1.1003 1.1009 0.0011
post-quant 3.4743 3.4422 3.4696 3.4620 0.0173
post-TTT 2.7285 2.7964 2.7741 2.7663 0.0346
artifact bytes 15,720,987 15,652,160 15,715,938 15,696,362 38,324

Eight of our world-novel late-stage follow-ups

All target the post-quant damage gap from one of four angles: softer training minimum, smarter quant grid, post-quant rescue, bigger eval-time predictor. Nothing here is a port from another competitor PR.

  • A. Progressive Depth-Grown Training. 3 → 6 → 11 layers with identity-init transitions; shorter full-depth window = softer minimum that survives quant. Code-complete, CPU smoke-tested.
  • B. Post-Quantization Calibration Loop. 5 iterations fitting LayerNorm scales/shifts + biases against the fp32 activation distribution after GPTQ. Designed. Projected -0.005 to -0.020 BPB.
  • C. Hard-Batch Replay During Training. Every 1000 steps replay the 100 highest-loss batches at 2× LR. Single-run hit pre-quant 1.2536 (below baseline). Post-quant never measured.
  • D. Vernier-Ladder Quantization. Two offset int6 grids + 1 sign bit per weight = int7-effective precision at int6 storage cost. Designed. Projected -0.003 to -0.015 BPB.
  • E. Lossy-to-Lossless Correction Cascade. Push tolerant MLP layers to int3/int4 + ship a tiny per-layer fp32 bias-correction vector. JPEG-style. Designed. Projected -0.003 to -0.015 BPB.
  • F. Kombucha Compressibility Objective. Last 60 s of training switches loss to CE + λ · L_compress to bias weights toward the int6 grid before GPTQ runs. Designed. Projected -0.002 to -0.012 BPB.
  • G. Bezier Control-Point Weight Factorization. Reconstruct 36M weights from ≤10K Bezier anchors at load time; bypasses GPTQ entirely. Designed. Projected -0.005 to -0.030 BPB.
  • H. Online N-Gram Cache (moonshot). Causal n-gram cache built from already-scored val tokens; blend with LM softmax. Approved. Projected moonshot -0.05 to -0.15 BPB.

Full descriptions in README §Late-Stage Promising Follow-Ups.

Files

records/track_non_record_16mb/2026-04-26_PostQuantDamageGap_11L_GPTQ_TTT_Curriculum/: README.md, submission.json, train_gpt.py, requirements.txt, figures/ (7 PNGs), train_log_seed{42,1337,2024}.log.

🤖 Generated with Claude Code

Takoda Mundy added 6 commits April 26, 2026 02:57
Adds records/track_non_record_16mb/2026-04-26_PostQuantDamageGap_11L_GPTQ_TTT_Curriculum/

Negative result documenting a +2.36 BPB gap between pre-quantization (1.10) and
post-quantization (3.46) val_bpb on an 11L GQA transformer trained with
entropy-bucket curriculum + speed levers in 600 s on 8xH100. Test-time training
recovers ~0.70 BPB; final 3-seed mean is 2.7663 (below baseline). Includes
diagrams + code excerpts for each unique technique (curriculum, GPTQ int6+int5,
2:4 sparsity, freeze-dry, Lloyd-Max, DualMLP, sliding TTT) and a
ready-to-validate progressive depth-grown training mitigation plan.
…ection

- Add 6 PNG figures: damage gap, per-seed bars, TTT recovery trajectory,
  curriculum schedule, freeze-dry residual histogram, 2:4 bit budget.
- Rewrite README technique sections with explicit NOVEL / NOVEL ADAPTATION /
  PORT / STANDARD tags. Three submission-novel contributions called out with
  full algorithm + why-novel framing: entropy-bucket curriculum sampler
  (wallclock-driven crossfade + floor weight + pre-bucketed manifest);
  freeze-dry (2-neighbor LSQ linear-reconstruction storage filter); 2:4
  sparsity packing (3-bit values + 4-bit position-pair encoding for
  storage-only, not compute).
- Remove Compute Sponsorship Request section entirely (per user request);
  drop the corresponding TOC entry.
- Replace U+2014 em dash with ASCII hyphen across README, submission.json,
  train_gpt.py docstrings (50 sites total).
- Rename trailing section header from "Footnote - On Honesty" to "Footnote".
Documents 6 next-step candidates surfaced from auditing phase6 results,
phase2_speed ledger, and recent research docs. Three are documented wins
not yet shipped (Pre-Quant AdamW TTT -0.014 BPB / COMP openai#1485,
Post-Quant Calibration Loop -0.005..-0.020 BPB / IDEA-048,
Lloyd-Max codebook re-wire -0.010..-0.030 BPB - artifact already on disk);
three are cheap untested ideas (progressive depth-grown training,
d07 sleep_replay re-validate, DualMLP-off diagnostic A/B).
Each entry explains what the technique is, where it's documented, and
the effort to ship/validate. New figure fig7_followups.png shows the
projected BPB improvement bands.
Drop competitor ports and already-shipped items from the follow-ups list.
Keep only the three novel ideas we developed:

  A. Progressive depth-grown training (code complete + CPU smoke-tested)
  B. Post-quantization calibration loop (specced, ~200 LOC, projected
     -0.005 to -0.020 BPB)
  C. Hard-batch replay during training (single-run signal: pre-quant
     val_bpb 1.2536 at 817K tok/s, post-quant never measured)

Removed: Pre-Quant AdamW TTT (port from another competitor's PR);
Lloyd-Max codebook re-wire (already in this submission's stack);
DualMLP-off ablation (diagnostic, not a novel forward path).

Each entry now explains what the technique IS in plain English, why it
targets the post-quant damage gap, where it stands today, and what the
concrete deliverable would be. No internal codenames or doc paths in
user-facing text. New chart fig7_followups.png reflects the three.
Earlier draft only listed 3 follow-ups; pulled the full set of our
world-novel candidates from docs/ideas/ that target the post-quant gap.
Each entry: name + brief overview + status.

A. Progressive Depth-Grown Training - softer minimum, code complete
B. Post-Quantization Calibration Loop - activation rescue post-GPTQ
C. Hard-Batch Replay During Training - measured pre-quant 1.2536
D. Vernier-Ladder Quantization - int7-effective at int6 cost
E. Lossy-to-Lossless Correction Cascade - JPEG-style mixed precision
F. Kombucha Compressibility Objective - last 60s loss biased to grid
G. Bezier Control-Point Weight Factorization - 10K anchors not 36M weights
H. Online N-Gram Cache (Moonshot) - approved, -0.05 to -0.15 BPB

Chart fig7_followups.png redrawn with all 8 entries, three colors:
projected from analysis (green), measured signal (blue), untested (grey).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant