
12L QAT Int4-MLP + Int6-Attn (Non-record) #910

Open
Meirzhan05 wants to merge 6 commits into openai:main from Meirzhan05:main

Conversation


@Meirzhan05 commented on Mar 26, 2026

12L QAT Int4-MLP + Int6-Attn (Non-record Submission)

Summary

  • Adds mixed-precision Quantization-Aware Training (int4 for MLP, int6 for attention) via Straight-Through Estimator from the start of training
  • Uses the ~3MB byte savings from int4 MLP compression to fund a 12th transformer layer (vs 11 in SOTA)
  • Halves the sliding window eval stride from 64 → 32 for better per-token context (a sketch of the eval loop follows this list)
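
A minimal sketch of what the stride change means for sliding-window evaluation (the function name, window size, and model API below are assumptions; the repo's actual eval loop may differ):

```python
import math

import torch
import torch.nn.functional as F


@torch.no_grad()
def sliding_window_bpb(model, ids, window=512, stride=32):
    """Sliding-window eval over a 1-D token tensor `ids`.

    Each window advances by `stride` tokens and only the not-yet-scored
    targets are counted, so halving the stride (64 -> 32) lets every
    scored token sit deeper in its window and see more left context,
    at the cost of roughly 2x more forward passes.
    """
    n = ids.numel()
    nll_sum, n_scored, prev_end = 0.0, 0, 0
    for begin in range(0, n - 1, stride):
        end = min(begin + window, n)
        x = ids[begin:end - 1].unsqueeze(0)    # inputs, shape (1, T)
        y = ids[begin + 1:end].unsqueeze(0)    # next-token targets
        logits = model(x)                      # (1, T, vocab) -- assumed API
        new = end - max(prev_end, begin + 1)   # targets no earlier window scored
        nll_sum += F.cross_entropy(logits[0, -new:], y[0, -new:],
                                   reduction="sum").item()
        n_scored += new
        prev_end = end
        if end == n:
            break
    # nats per token -> bits per token (equals BPB for a byte-level vocab)
    return nll_sum / n_scored / math.log(2)
```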

3-Seed Validation Results

| Seed | Post-EMA BPB | Post-Quant BPB | Artifact Size | Wall Time |
| --- | --- | --- | --- | --- |
| 42 | 1.1593 | 1.2234 | 12,972,053 B | 600s |
| 314 | 1.1603 | 1.2386 | 12,951,841 B | 600s |
| 999 | 1.1598 | 1.2255 | 12,970,681 B | 600s |
| Mean | 1.1598 | 1.2292 | 12.96 MB | |
| Std | 0.0005 | 0.0082 | | |

Note: This is a non-record submission. The post-quantization BPB (1.2292) does not beat the naive baseline (1.2244) due to a large quantization gap (~0.069 BPB). The QAT approach with int4 MLP quantization introduces too much distortion at this model scale.

Changes

| Component | SOTA #1 (1.1194) | This PR |
| --- | --- | --- |
| Layers | 11 | 12 |
| MLP precision | int6 | int4 (QAT) |
| Attn precision | int6 | int6 (QAT) |
| Eval stride | 64 | 32 |
| QAT | Late (last ~10%) | Full (from step 0) |

Implementation

QAT is applied directly in MLP.forward and CausalSelfAttention.forward on the banked weight tensors via _fake_quantize_ste (row-wise scale, STE gradient). Clip ranges: mlp_clip=7 (int4), attn_clip=31 (int6). Post-training quantization uses GPTQ-lite clip search with the same ranges.
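
A minimal sketch of the fake-quantize pattern described above, assuming symmetric row-wise absmax scaling and the quoted clip ranges (the actual `_fake_quantize_ste` in train_gpt.py may differ in detail):

```python
import torch


def _fake_quantize_ste(w: torch.Tensor, clip: int) -> torch.Tensor:
    """Row-wise symmetric fake quantization with a straight-through estimator.

    clip=7 covers the int4 levels -7..7 (mlp_clip); clip=31 covers the
    int6 levels -31..31 (attn_clip). Forward: each output row is scaled
    by its absmax, rounded to integer levels, clamped, and rescaled.
    Backward: the detach() trick makes the rounding invisible to autograd,
    so gradients flow to `w` as if quantization were the identity (STE).
    """
    scale = w.abs().amax(dim=-1, keepdim=True).clamp_min(1e-8) / clip
    q = (w / scale).round().clamp(-clip, clip) * scale
    return w + (q - w).detach()
```

In the forward paths the banked weight would then be used as, say, `w_q = _fake_quantize_ste(w, 7)` before the matmul, so the network trains against its own quantization noise from step 0.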

All techniques from PR #549 are inherited: LeakyReLU(0.5)², Legal Score-First TTT, Parallel Muon + Parameter Banking, XSA on last 4 layers, Partial RoPE (16/64 dims), LN Scale 1/sqrt(layer+1), EMA (0.997), SmearGate, BigramHash(2048), Value Embedding.

Test Plan

  • Verify artifact size < 16MB after LZMA compression (12.96 MB mean; see the size-check sketch after this list)
  • Verify training completes in ≤ 600s on 8xH100 SXM (600s all seeds)
  • Run 3-seed validation (seeds 42, 314, 999)
  • Confirm val_bpb improvement over baseline — not achieved (non-record)
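
A minimal sketch of the artifact-size check (packaging details assumed; the track's real submission script may serialize and compress differently):

```python
import io
import lzma

import torch

ARTIFACT_CAP = 16 * 1024 * 1024  # 16 MB cap from the track rules


def lzma_artifact_bytes(state_dict) -> int:
    """Serialize a checkpoint in memory and measure its LZMA-compressed size."""
    buf = io.BytesIO()
    torch.save(state_dict, buf)
    return len(lzma.compress(buf.getvalue(), preset=9))


# size = lzma_artifact_bytes(model.state_dict())
# assert size < ARTIFACT_CAP, f"artifact {size:,} B exceeds the 16 MB cap"
```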

Hardware

  • 8x H100 SXM 80GB (RunPod)
  • PyTorch 2.9.1+cu128
  • Flash Attention 3

@MatoTeziTanka

Community Review — 2L QAT Int4-MLP + Int6-Attn

BPB: (not parsed — see PR title) | Compliance: LOOKS CLEAN — score-first-per-chunk TTT (legal #1416/#1423 pattern)

What I found in the code (head SHA 8dd90897ab7b, file records/track_10min_16mb/2026-03-25_QAT_Int4MLP_12L/train_gpt.py):

The TTT path at line 1098 implements the score-first-per-chunk pattern: each chunk is scored under torch.no_grad() / inference_mode() before the base_model.train() + SGD adaptation runs on that same chunk, with an is_last_chunk guard so the final chunk gets no adaptation pass. This is the structural shape the legal frontier uses (PRs #1416 erichroepke, #1423 aryanbhosale).

Per Issue #402 and Issue #677, TTT is legal when each token is scored before the adapter updates on it, and that's what the code does here — chunk ci is scored under weights adapted only on chunks 0..ci-1. No prequant_ttt_adapt_adamw(val_tokens, ...) multi-epoch fine-tune, no scored-region SLOT, no target-in-key n-gram cache.
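
For readers unfamiliar with the pattern, a minimal sketch of score-first-per-chunk TTT (all names and the model API here are hypothetical stand-ins for the loop around line 1098 of train_gpt.py):

```python
import torch
import torch.nn.functional as F


def evaluate_with_ttt(model, chunks, opt):
    """Score-first-per-chunk TTT: chunk i is always scored under weights
    adapted only on chunks 0..i-1, and the last chunk gets no adaptation
    pass at all, since no later chunk would be scored under it."""
    total_nll, total_tok = 0.0, 0
    for i, (x, y) in enumerate(chunks):
        # 1) Score this chunk BEFORE any adaptation on it.
        model.eval()
        with torch.no_grad():
            logits = model(x)                   # (B, T, vocab) -- assumed API
            nll = F.cross_entropy(logits.flatten(0, 1), y.flatten(),
                                  reduction="sum")
        total_nll += nll.item()
        total_tok += y.numel()
        # 2) Adapt on the same, already-scored chunk (is_last_chunk guard).
        if i < len(chunks) - 1:
            model.train()
            opt.zero_grad(set_to_none=True)
            loss = F.cross_entropy(model(x).flatten(0, 1), y.flatten())
            loss.backward()
            opt.step()
    return total_nll / total_tok  # mean NLL in nats per token
```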

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.04s, dim=512, layers=12, vocab=1024, code=91173 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending standard checks (3-seed validation, 16MB artifact cap, 10-min wallclock on 8×H100 SXM). The compliance picture matches the legal reference frontier and no flags were raised by the classification pass.

Auto-classification caveat: this review was drafted by the AST-based classifier against a template derived from manually-reviewed cluster PRs (#1420, #1450, #1487, #1541, #1529, #1533, #1518). If I've misread a subtlety in your eval path — e.g., multi-epoch TTT that I mistook for single-pass, or a target-in-key lookup I missed in a helper function — please flag it and I'll re-run the audit manually.


Reviewed by @MatoTeziTanka (The Agora). Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting; if the template misread your code, please call it out so I can iterate the classifier.

@Meirzhan05 changed the title from 2L QAT Int4-MLP + Int6-Attn to 12L QAT Int4-MLP + Int6-Attn (Non-record) on Apr 22, 2026

@Meirzhan05 (Author) commented on Apr 22, 2026

@MatoTeziTanka
3-seed validation complete on 8×H100 SXM (RunPod):

| Seed | Post-Quant BPB | Artifact | Wall Time |
| --- | --- | --- | --- |
| 42 | 1.2234 | 12.97 MB | 600s |
| 314 | 1.2386 | 12.95 MB | 600s |
| 999 | 1.2255 | 12.97 MB | 600s |
| Mean | 1.2292 | 12.96 MB | |

All seeds pass artifact (<16MB) and wallclock (≤600s) requirements. Post-quant BPB does not beat the naive baseline (1.2244) due to the int4 MLP quantization gap (~0.069 BPB), so marking this as a non-record submission.

