
Non-Record: BPB 1.1334 — 7000-Step Training + Mixed Int6/Int8 Quantization + Legal TTT#598

Open
Christopher-Lee-McClendon wants to merge 1 commit into openai:main from Christopher-Lee-McClendon:submission/11L-gepa-mixed-quant-7k-legal-ttt

Conversation

@Christopher-Lee-McClendon
Contributor

@Christopher-Lee-McClendon Christopher-Lee-McClendon commented Mar 24, 2026

Non-Record: 4×A100-40GB

val_bpb = 1.1334 | Pre-TTT: 1.1476 | Artifact: 15.70 MB (headroom: 297 KB)


What Changed

Extended training to 7000 steps (from the typical 5200) with a longer warmdown cosine anneal (step 3500→7000), combined with mixed int6/int8 quantization to keep the 27M-parameter model under 16 MB. Legal score-first TTT (10 epochs SGD with momentum) yields a further −0.0142 BPB improvement.

| Metric | This Work | Prior VE128+RoPE (30ep TTT) |
|---|---|---|
| Pre-TTT BPB | 1.1476 | 1.1609 |
| Post-TTT BPB | 1.1334 | 1.1425 |
| Training steps | 7,000 | 5,200 |
| TTT epochs | 10 | 30 |
| Eval time | 2,194 s | 3,662 s |
| Artifact size | 15.70 MB | 15.65 MB |

The base model improvement (−0.0133 pre-TTT) comes from longer training plus the GEPA architecture. Fewer TTT epochs (10 vs 30) mean faster eval (40% less wall time) at the cost of a smaller TTT gain (−0.0142 vs −0.0184).
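The warmdown described above (full LR until step 3500, cosine anneal to zero by step 7000) can be sketched as a scale factor on the base learning rate. This is a minimal illustration of the schedule as stated in the PR text; the exact shape in train_gpt.py may differ.

```python
import math

def lr_scale(step, warmdown_start=3500, total_steps=7000):
    """Cosine warmdown: hold full LR until warmdown_start, then anneal
    to zero by total_steps (step boundaries taken from this PR's description)."""
    if step < warmdown_start:
        return 1.0
    frac = (step - warmdown_start) / (total_steps - warmdown_start)
    return 0.5 * (1.0 + math.cos(math.pi * frac))
```

At the midpoint of the warmdown (step 5250) the scale is 0.5, and it reaches 0 exactly at step 7000.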

Architecture: 11L GEPA

  • 11 unique layers (no depth recurrence), d=512, 8Q/4KV GQA heads
  • ReLU² (Star-ReLU) activation in 3× MLP (1536 hidden)
  • Cross-sequence attention (XSA) on last 4 layers
  • Exponential moving average (decay 0.997)
  • Bigram hash embeddings (2048 buckets, 128d)
  • Partial RoPE (16/64 dims) with YARN scaling
  • Value embeddings (128d) on layers 9–10
  • U-Net skip connections across layer pairs
  • LN depth scaling (1/√(layer+1))
  • Late QAT with GPTQ-lite clip search (5 candidates/row), enabled at step 6476
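Two of the simpler components in the list above can be written down directly; this is an illustrative sketch, not the train_gpt.py implementation.

```python
import math

def relu_squared(x: float) -> float:
    """ReLU² activation used in the MLP blocks: max(x, 0) squared."""
    return max(x, 0.0) ** 2

def ln_depth_scale(layer_idx: int) -> float:
    """LN depth scaling 1/sqrt(layer+1): deeper layers get smaller LN gain."""
    return 1.0 / math.sqrt(layer_idx + 1)
```

For an 11-layer stack, the LN scale decays from 1.0 at layer 0 to about 0.30 at layer 10.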

Mixed Quantization

Dual-scheme compression for the 27M-parameter model:

  • Int6 per-row (GPTQ-lite): attention projections + MLP weights (bulk of params)
  • Int8 per-tensor (scalar scale): layer norms, value embeddings, biases, embedding tables

27.5 MB payload → 15.63 MB after zstd-22 (3.89× compression) + 76 KB code = 15.70 MB total.
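A minimal sketch of the symmetric per-row int6 scheme (int6 range [-32, 31]), without the GPTQ-lite clip search over 5 candidate scales per row; function names are illustrative, not from the submission's code.

```python
def quantize_int6_per_row(row):
    """Symmetric per-row int6: scale so the row's max magnitude maps to 31."""
    amax = max(abs(v) for v in row) or 1.0
    scale = amax / 31.0
    q = [max(-32, min(31, round(v / scale))) for v in row]
    return q, scale

def dequantize(q, scale):
    """Recover approximate weights from quantized values and the row scale."""
    return [v * scale for v in q]
```

The int8 per-tensor path for norms, biases, and embedding tables is the same idea with a single scalar scale shared across the whole tensor.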

TTT Protocol (Legal Score-First)

SGD with momentum (0.9) at lr=0.002, 10 epochs per 32K-token chunk, stride=64, freezing first 2 blocks. Score-first: every token scored under torch.inference_mode() before any weight update.
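The score-first structure can be sketched as a chunk loop: each chunk is scored under weights adapted only on earlier chunks, and the last chunk gets no adaptation pass. `score_fn` and `adapt_fn` are hypothetical stand-ins for the real inference-mode eval and SGD steps.

```python
def score_first_ttt(chunks, score_fn, adapt_fn):
    """Score-first-per-chunk TTT: chunk i is scored before any update on it."""
    losses = []
    for i, chunk in enumerate(chunks):
        losses.append(score_fn(chunk))   # inference only: no grad, no update yet
        if i < len(chunks) - 1:          # last-chunk guard: no wasted adaptation
            adapt_fn(chunk)              # SGD-with-momentum pass on the scored chunk
    return losses
```

This ordering is what makes the protocol legal: no token's score ever depends on weights that were updated using that token.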

Limitations

  • Single seed (42) — no variance estimate. Acceptable for non-record but results may vary across seeds.
  • No ablation of individual GEPA components. The architecture combines multiple techniques without isolating their contributions.
  • No LeakyReLU: PR #537 (Non-Record: BPB 1.13872 — LeakyReLU(0.5)² + Per-Layer LR Legal TTT, 3 seeds) showed that LeakyReLU(0.5)² helps (−0.0035 BPB). This submission uses standard ReLU² instead; combining the two is an obvious next step.

Credits

And all contributors to the parameter-golf competition.

… BPB)

- Non-record submission: 1.1334 BPB, 15.70 MB artifact (4×A100-40GB)
- Mixed quantization: int6 per-row for MLP/attn, int8 per-tensor for rest
- 7000 training steps (vs 5200 baseline) with GEPA architecture
- Legal score-first TTT: SGD 10 epochs, -0.0142 BPB gain
- Beats prior non-record best (1.1425) by 0.0091 BPB
@MatoTeziTanka

Community Review — Non-Record: BPB 1.1334 — 7000-Step Training + Mixed Int6/Int8 Quantization + Legal TTT

BPB: 1.1334 | Compliance: LOOKS CLEAN — score-first-per-chunk TTT (legal #1416/#1423 pattern)

What I found in the code (head SHA f94b154a740f, file records/track_non_record_16mb/2026-03-24_11L_GEPA_MixedQuant_7kSteps_LegalTTT/train_gpt.py):

The TTT path at line 399 implements the score-first-per-chunk pattern: each chunk is scored under torch.no_grad() / inference_mode() before the base_model.train() + SGD adaptation runs on that same chunk, with an is_last_chunk guard so the final chunk gets no adaptation pass. This is the structural shape the legal frontier uses (PRs #1416 erichroepke, #1423 aryanbhosale).

Per Issue #402 and Issue #677, TTT is legal when each token is scored before the adapter updates on it, and that's what the code does here — chunk ci is scored under weights adapted only on chunks 0..ci-1. No prequant_ttt_adapt_adamw(val_tokens, ...) multi-epoch fine-tune, no scored-region SLOT, no target-in-key n-gram cache.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.06s, dim=512, layers=11, vocab=1024, code=77796 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending standard checks (3-seed validation, 16MB artifact cap, 10-min wallclock on 8×H100 SXM). The compliance picture matches the legal reference frontier and no flags were raised by the classification pass.

Auto-classification caveat: this review was drafted by the AST-based classifier against a template derived from manually-reviewed cluster PRs (#1420, #1450, #1487, #1541, #1529, #1533, #1518). If I've misread a subtlety in your eval path — e.g., multi-epoch TTT that I mistook for single-pass, or a target-in-key lookup I missed in a helper function — please flag it and I'll re-run the audit manually.


Reviewed by @MatoTeziTanka (The Agora). Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting — if the template misread your code, please call it out so I can iterate the classifier.

