Non-Record: BPB 1.1334 — 7000-Step Training + Mixed Int6/Int8 Quantization + Legal TTT #598
- Non-record submission: 1.1334 BPB, 15.70 MB artifact (4×A100-40GB)
- Mixed quantization: int6 per-row for MLP/attn, int8 per-tensor for rest
- 7000 training steps (vs 5200 baseline) with GEPA architecture
- Legal score-first TTT: SGD, 10 epochs, −0.0142 BPB gain
- Beats prior non-record best (1.1425) by 0.0091 BPB
Community Review
BPB: 1.1334 | Compliance: LOOKS CLEAN (score-first-per-chunk TTT, the legal #1416/#1423 pattern)

What I found in the code (head SHA …): the TTT path at line 399 implements the score-first-per-chunk pattern: each chunk is scored under torch.inference_mode() before the adapter updates on it. Per Issue #402 and Issue #677, TTT is legal when each token is scored before the adapter updates on it, and that is what this code does, chunk by chunk.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.06s, dim=512, layers=11, vocab=1024, code=77796 B, SMOKE_TEST_PASS.

Verdict: LOOKS CLEAN. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending standard checks (3-seed validation, 16 MB artifact cap, 10-min wallclock on 8×H100 SXM). The compliance picture matches the legal reference frontier and no flags were raised by the classification pass.

Auto-classification caveat: this review was drafted by the AST-based classifier against a template derived from manually-reviewed cluster PRs (#1420, #1450, #1487, #1541, #1529, #1533, #1518). If I've misread a subtlety in your eval path, e.g. multi-epoch TTT that I mistook for single-pass, or a target-in-key lookup I missed in a helper function, please flag it and I'll re-run the audit manually. Classification via deterministic AST-based classifier.

Reviewed by @MatoTeziTanka (The Agora).
Non-Record: 4×A100-40GB
val_bpb = 1.1334 | Pre-TTT: 1.1476 | Artifact: 15.70 MB (headroom: 297 KB)
What Changed
Extended training to 7000 steps (from the typical 5200) with a longer warmdown cosine anneal (step 3500→7000), combined with mixed int6/int8 quantization to keep the 27M-parameter model under 16 MB. Legal score-first TTT (10 epochs SGD with momentum) yields a further −0.0142 BPB improvement.
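A minimal sketch of that schedule, assuming a constant-then-cosine shape; the peak LR value and the function name are illustrative, not taken from the submission:

```python
import math

def lr_schedule(step: int, peak_lr: float = 3e-3,
                warmdown_start: int = 3500, total_steps: int = 7000) -> float:
    """Constant LR until warmdown_start, then cosine anneal to 0 at total_steps.

    peak_lr is a placeholder; the submission does not state its base LR.
    """
    if step < warmdown_start:
        return peak_lr
    progress = (step - warmdown_start) / (total_steps - warmdown_start)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```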
The base model improvement (−0.0133 pre-TTT) comes from longer training plus the GEPA architecture. Fewer TTT epochs (10 vs 30) mean faster eval (40% less wall time) at the cost of a smaller TTT gain (−0.0142 vs −0.0184).
Architecture: 11L GEPA
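For reference, the dimensions reported by the smoke test (dim=512, layers=11, vocab=1024) suggest a config along these lines; GEPA-specific fields are not described in the PR, so this dataclass is purely illustrative:

```python
from dataclasses import dataclass

@dataclass
class GEPAConfig:
    # Values below match the CPU smoke test output; everything else is unknown.
    n_layers: int = 11      # "layers=11"
    d_model: int = 512      # "dim=512"
    vocab_size: int = 1024  # "vocab=1024"
    # n_heads, MLP width, and GEPA-specific knobs are not stated in the PR.
```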
Mixed Quantization
Dual-scheme compression for the 27M-parameter model:
27.5 MB quantized payload (≈3.89× smaller than the fp32 weights) → 15.63 MB after zstd-22, plus 76 KB of code = 15.70 MB total.
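A minimal sketch of the two schemes described above, assuming symmetric quantization; the function names, clamp ranges, and the zstd step are illustrative, not lifted from the submission:

```python
import torch
import zstandard as zstd  # hypothetical packaging step; the PR only names "zstd-22"

def quantize_int6_per_row(w: torch.Tensor):
    """Per-row symmetric int6: one fp scale per output row, values in [-31, 31]."""
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 31.0
    q = (w / scale).round().clamp(-31, 31).to(torch.int8)  # 6-bit values in int8 storage
    return q, scale

def quantize_int8_per_tensor(w: torch.Tensor):
    """Per-tensor symmetric int8: a single fp scale, values in [-127, 127]."""
    scale = w.abs().amax().clamp(min=1e-8) / 127.0
    q = (w / scale).round().clamp(-127, 127).to(torch.int8)
    return q, scale

def pack_payload(raw: bytes) -> bytes:
    """zstd level 22, matching the zstd-22 step in the size accounting."""
    return zstd.ZstdCompressor(level=22).compress(raw)
```

Storing 6-bit values in int8 containers and letting zstd reclaim the slack is one plausible reading of the ≈1 byte/param payload; the actual bit-packing details are not in the PR.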
TTT Protocol (Legal Score-First)
SGD with momentum (0.9) at lr=0.002, 10 epochs per 32K-token chunk, stride=64, with the first 2 blocks frozen. Score-first: every token is scored under torch.inference_mode() before any weight update.
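A minimal sketch of the score-first-per-chunk loop, assuming the model returns per-token loss from a standard forward and exposes its transformer blocks as model.blocks; the chunking helpers and names are illustrative:

```python
import torch

def ttt_eval(model, chunks, lr=0.002, momentum=0.9, epochs=10, frozen_blocks=2):
    """Score-first TTT: each chunk is fully scored before the model trains on it."""
    for block in model.blocks[:frozen_blocks]:  # freeze the first 2 blocks
        block.requires_grad_(False)
    params = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.SGD(params, lr=lr, momentum=momentum)

    total_loss, total_tokens = 0.0, 0
    for chunk in chunks:  # 32K-token chunks
        # 1) Score first: this chunk's contribution to val loss is fixed
        #    before any update sees it.
        with torch.inference_mode():
            loss = model(chunk).mean()
        total_loss += loss.item() * chunk.numel()
        total_tokens += chunk.numel()
        # 2) Then adapt: 10 epochs of SGD on the already-scored chunk.
        for _ in range(epochs):
            opt.zero_grad()
            model(chunk).mean().backward()
            opt.step()
    return total_loss / total_tokens  # nats/token; BPB conversion happens downstream
```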
Limitations
Credits
And all contributors to the parameter-golf competition.