Non-record: 10L E2E TTT-Linear + FlowRefiner — val_bpb 1.1335 ± 0.0010 (4-seed mean)#1166
Conversation
- 10-layer, 512D, E2E TTT-Linear + 1-step FlowRefiner
- val_bpb 1.13472408 (int6 sliding window, stride=64, seed=42)
- Artifact: 15,199,107 bytes (800K headroom under 16MB cap)
- BigramHash(1536), LeakyReLU(0.5)², mixed int6/int8 + lzma
- Includes three-variant size-quality comparison (11L/10L/int5)
- Trained on 2×A100 PCIe 40GB, 7185 steps, ~2.2 hours
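For readers unfamiliar with the packing step, the sketch below shows roughly what mixed-precision quantization plus lzma packing can look like; the tensor names, rounding scheme, and payload layout are illustrative assumptions, not the actual artifact format used here.

```python
# Illustrative sketch only: symmetric per-tensor quantization to int6/int8
# followed by lzma compression, with a check against the 16 MB artifact cap.
# Tensor names, rounding scheme, and payload layout are assumptions,
# not this PR's actual artifact format.
import lzma
import numpy as np

def quantize_symmetric(w: np.ndarray, bits: int) -> tuple[np.ndarray, float]:
    """Round to signed `bits`-bit integers with a single per-tensor scale."""
    qmax = 2 ** (bits - 1) - 1                      # 31 for int6, 127 for int8
    scale = float(np.abs(w).max()) / qmax or 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def pack_artifact(tensors: dict, bits_per_tensor: dict) -> bytes:
    """Quantize each tensor and lzma-compress the concatenated payload."""
    payload = bytearray()
    for name, w in tensors.items():
        q, scale = quantize_symmetric(w, bits_per_tensor.get(name, 6))
        payload += np.float32(scale).tobytes() + q.tobytes()
    return lzma.compress(bytes(payload), preset=9 | lzma.PRESET_EXTREME)

# blob = pack_artifact(weights, {"embed": 8})       # e.g. int8 embeddings, int6 elsewhere
# assert len(blob) < 16 * 2**20                     # stay under the 16 MB cap
```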
Seeds 42, 99, 1337, 2025 all completed successfully on 2×A100 PCIe 40GB. Mean sliding-window BPB: 1.13353 ± 0.00095 (4-seed std). Range: [1.13269, 1.13472]. All artifacts under 16MB cap (15.1-15.2 MB). Includes training logs and SLURM scripts for all seeds in supplementary/.
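As a rough illustration of the eval protocol, here is a minimal sliding-window BPB sketch with stride=64; the window size, model interface, and one-token-per-byte conversion are placeholders rather than the harness that produced the numbers above.

```python
# Minimal sketch of a stride-64 sliding-window eval, assuming the usual convention of
# scoring only the tokens not yet covered by an earlier window, each under full left
# context. `model`, `window`, and the byte-level assumption are placeholders.
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_bpb(model, tokens: torch.Tensor, window: int = 512, stride: int = 64) -> float:
    """tokens: 1-D LongTensor of ids for the validation split."""
    total_nll, total_scored, prev_end = 0.0, 0, 0
    for begin in range(0, tokens.numel() - 1, stride):
        end = min(begin + window, tokens.numel() - 1)
        inp = tokens[begin:end].unsqueeze(0)          # (1, T) inputs
        tgt = tokens[begin + 1:end + 1]               # (T,) next-token targets
        logits = model(inp)[0]                        # (T, vocab)
        n_new = end - prev_end                        # positions not scored by earlier windows
        total_nll += F.cross_entropy(logits[-n_new:], tgt[-n_new:], reduction="sum").item()
        total_scored += n_new
        prev_end = end
        if end >= tokens.numel() - 1:
            break
    # nats per token -> bits, assuming one token per byte
    return total_nll / total_scored / math.log(2)
```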
- FlowRefiner latent_dim sweep: ld=128 flips Flow from harmful to helpful
- FlowRefiner hidden_dim sweep: hd=128 marginal improvement
- E2E TTT num_heads, mini_batch, learning_rate sweeps
- Key finding: default ld=64 was suboptimal
- Updated Limitations section with sweep caveat
Community Review — Non-record: 10L E2E TTT-Linear + FlowRefiner — val_bpb 1.1335 ± 0.0010 (4-seed mean)
BPB: 1.1335 | Compliance: LOOKS CLEAN — score-first-per-chunk TTT (legal #1416/#1423 pattern)
What I found in the code (head SHA …): The TTT path at line 1275 implements the score-first-per-chunk pattern: each chunk is scored under the pre-update adapter state. Per Issue #402 and Issue #677, TTT is legal when each token is scored before the adapter updates on it, and that is what the code does here, chunk by chunk.
CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.06s, dim=512, layers=11, vocab=1024, code=104955 B, SMOKE_TEST_PASS.
Verdict: LOOKS CLEAN. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending standard checks (3-seed validation, 16MB artifact cap, 10-min wallclock on 8×H100 SXM). The compliance picture matches the legal reference frontier and no flags were raised by the classification pass.
Auto-classification caveat: this review was drafted by the AST-based classifier against a template derived from manually-reviewed cluster PRs (#1420, #1450, #1487, #1541, #1529, #1533, #1518). If I've misread a subtlety in your eval path — e.g., multi-epoch TTT that I mistook for single-pass, or a target-in-key lookup I missed in a helper function — please flag it and I'll re-run the audit manually.
Reviewed by @MatoTeziTanka — The Agora. Classification via deterministic AST-based classifier.
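For context on the compliance pattern the review refers to, a minimal sketch of score-first-per-chunk TTT might look like the following; `base_model.loss()` and `ttt_adapter.update()` are placeholder APIs, not the code audited above.

```python
# Minimal sketch of the score-first-per-chunk pattern the review describes: each chunk
# is scored under the adapter state that existed before the adapter saw it, and only
# then does the adapter update on that chunk. base_model.loss() and ttt_adapter.update()
# are placeholder signatures.
import torch

def score_first_per_chunk(base_model, ttt_adapter, tokens: torch.Tensor, chunk: int = 64) -> float:
    total_nll, total_tokens = 0.0, 0
    for start in range(0, tokens.numel() - 1, chunk):
        piece = tokens[start:start + chunk + 1]       # chunk plus one token of overlap for targets
        n_targets = piece.numel() - 1
        # 1) Score under the current (pre-update) adapter state.
        with torch.no_grad():
            nll = base_model.loss(piece, adapter=ttt_adapter)   # mean NLL, placeholder signature
        total_nll += nll.item() * n_targets
        total_tokens += n_targets
        # 2) Only afterwards let the adapter update on the tokens it just scored.
        ttt_adapter.update(piece)                     # placeholder signature
    return total_nll / total_tokens
```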
Non-Record Submission: 10L E2E TTT-Linear + FlowRefiner (E2E was a README request)
val_bpb: 1.1335 ± 0.0010 (4-seed mean ± std, int6 sliding window, stride=64) | ~15.1 MB artifact | 2×A100 PCIe 40GB
Summary
10-layer transformer with end-to-end TTT-Linear refinement and a 1-step FlowRefiner, compressed to fit under the 16 MB artifact cap. The lightweight FlowRefiner is inspired in part by the FLOWR paper (arXiv:2504.10564), which uses learned flow-matching vector fields with Euler-style transport updates for efficient refinement; here we adapt that idea into a tiny hidden-state refiner rather than a pocket-conditioned 3D ligand generator. The result is a flow-flavored residual MLP, not true source→target distribution matching (which will be the subject of a later PR).
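To make the "flow-flavored residual MLP" framing concrete, a minimal 1-step refiner in that spirit could look like the sketch below; the layer names, default dims (latent_dim=64, hidden_dim=256), activation, and learnable step size are illustrative assumptions rather than the exact module shipped in this PR.

```python
# Minimal sketch of a 1-step, flow-flavored hidden-state refiner: a small MLP predicts
# a "velocity" for the hidden state and a single Euler-style step applies it as a
# residual. Dimensions, activation, and module names are assumptions.
import torch
import torch.nn as nn

class FlowRefinerSketch(nn.Module):
    def __init__(self, dim: int = 512, latent_dim: int = 64, hidden_dim: int = 256):
        super().__init__()
        self.down = nn.Linear(dim, latent_dim)
        self.velocity = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.LeakyReLU(0.5),
            nn.Linear(hidden_dim, latent_dim),
        )
        self.up = nn.Linear(latent_dim, dim)
        self.step = nn.Parameter(torch.tensor(0.1))     # learnable Euler step size

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        z = self.down(h)
        v = self.velocity(z)                 # predicted transport direction in latent space
        return h + self.step * self.up(v)    # one Euler-style residual update
```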
Key Results — 4-Seed Reproducibility
All 4 seeds completed successfully. All artifacts under 16MB cap.
Three-Variant Comparison (supplementary)
Prior 11L Ablations on the Same Refiner Pair
These are earlier supporting runs on the same E2E-TTT / FlowRefiner pair from experiments_pr549/ rather than fresh 10-layer ablations for the legal submission:

Synergy Note
In that earlier 11-layer study, FlowRefiner alone regressed after quantization, while the combined E2E-TTT + Flow model was best. The additive expectation from the isolated deltas is 1.12505247 BPB, whereas the actual combined run reached 1.12344104, a 0.00161 BPB improvement over additive expectation. We treat this as evidence that FlowRefiner is most useful when paired with TTT, while avoiding the claim that the same four-way ablation has already been rerun for the present 10-layer legal artifact.
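The synergy number quoted above is just the gap between the additive expectation and the combined run:

```python
# Arithmetic behind the synergy claim, using the numbers quoted in this section.
additive_expectation = 1.12505247   # BPB predicted if the isolated deltas simply added
combined_actual      = 1.12344104   # BPB of the combined E2E-TTT + FlowRefiner run
synergy = additive_expectation - combined_actual
print(f"{synergy:.5f}")             # 0.00161 BPB better than the additive expectation
```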
Refiner Hyperparameter Sweeps (11L, PR #549 Base) — NEW
FlowRefiner latent_dim sweep (hidden_dim=256 fixed):
Key finding: Increasing latent_dim from 64→128 flips FlowRefiner from harmful to helpful (−0.00106 vs baseline). The default ld=64 used in this submission was suboptimal.

FlowRefiner hidden_dim sweep (latent_dim=64 fixed):
E2E TTT-Linear sweeps (all within ±0.0006 BPB — defaults are near-optimal):
Takeaway: The most actionable finding is FlowRefiner ld=128, but the 10L submission was not retrained with these settings.
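For reference, the two FlowRefiner sweeps could be expressed as simple config grids like the sketch below; the dict layout and run_experiment() are placeholders, not this repo's actual sweep harness, and only the values quoted above are included.

```python
# Hypothetical shape of the two FlowRefiner sweeps: latent_dim varied with hidden_dim
# fixed at 256, then hidden_dim varied with latent_dim fixed at 64.
BASE = {"latent_dim": 64, "hidden_dim": 256}

latent_dim_sweep = [{**BASE, "latent_dim": ld} for ld in (64, 128)]
hidden_dim_sweep = [{**BASE, "hidden_dim": hd} for hd in (128, 256)]

# for cfg in latent_dim_sweep + hidden_dim_sweep:
#     run_experiment(cfg, base="experiments_pr549", seed=42)   # placeholder call
```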
Architecture
Credits
Built on PR #549 (abaybektursun) and contributions from PR #65 (aquariouseworkman), PR #69 (TevBenji), PR #187 (Idan3011), PR #265 / PR #374 (unnir), PR #315 (jfprincz), PR #77 (samacqua), PR #50 (mattqlf), PR #76 (unixmadtoonslab), and the modded-nanogpt baseline. The flow-inspired framing for the hidden-state refiner was also informed by FLOWR (Cremer et al., arXiv:2504.10564).
See README.md for the detailed writeup, provenance paths to the prior 11-layer ablation logs, sweep SLURM job IDs, and the supplementary variant comparison.