Non-record: 10L E2E TTT-Linear + FlowRefiner — val_bpb 1.1335 ± 0.0010 (4-seed mean)#1166
Conversation
- 10-layer, 512D, E2E TTT-Linear + 1-step FlowRefiner
- val_bpb 1.13472408 (int6 sliding window, stride=64, seed=42)
- Artifact: 15,199,107 bytes (800K headroom under 16MB cap)
- BigramHash(1536), LeakyReLU(0.5)², mixed int6/int8 + lzma
- Includes three-variant size-quality comparison (11L/10L/int5)
- Trained on 2×A100 PCIe 40GB, 7185 steps, ~2.2 hours
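For readers unfamiliar with the packing step, the sketch below shows roughly what mixed-precision quantization plus lzma packing can look like; the tensor names, rounding scheme, and payload layout are illustrative assumptions, not the actual artifact format used here.

```python
# Illustrative sketch only: symmetric per-tensor quantization to int6/int8
# followed by lzma compression, with a check against the 16 MB artifact cap.
# Tensor names, rounding scheme, and payload layout are assumptions,
# not this PR's actual artifact format.
import lzma
import numpy as np

def quantize_symmetric(w: np.ndarray, bits: int) -> tuple[np.ndarray, float]:
    """Round to signed `bits`-bit integers with a single per-tensor scale."""
    qmax = 2 ** (bits - 1) - 1                      # 31 for int6, 127 for int8
    scale = float(np.abs(w).max()) / qmax or 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def pack_artifact(tensors: dict, bits_per_tensor: dict) -> bytes:
    """Quantize each tensor and lzma-compress the concatenated payload."""
    payload = bytearray()
    for name, w in tensors.items():
        q, scale = quantize_symmetric(w, bits_per_tensor.get(name, 6))
        payload += np.float32(scale).tobytes() + q.tobytes()
    return lzma.compress(bytes(payload), preset=9 | lzma.PRESET_EXTREME)

# blob = pack_artifact(weights, {"embed": 8})       # e.g. int8 embeddings, int6 elsewhere
# assert len(blob) < 16 * 2**20                     # stay under the 16 MB cap
```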
Seeds 42, 99, 1337, 2025 all completed successfully on 2×A100 PCIe 40GB. Mean sliding-window BPB: 1.13353 ± 0.00095 (4-seed std). Range: [1.13269, 1.13472]. All artifacts under 16MB cap (15.1-15.2 MB). Includes training logs and SLURM scripts for all seeds in supplementary/.
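As a rough illustration of the eval protocol, here is a minimal sliding-window BPB sketch with stride=64; the window size, model interface, and one-token-per-byte conversion are placeholders rather than the harness that produced the numbers above.

```python
# Minimal sketch of a stride-64 sliding-window eval, assuming the usual convention of
# scoring only the tokens not yet covered by an earlier window, each under full left
# context. `model`, `window`, and the byte-level assumption are placeholders.
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_bpb(model, tokens: torch.Tensor, window: int = 512, stride: int = 64) -> float:
    """tokens: 1-D LongTensor of ids for the validation split."""
    total_nll, total_scored, prev_end = 0.0, 0, 0
    for begin in range(0, tokens.numel() - 1, stride):
        end = min(begin + window, tokens.numel() - 1)
        inp = tokens[begin:end].unsqueeze(0)          # (1, T) inputs
        tgt = tokens[begin + 1:end + 1]               # (T,) next-token targets
        logits = model(inp)[0]                        # (T, vocab)
        n_new = end - prev_end                        # positions not scored by earlier windows
        total_nll += F.cross_entropy(logits[-n_new:], tgt[-n_new:], reduction="sum").item()
        total_scored += n_new
        prev_end = end
        if end >= tokens.numel() - 1:
            break
    # nats per token -> bits, assuming one token per byte
    return total_nll / total_scored / math.log(2)
```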
- FlowRefiner latent_dim sweep: ld=128 flips Flow from harmful to helpful
- FlowRefiner hidden_dim sweep: hd=128 marginal improvement
- E2E TTT num_heads, mini_batch, learning_rate sweeps
- Key finding: default ld=64 was suboptimal
- Updated Limitations section with sweep caveat
Community Review — Non-record: 10L E2E TTT-Linear + FlowRefiner — val_bpb 1.1335 ± 0.0010 (4-seed mean)
BPB: 1.1335 | Compliance: LOOKS CLEAN — score-first-per-chunk TTT (legal #1416/#1423 pattern)
What I found in the code (head SHA …): The TTT path at line 1275 implements the score-first-per-chunk pattern: each chunk is scored under the pre-update adapter state. Per Issue #402 and Issue #677, TTT is legal when each token is scored before the adapter updates on it, and that is what the code does here, chunk by chunk.
CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.06s, dim=512, layers=11, vocab=1024, code=104955 B, SMOKE_TEST_PASS.
Verdict: LOOKS CLEAN. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending standard checks (3-seed validation, 16MB artifact cap, 10-min wallclock on 8×H100 SXM). The compliance picture matches the legal reference frontier and no flags were raised by the classification pass.
Auto-classification caveat: this review was drafted by the AST-based classifier against a template derived from manually-reviewed cluster PRs (#1420, #1450, #1487, #1541, #1529, #1533, #1518). If I've misread a subtlety in your eval path — e.g., multi-epoch TTT that I mistook for single-pass, or a target-in-key lookup I missed in a helper function — please flag it and I'll re-run the audit manually.
Reviewed by @MatoTeziTanka — The Agora. Classification via deterministic AST-based classifier.
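For context on the compliance pattern the review refers to, a minimal sketch of score-first-per-chunk TTT might look like the following; `base_model.loss()` and `ttt_adapter.update()` are placeholder APIs, not the code audited above.

```python
# Minimal sketch of the score-first-per-chunk pattern the review describes: each chunk
# is scored under the adapter state that existed before the adapter saw it, and only
# then does the adapter update on that chunk. base_model.loss() and ttt_adapter.update()
# are placeholder signatures.
import torch

def score_first_per_chunk(base_model, ttt_adapter, tokens: torch.Tensor, chunk: int = 64) -> float:
    total_nll, total_tokens = 0.0, 0
    for start in range(0, tokens.numel() - 1, chunk):
        piece = tokens[start:start + chunk + 1]       # chunk plus one token of overlap for targets
        n_targets = piece.numel() - 1
        # 1) Score under the current (pre-update) adapter state.
        with torch.no_grad():
            nll = base_model.loss(piece, adapter=ttt_adapter)   # mean NLL, placeholder signature
        total_nll += nll.item() * n_targets
        total_tokens += n_targets
        # 2) Only afterwards let the adapter update on the tokens it just scored.
        ttt_adapter.update(piece)                     # placeholder signature
    return total_nll / total_tokens
```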
Non-Record Submission: 10L E2E TTT-Linear + FlowRefiner (E2E was a README request)
val_bpb: 1.1335 ± 0.0010 (4-seed mean ± std, int6 sliding window, stride=64) | ~15.1 MB artifact | 2×A100 PCIe 40GB
Summary
10-layer transformer with end-to-end TTT-Linear refinement and a 1-step FlowRefiner, compressed to fit under the 16 MB artifact cap. The lightweight FlowRefiner is inspired in part by the FLOWR paper (arXiv:2504.10564), which uses learned flow-matching vector fields with Euler-style transport updates for efficient refinement; here we adapt that idea into a tiny hidden-state refiner rather than a pocket-conditioned 3D ligand generator. The result is a flow-flavored residual MLP, not true source→target distribution matching (which will be the subject of a later PR).
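To make the "flow-flavored residual MLP" framing concrete, a minimal 1-step refiner in that spirit could look like the sketch below; the layer names, default dims (latent_dim=64, hidden_dim=256), activation, and learnable step size are illustrative assumptions rather than the exact module shipped in this PR.

```python
# Minimal sketch of a 1-step, flow-flavored hidden-state refiner: a small MLP predicts
# a "velocity" for the hidden state and a single Euler-style step applies it as a
# residual. Dimensions, activation, and module names are assumptions.
import torch
import torch.nn as nn

class FlowRefinerSketch(nn.Module):
    def __init__(self, dim: int = 512, latent_dim: int = 64, hidden_dim: int = 256):
        super().__init__()
        self.down = nn.Linear(dim, latent_dim)
        self.velocity = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.LeakyReLU(0.5),
            nn.Linear(hidden_dim, latent_dim),
        )
        self.up = nn.Linear(latent_dim, dim)
        self.step = nn.Parameter(torch.tensor(0.1))     # learnable Euler step size

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        z = self.down(h)
        v = self.velocity(z)                 # predicted transport direction in latent space
        return h + self.step * self.up(v)    # one Euler-style residual update
```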
Key Results — 4-Seed Reproducibility
All 4 seeds completed successfully. All artifacts under 16MB cap.
Three-Variant Comparison (supplementary)
Prior 11L Ablations on the Same Refiner Pair
These are earlier supporting runs on the same E2E-TTT / FlowRefiner pair from experiments_pr549/ rather than fresh 10-layer ablations for the legal submission:

Synergy Note
In that earlier 11-layer study, FlowRefiner alone regressed after quantization, while the combined E2E-TTT + Flow model was best. The additive expectation from the isolated deltas is 1.12505247 BPB, whereas the actual combined run reached 1.12344104, a 0.00161 BPB improvement over additive expectation. We treat this as evidence that FlowRefiner is most useful when paired with TTT, while avoiding the claim that the same four-way ablation has already been rerun for the present 10-layer legal artifact.
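The synergy number quoted above is just the gap between the additive expectation and the combined run:

```python
# Arithmetic behind the synergy claim, using the numbers quoted in this section.
additive_expectation = 1.12505247   # BPB predicted if the isolated deltas simply added
combined_actual      = 1.12344104   # BPB of the combined E2E-TTT + FlowRefiner run
synergy = additive_expectation - combined_actual
print(f"{synergy:.5f}")             # 0.00161 BPB better than the additive expectation
```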
Refiner Hyperparameter Sweeps (11L, PR #549 Base) — NEW
FlowRefiner latent_dim sweep (hidden_dim=256 fixed):
Key finding: Increasing latent_dim from 64→128 flips FlowRefiner from harmful to helpful (−0.00106 vs baseline). The default ld=64 used in this submission was suboptimal.

FlowRefiner hidden_dim sweep (latent_dim=64 fixed):
E2E TTT-Linear sweeps (all within ±0.0006 BPB — defaults are near-optimal):
Takeaway: The most actionable finding is FlowRefiner ld=128, but the 10L submission was not retrained with these settings.
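For reference, the two FlowRefiner sweeps could be expressed as simple config grids like the sketch below; the dict layout and run_experiment() are placeholders, not this repo's actual sweep harness, and only the values quoted above are included.

```python
# Hypothetical shape of the two FlowRefiner sweeps: latent_dim varied with hidden_dim
# fixed at 256, then hidden_dim varied with latent_dim fixed at 64.
BASE = {"latent_dim": 64, "hidden_dim": 256}

latent_dim_sweep = [{**BASE, "latent_dim": ld} for ld in (64, 128)]
hidden_dim_sweep = [{**BASE, "hidden_dim": hd} for hd in (128, 256)]

# for cfg in latent_dim_sweep + hidden_dim_sweep:
#     run_experiment(cfg, base="experiments_pr549", seed=42)   # placeholder call
```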
Architecture
Credits
Built on PR #549 (abaybektursun) and contributions from PR #65 (aquariouseworkman), PR #69 (TevBenji), PR #187 (Idan3011), PR #265 / PR #374 (unnir), PR #315 (jfprincz), PR #77 (samacqua), PR #50 (mattqlf), PR #76 (unixmadtoonslab), and the modded-nanogpt baseline. The flow-inspired framing for the hidden-state refiner was also informed by FLOWR (Cremer et al., arXiv:2504.10564).
See README.md for the detailed writeup, provenance paths to the prior 11-layer ablation logs, sweep SLURM job IDs, and the supplementary variant comparison.