Record: Depth-Recurrent UT + Rank-1 LoRA Per-Iteration Adaptation — val_bpb 1.3342 #1096
vimeto wants to merge 1 commit into openai:main from
Conversation
Community Review — Record: Depth-Recurrent UT + Rank-1 LoRA Per-Iteration Adaptation — val_bpb 1.3342

BPB: 1.3342 | Compliance: LOOKS CLEAN — score-first-per-chunk TTT (legal #1416/#1423 pattern)

What I found in the code (head SHA …): the TTT path at line 1258 implements the score-first-per-chunk pattern: each chunk is scored under the adapter state from before that chunk's update. Per Issue #402 and Issue #677, TTT is legal when each token is scored before the adapter updates on it, and that's what the code does here — chunk-level granularity satisfies the rule.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.06s, dim=640, layers=6, vocab=1024, code=110742 B, SMOKE_TEST_PASS.

Verdict: LOOKS CLEAN. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending standard checks (3-seed validation, 16 MB artifact cap, 10-min wallclock on 8×H100 SXM). The compliance picture matches the legal reference frontier and no flags were raised by the classification pass.

Auto-classification caveat: this review was drafted by the AST-based classifier against a template derived from manually reviewed cluster PRs (#1420, #1450, #1487, #1541, #1529, #1533, #1518). If I've misread a subtlety in your eval path — e.g., multi-epoch TTT that I mistook for single-pass, or a target-in-key lookup I missed in a helper function — please flag it and I'll re-run the audit manually.

Reviewed by @MatoTeziTanka — The Agora. Classification via deterministic AST-based classifier.
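For readers unfamiliar with the pattern being audited, here is a minimal sketch of score-first-per-chunk test-time training under assumed names (`eval_with_ttt`, `adapter_opt`, and the chunk iterator are illustrative, not taken from the PR):

```python
import torch
import torch.nn.functional as F

def eval_with_ttt(model, chunks, adapter_opt):
    """Score-first-per-chunk TTT: each chunk is scored under the adapter
    state from before that chunk's update, then the adapter may update
    on that same chunk exactly once (single pass, no re-scoring)."""
    total_nll, total_tokens = 0.0, 0
    for inputs, targets in chunks:
        # 1) Score this chunk with the current (pre-update) adapter weights.
        with torch.no_grad():
            logits = model(inputs)
            nll = F.cross_entropy(logits.view(-1, logits.size(-1)),
                                  targets.view(-1), reduction="sum")
        total_nll += nll.item()
        total_tokens += targets.numel()

        # 2) Only afterwards adapt on the same chunk.
        adapter_opt.zero_grad(set_to_none=True)
        logits = model(inputs)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)),
                               targets.view(-1))
        loss.backward()
        adapter_opt.step()

    return total_nll / total_tokens  # mean NLL per token (nats)
```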
Record: Depth-Recurrent UT + Rank-1 LoRA Per-Iteration Adaptation — val_bpb 1.3342
val_bpb = 1.3342 (1 seed, additional seeds pending H100 access) | 11.39 MB | 8xH100 SXM
Results (8xH100 80GB SXM, PyTorch 2.7.1)
Additional seeds pending H100 access.
Key Innovation: Rank-1 LoRA for Stable Per-Iteration Adaptation
Universal Transformer (1 prelude + 4 shared x 3 loops + 1 coda = 14 effective layers from 6 unique blocks) at 640d — a dimension that flat transformers cannot fit in 16 MB (would be 18.2 MB).
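A rough sketch of that weight-tied layout; the block internals are placeholders, and `DepthRecurrentUT` / `Block` are illustrative names rather than the PR's actual classes:

```python
import torch.nn as nn

class Block(nn.Module):
    """Placeholder block; real attention/MLP internals are omitted."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Linear(dim, dim)

    def forward(self, x, loop_idx=0):
        return x + self.mlp(self.norm(x))

class DepthRecurrentUT(nn.Module):
    """1 prelude + 4 shared blocks x 3 loops + 1 coda = 14 effective layers
    from only 6 unique blocks' worth of parameters."""
    def __init__(self, dim=640, n_shared=4, n_loops=3):
        super().__init__()
        self.prelude = Block(dim)
        self.shared = nn.ModuleList(Block(dim) for _ in range(n_shared))
        self.coda = Block(dim)
        self.n_loops = n_loops

    def forward(self, x):
        x = self.prelude(x)
        for loop in range(self.n_loops):      # same 4 blocks, reused 3 times
            for block in self.shared:
                x = block(x, loop_idx=loop)   # loop_idx selects the per-iteration LoRA
        return self.coda(x)
```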
Each loop iteration gets a unique rank-1 weight modification via outer product of two learned vectors (on AdamW, not Muon):
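A minimal sketch of the idea, with illustrative names and init scale (not the PR's actual code): each shared linear carries one pair of learned 1D vectors per loop iteration, and their outer product is added to the shared weight.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Rank1LoopLinear(nn.Module):
    """Linear layer whose weight gets a distinct rank-1 delta per loop
    iteration: W_eff(i) = W + a_i b_i^T, with a_i and b_i learned 1D vectors."""
    def __init__(self, in_features, out_features, n_loops=3, init_scale=1e-2):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        # a is zero-initialized so training starts from the plain shared weight;
        # both a and b are 1D per iteration and are assumed to be routed to AdamW.
        self.a = nn.Parameter(torch.zeros(n_loops, out_features))
        self.b = nn.Parameter(torch.randn(n_loops, in_features) * init_scale)

    def forward(self, x, loop_idx):
        delta = torch.outer(self.a[loop_idx], self.b[loop_idx])  # (out, in)
        return F.linear(x, self.base.weight + delta)
```

Zero-initializing `a` makes every loop iteration start as an exact copy of the shared block, so the per-iteration deltas only grow as training warrants.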
This is the first stable per-iteration adaptation for recurrent transformers in this competition. Eight training runs with rank-8 LoRA diverged before we discovered the root cause.
Why Rank-8 LoRA Diverges
Muon's Newton-Schulz step applies a per-parameter scale of sqrt(rows/cols). For rank-8 LoRA B matrices (576x8), scale = sqrt(576/8) = sqrt(72) ≈ 8.49x. This amplifies B updates ~8.5x relative to A, creating a positive feedback loop that diverges after ~1500 steps.

Rank-1 fix: rank-1 LoRA params are 1D vectors, not 2D matrices. Vectors go to AdamW (no Muon scale). Problem eliminated.
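A hedged sketch of that routing rule (`muon_cls` stands in for whatever Muon implementation the repo uses; its constructor signature and the learning rates here are assumptions, and only the grouping logic is the point):

```python
import torch

def build_optimizers(model, muon_cls, muon_lr=0.02, adamw_lr=3e-4):
    """Route parameters by dimensionality: 2D matrices go to Muon, whose
    Newton-Schulz update is rescaled by sqrt(rows/cols) per parameter;
    1D vectors (norms, biases, and the rank-1 LoRA vectors) go to AdamW,
    so they never see that scale."""
    matrices = [p for p in model.parameters() if p.ndim >= 2]
    vectors = [p for p in model.parameters() if p.ndim < 2]
    muon = muon_cls(matrices, lr=muon_lr)        # constructor signature assumed
    adamw = torch.optim.AdamW(vectors, lr=adamw_lr)
    return muon, adamw

# Failure mode this avoids: a rank-8 LoRA B matrix of shape (576, 8) would get
# scale = sqrt(576 / 8) ≈ 8.49, amplifying B's updates ~8.5x relative to A.
```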
Stability Techniques
Artifact: Only 11.39 MB (4.61 MB free)
The 640d recurrent model uses only 11.39 MB — leaving 4.61 MB for potential n-gram cache integration.
Credits