
Record: Depth-Recurrent UT + Rank-1 LoRA Per-Iteration Adaptation — val_bpb 1.3342 #1096

Draft

vimeto wants to merge 1 commit into openai:main from vimeto:pr/ut-rank1-lora

Conversation


@vimeto vimeto commented Mar 29, 2026

Record: Depth-Recurrent UT + Rank-1 LoRA Per-Iteration Adaptation — val_bpb 1.3342

val_bpb = 1.3342 (1 seed, additional seeds pending H100 access) | 11.39 MB | 8xH100 SXM

Results (8xH100 80GB SXM, PyTorch 2.7.1)

| Seed | step_avg | steps | sliding_bpb | Artifact (bytes) |
|------|----------|-------|-------------|------------------|
| 1337 | 125 ms   | 4,769 | 1.334       | 11,385,022       |
| 42   | pending  |       |             |                  |
| 2025 | pending  |       |             |                  |

Additional seeds pending H100 access.

Key Innovation: Rank-1 LoRA for Stable Per-Iteration Adaptation

Universal Transformer (1 prelude + 4 shared blocks x 3 loops + 1 coda = 14 effective layers from 6 unique blocks) at 640d, a width at which a flat (non-recurrent) transformer would not fit under the 16 MB cap (it would need roughly 18.2 MB).
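
For orientation, a minimal sketch of the weight-sharing loop described above (class name and block internals are placeholders, not the PR's actual architecture):

```python
import torch
import torch.nn as nn

class DepthRecurrentUT(nn.Module):
    """Sketch of the layer-sharing scheme: 1 prelude block, 4 shared blocks
    unrolled for 3 loops, and 1 coda block give 1 + 4*3 + 1 = 14 effective
    layers from only 6 unique blocks."""
    def __init__(self, dim: int = 640, n_shared: int = 4, n_loops: int = 3, n_heads: int = 8):
        super().__init__()
        def block():
            return nn.TransformerEncoderLayer(dim, n_heads, dim_feedforward=4 * dim, batch_first=True)
        self.prelude = block()
        self.shared = nn.ModuleList(block() for _ in range(n_shared))
        self.coda = block()
        self.n_loops = n_loops

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.prelude(x)
        for _ in range(self.n_loops):      # the same 4 shared blocks are reused each loop
            for blk in self.shared:
                x = blk(x)
        return self.coda(x)
```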

Each loop iteration gets a unique rank-1 weight modification via outer product of two learned vectors (on AdamW, not Muon):

delta_W = torch.outer(b, a)          # rank-1 matrix of shape (out_dim, in_dim), ~9K total params
W_effective = W_shared + delta_W     # different delta at each loop iteration
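
A minimal sketch of how such per-iteration vectors could hang off one shared projection (class, attribute names, and init are illustrative, not the PR's exact implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Rank1PerIterLinear(nn.Module):
    """Per-iteration rank-1 adaptation of one shared weight matrix: a separate
    (a, b) vector pair per loop iteration, combined by outer product."""
    def __init__(self, in_dim: int = 640, out_dim: int = 640, n_loops: int = 3):
        super().__init__()
        self.W_shared = nn.Parameter(torch.randn(out_dim, in_dim) * in_dim ** -0.5)  # 2D matrix
        # 1D vectors, one pair per loop iteration; b is zero-init so delta_W starts at zero
        self.a = nn.ParameterList(nn.Parameter(torch.randn(in_dim) * 0.01) for _ in range(n_loops))
        self.b = nn.ParameterList(nn.Parameter(torch.zeros(out_dim)) for _ in range(n_loops))

    def forward(self, x: torch.Tensor, loop_idx: int) -> torch.Tensor:
        delta_W = torch.outer(self.b[loop_idx], self.a[loop_idx])   # rank-1 (out_dim, in_dim)
        return F.linear(x, self.W_shared + delta_W)                 # W_effective = W_shared + delta_W
```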

This is the first stable per-iteration adaptation for recurrent transformers in this competition. We conducted 8 failed training runs with rank-8 LoRA before discovering the root cause.

Why Rank-8 LoRA Diverges

Muon's Newton-Schulz applies scale = sqrt(rows/cols) per parameter. For rank-8 LoRA B matrices (576x8), scale = sqrt(72) = 8.49x. This amplifies B updates 8.5x relative to A, creating a positive feedback loop that diverges after ~1500 steps.
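
To make the imbalance concrete, a sketch of the stated scale formula (the exact Muon variant in the repo may differ):

```python
import math

def muon_shape_scale(rows: int, cols: int) -> float:
    # scale = sqrt(rows / cols), as described above for Muon's post-Newton-Schulz rescaling
    return math.sqrt(rows / cols)

print(muon_shape_scale(576, 8))    # ~8.49: rank-8 LoRA B matrix (576x8), update amplified ~8.5x
print(muon_shape_scale(8, 576))    # ~0.12: damped, assuming A has the transposed shape (8, 576)
print(muon_shape_scale(576, 576))  # 1.0: square shared weights are unaffected
```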

Rank-1 fix: Rank-1 LoRA params are 1D vectors, not 2D matrices. Vectors go to AdamW (no Muon scale). Problem eliminated.
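
A sketch of the routing rule this implies, splitting parameters by dimensionality (the Muon constructor is assumed from the speedrun codebase and left commented out):

```python
import torch
import torch.nn as nn

# Stand-in parameters for one shared block plus its rank-1 LoRA vectors.
params = nn.ParameterDict({
    "W_shared": nn.Parameter(torch.randn(640, 640)),    # 2D matrix -> Muon (Newton-Schulz + shape scale)
    "lora_a":   nn.Parameter(torch.randn(640) * 0.01),  # 1D vector -> AdamW (no sqrt(rows/cols) factor)
    "lora_b":   nn.Parameter(torch.zeros(640)),         # 1D vector -> AdamW
})

matrix_params = [p for p in params.values() if p.ndim >= 2]
vector_params = [p for p in params.values() if p.ndim < 2]

adamw = torch.optim.AdamW(vector_params, lr=3e-4)
# muon = Muon(matrix_params, lr=0.02)  # Muon optimizer class assumed from the speedrun repo
```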

| Attempt | Fix | Result |
|---------|-----|--------|
| v1-v2 | Muon, various LR | Diverged at step 1500 |
| v3 | AdamW for LoRA | Too slow, diverged at step 3000 |
| v4-v5 | Grad scaling + warmup | Diverged at step 3500 |
| v6 | Muon scale=1.0 override | Diverged at step 4000 |
| v8 (this) | Rank-1 vectors on AdamW | Stable, 1.334 BPB |

Stability Techniques

  • Output-LN (Peri-LN, arXiv:2502.02732) on shared blocks (see the sketch after this list)
  • Birkhoff-constrained mixing (sigmoid, spectral norm <= 1)
  • Capped timestep scaling (per-effective-layer, FP16 passthrough)
  • Noisy QAT (INT6-calibrated noise on shared weights)
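
A sketch of the first item, Output-LN on the shared blocks in the Peri-LN style (sublayer internals are placeholders; this is not the PR's block):

```python
import torch
import torch.nn as nn

class PeriLNBlock(nn.Module):
    """Peri-LN-style residual block: the sublayer output is normalized before it
    is added back to the residual stream, in addition to the usual pre-norm."""
    def __init__(self, dim: int = 640):
        super().__init__()
        self.in_norm = nn.LayerNorm(dim)
        self.out_norm = nn.LayerNorm(dim)   # the "Output-LN" applied to shared blocks
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.out_norm(self.mlp(self.in_norm(x)))
```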

Artifact: Only 11.39 MB (4.61 MB free)

The 640d recurrent model uses only 11.39 MB — leaving 4.61 MB for potential n-gram cache integration.

Credits

@MatoTeziTanka

Community Review — Record: Depth-Recurrent UT + Rank-1 LoRA Per-Iteration Adaptation — val_bpb 1.3342

BPB: 1.3342 | Compliance: LOOKS CLEAN — score-first-per-chunk TTT (legal #1416/#1423 pattern)

What I found in the code (head SHA 35488484b1cc, file records/track_10min_16mb/2026-03-30_UT_Rank1LoRA_OutputLN_Birkhoff/train_gpt.py):

The TTT path at line 1258 implements the score-first-per-chunk pattern: each chunk is scored under torch.no_grad() / inference_mode() before the base_model.train() + SGD adaptation runs on that same chunk, with an is_last_chunk guard so the final chunk gets no adaptation pass. This is the structural shape the legal frontier uses (PRs #1416 erichroepke, #1423 aryanbhosale).

Per Issue #402 and Issue #677, TTT is legal when each token is scored before the adapter updates on it, and that's what the code does here — chunk ci is scored under weights adapted only on chunks 0..ci-1. No prequant_ttt_adapt_adamw(val_tokens, ...) multi-epoch fine-tune, no scored-region SLOT, no target-in-key n-gram cache.
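
For reference, a minimal sketch of that structural shape (helper names and the loss interface are illustrative, not the PR's actual code):

```python
import torch

def score_first_per_chunk(model, chunks, adapter_opt):
    """Score each chunk before adapting on it; the final chunk gets no adaptation pass."""
    losses = []
    for ci, chunk in enumerate(chunks):
        model.eval()
        with torch.inference_mode():            # scored under weights adapted only on chunks 0..ci-1
            losses.append(model(chunk).item())  # assumes model(chunk) returns a scalar loss
        if ci < len(chunks) - 1:                # is_last_chunk guard
            model.train()
            adapter_opt.zero_grad()
            model(chunk).backward()             # adapt (e.g. SGD) on the chunk just scored
            adapter_opt.step()
    return losses
```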

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.06s, dim=640, layers=6, vocab=1024, code=110742 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending standard checks (3-seed validation, 16MB artifact cap, 10-min wallclock on 8×H100 SXM). The compliance picture matches the legal reference frontier and no flags were raised by the classification pass.

Auto-classification caveat: this review was drafted by the AST-based classifier against a template derived from manually-reviewed cluster PRs (#1420, #1450, #1487, #1541, #1529, #1533, #1518). If I've misread a subtlety in your eval path — e.g., multi-epoch TTT that I mistook for single-pass, or a target-in-key lookup I missed in a helper function — please flag it and I'll re-run the audit manually.


Reviewed by @MatoTeziTanka / The Agora. Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting — if the template misread your code, please call it out so I can iterate the classifier.
