
Record: Alpha-Scaled LoRA + Warm-start A + WD 1.0 — val_bpb 1.07266 (3-seed mean) #1765

Open

renqianluo wants to merge 1 commit into openai:main from renqianluo:record/alpha-warm-wd-1.07266

Conversation

@renqianluo

Summary

Three composable, small-LOC novel changes on top of @dexhunter's 1.07193 phased-TTT pipeline. All three modify only the LoRA adapter's init / forward / weight-decay — everything else (VarLen attention, Fused MLP, multi-phase global SGD, trimmed GPTQ, int7 embeddings, triple depth recurrence) is unchanged.

(1) Alpha/rank output scaling on BatchedLinearLoRA

Prior phased-TTT code uses forward(x) = (x @ A.T) @ B.T with no rank scaling. Raising TTT_LORA_RANK from 96 → 128 on that code diverges on seeds 314 and 1337 (TTT BPB collapses to ~1.133) while working on seed 42. Adding the standard alpha/rank factor decouples rank from effective LR and makes rank 128 stable on all three seeds.
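For reference, a minimal single-layer sketch of the scaled forward. Only the forward formula `(x @ A.T) @ B.T` and the alpha/rank factor come from this PR; the init scheme and the alpha default here are assumptions (the PR does not state alpha).

```python
import torch
import torch.nn as nn

class BatchedLinearLoRA(nn.Module):
    """Sketch of the scaled LoRA forward (single layer, unbatched for clarity)."""

    def __init__(self, d_in: int, d_out: int, rank: int = 128, alpha: float = 128.0):
        super().__init__()
        # Assumed init: random A, zero B, so the adapter starts as a no-op.
        self.A = nn.Parameter(torch.randn(rank, d_in) / d_in**0.5)
        self.B = nn.Parameter(torch.zeros(d_out, rank))
        # Fixed alpha divided by rank: raising TTT_LORA_RANK no longer
        # inflates the adapter output (and hence the effective LR).
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # before: (x @ A.T) @ B.T              -- output magnitude grows with rank
        # after:  (x @ A.T) @ B.T * alpha/rank -- magnitude invariant to rank
        return (x @ self.A.T) @ self.B.T * self.scale
```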

(2) Warm-start LoRA A across batches

Previously, A was re-randomized every batch. Keeping A warm (only B resets to zero) lets the A matrix accumulate feature directions across the ~780 phased-TTT batches. In isolation this helps seeds 42/1337 but regresses on seed 314 (A drifts into overfitting that seed's doc ordering).
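A sketch of the per-batch reset under this change, using the `BatchedLinearLoRA` sketch above, with the baseline behavior shown for contrast. The function and argument names (`reset_adapter`, `warm_start_A`) are illustrative, not the pipeline's actual code.

```python
import torch

def reset_adapter(lora, warm_start_A: bool = True) -> None:
    """Per-batch adapter reset (illustrative names)."""
    with torch.no_grad():
        # B always returns to zero, so every batch starts from a no-op adapter.
        lora.B.zero_()
        if not warm_start_A:
            # Baseline: re-randomize A each batch (assumed init scale).
            lora.A.normal_(std=lora.A.shape[1] ** -0.5)
        # Warm start: A is left untouched and accumulates feature
        # directions across the ~780 phased-TTT batches.
```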

(3) Raised TTT weight decay 0.5 → 1.0

Introduced specifically to counteract the across-batch A overfit enabled by (2). On seed 314 it restores parity with the rank-96 baseline (1.07200 → 1.07203); on 42/1337 the bulk of the warm-start gain is preserved.
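In code this is a one-line hyperparameter change on the TTT inner optimizer. The optimizer choice (plain SGD, in line with the pipeline's multi-phase global SGD) and the LR value below are assumptions; only the 0.5 → 1.0 decay is from this PR.

```python
import torch

TTT_WEIGHT_DECAY = 1.0  # raised from 0.5 in this PR

# Assumed inner-loop setup; dims and LR are placeholders. The L2 decay
# shrinks the warm-started A a little every step, bounding how far A can
# drift into seed-specific overfit across batches.
lora = BatchedLinearLoRA(d_in=768, d_out=768)  # sketch class from change (1)
ttt_opt = torch.optim.SGD(
    [lora.A, lora.B],
    lr=0.02,
    weight_decay=TTT_WEIGHT_DECAY,
)
```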

Results

| Seed | rank-96 baseline | + alpha (rank 128) | + warm-A + WD=1.0 (Δ vs baseline) |
|------|------------------|--------------------|-----------------------------------|
| 1337 | 1.07423 | 1.07379 | 1.07298 (−0.00125) |
| 42   | 1.07341 | 1.07320 | 1.07298 (−0.00043) |
| 314  | 1.07214 | 1.07200 | 1.07203 (−0.00011) |
| Mean | 1.07326 | 1.07300 | 1.07266 (−0.00060) |

All three seeds improve or stay flat vs the rank-96 reproduction.

Compliance

All runs under 600s train, under 600s eval, <16MB artifact. Issue #1017 conditions 1–4 verified (causal, full normalized distribution, score-before-update, single pass).

Commit message: Alpha-Scaled LoRA + Warm-start A + WD 1.0 — val_bpb 1.07266 (3-seed mean)

Three composable novel changes on top of dexhunter's phased-TTT code:
1. Alpha/rank LoRA scaling enables stable higher rank (128 vs 96)
2. Warm-start LoRA A across batches lets feature directions accumulate
3. Raised TTT weight decay (0.5 -> 1.0) prevents warm-A overfit on seed 314

3-seed mean 1.07266 BPB (seeds 1337, 42, 314). Every seed improves or ties
vs the rank-96 reproduction baseline.

Inherits samacqua's VarLen+FusedMLP (PR openai#1530), romeerp's phased TTT (PR openai#1610),
bigbag's triple recurrence + parallel residuals (PR openai#1493), EthanYangTW's
parameter banking (PR openai#1523), dexhunter's multi-phase global SGD + trimmed GPTQ.
