RecurLoRA: Quantization-Stable Shallow Recurrence with Low-Rank Corrective Adapters #1181
Tanush1912 wants to merge 3 commits into openai:main
Conversation
Novel contribution: shallow recurrence (layers 4 and 5 repeated once each) with rank-2 LoRA corrections on the attention projections, RMSNorm before each repeat, and learnable alpha scaling. 13 virtual layers from 11 physical layers at 28KB (0.18%) parameter overhead.

Hyperparameter changes from the PR openai#1179 base (1.1105 BPB):
- NEGATIVE_SLOPE: 0.5 -> 0.9 (validated +0.013 BPB in issue openai#140)
- QK_GAIN_INIT: 1.5 -> 4.0 (validated +0.006 BPB in PR openai#1176)
- TTT_ENABLED: 1 (score-first, legal variant)
- WARMDOWN_ITERS: 4000 (extended from 3500)
- BIGRAM_DIM: 160 (from 112)

Status: WIP - awaiting compute for 3-seed validation runs.
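As a back-of-envelope sanity check on the stated overhead, the numbers below assume d_model=512 (from the smoke test), rank-2 adapters on all 4 attention projections of the 2 repeated layers, and fp16 storage; the exact projection set and dtype are assumptions, not confirmed by the PR.

```python
# Rough check of the adapter overhead. Assumed (not stated in the PR):
# d_model=512, rank=2, adapters on 4 attention projections (q, k, v, o)
# of each of the 2 repeated layers.
d_model, rank, n_proj, n_repeated = 512, 2, 4, 2

# Each LoRA pair contributes A (rank x d_model) plus B (d_model x rank).
lora_params = n_repeated * n_proj * (rank * d_model + d_model * rank)
print(lora_params)  # 16384 adapter parameters under these assumptions

# Reported overhead ratio: 28 KB against the 16 MB artifact cap.
ratio = (28 * 1024) / (16 * 1024 * 1024)
print(f"{ratio:.2%}")  # ~0.17%, consistent with the quoted 0.18%
```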
Both the A and B matrices are now initialized from N(0, 1e-3) instead of one being zero. This ensures all LoRA parameters receive gradients from step 1, which is critical in a 600s training budget where delayed activation wastes precious optimization steps. The alpha default is raised from 0.4 to 0.6 to amplify the early correction signal.
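A minimal numpy sketch of this warm-init scheme (shapes and names are illustrative, not the submission's actual code):

```python
import numpy as np

def init_lora(d_model=512, rank=2, std=1e-3, alpha=0.6, seed=0):
    """Warm-init: both A and B drawn from N(0, std), so the correction
    alpha * (x @ A.T @ B.T) is nonzero from the first step and every
    LoRA parameter receives gradient immediately."""
    rng = np.random.default_rng(seed)
    A = rng.normal(0.0, std, size=(rank, d_model))   # down-projection
    B = rng.normal(0.0, std, size=(d_model, rank))   # up-projection
    return A, B, alpha

def lora_correction(x, A, B, alpha):
    # x: (tokens, d_model); additive correction to a frozen projection's output
    return alpha * (x @ A.T @ B.T)

A, B, alpha = init_lora()
delta = lora_correction(np.ones((4, 512)), A, B, alpha)
# Nonzero, unlike zero-init B, which zeroes the gradient to A at step 1
print(np.abs(delta).max() > 0)
```

The design point here is that with the conventional B=0 cold start, the gradient with respect to A is zero until B moves off zero, so a short training budget loses steps before the adapter contributes.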
- Rename submission folder to RecurLoRA_Slope09_QKGain4_TTT
- Rewrite README: lead with architectural contribution, add scaling hypothesis, constraint-aware framing, prior failure table
- Fix LoRA gradient flow description (warm-init, not cold-start)
- Update submission.json title to match
Community Review: RecurLoRA: Quantization-Stable Shallow Recurrence with Low-Rank Corrective Adapters

BPB: 1.1182 | Compliance: LOOKS CLEAN (score-first-per-chunk TTT, the legal #1416/#1423 pattern)

What I found in the code (head SHA elided): the TTT path at line 432 implements the score-first-per-chunk pattern, where each chunk is scored under the current parameters before the adapter updates on it. Per Issue #402 and Issue #677, TTT is legal when each token is scored before the adapter updates on it, and that is what the code does here, chunk by chunk.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.18s, dim=512, layers=11, vocab=1024, code=73082 B, SMOKE_TEST_PASS.

Verdict: LOOKS CLEAN. Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending standard checks (3-seed validation, 16MB artifact cap, 10-min wallclock on 8×H100 SXM). The compliance picture matches the legal reference frontier and no flags were raised by the classification pass.

Auto-classification caveat: this review was drafted by the AST-based classifier against a template derived from manually-reviewed cluster PRs (#1420, #1450, #1487, #1541, #1529, #1533, #1518). If I've misread a subtlety in your eval path (e.g. multi-epoch TTT that I mistook for single-pass, or a target-in-key lookup I missed in a helper function), please flag it and I'll re-run the audit manually.

Reviewed by @MatoTeziTanka, The Agora. Classification via deterministic AST-based classifier.
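The score-first-per-chunk pattern the review describes can be sketched in a few lines; `score_fn` and `update_fn` are hypothetical stand-ins, not functions from the submission:

```python
def ttt_eval(chunks, score_fn, update_fn, params):
    """Legal TTT per the score-first rule: each chunk is scored under the
    current params BEFORE the adapter updates on that same chunk, so no
    token's loss ever sees parameters already trained on that token."""
    total_bits = 0.0
    total_tokens = 0
    for chunk in chunks:
        total_bits += score_fn(params, chunk)   # score first (legal)
        params = update_fn(params, chunk)       # then adapt on the same chunk
        total_tokens += len(chunk)
    return total_bits / total_tokens

# Toy check: per-token cost halves after each update, and each chunk
# is billed at its pre-update cost.
bpb = ttt_eval([[0, 1], [2, 3]],
               score_fn=lambda p, c: p * len(c),
               update_fn=lambda p, c: p / 2,
               params=2.0)
print(bpb)  # (2*2 + 1*2) / 4 = 1.5
```

The illegal variant would swap the two lines in the loop (update first, then score), letting each token be scored by parameters that have already seen it.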
@MatoTeziTanka Thanks for the review! Applied for compute credits but haven't heard back yet, so full 3-seed training runs are still blocked. Will update the PR with results as soon as I can get access to 8xH100s.
Not to be a Debbie Downer, but there is plenty of evidence you won't get those credits. You can track this in the open issues as well as The Agora, and read the red banner staring you in the face.
Summary
Why this direction
Weight sharing has consistently failed in this competition due to quantization error accumulation across repeated layers (e.g. PR #363: +4.3 BPB at 3 cycles).
However, PR #686 demonstrated that shallow recurrence (<=2 repeats) remains stable under int6 quantization (~1.1182 BPB), suggesting that limited reuse is viable.
RecurLoRA builds on this by introducing per-pass low-rank corrective adapters.
This enables increased effective depth (11 -> 13 layers) without the instability of deep recurrence, in effect reallocating the fixed 16MB budget from duplicated weights into added depth.
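A schematic of the resulting forward pass, combining the repeat, the RMSNorm before reuse, and the alpha-scaled correction (layer indices, the alpha value, and the corrector hook are illustrative assumptions):

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def forward(x, layers, repeat_idx=(4, 5), alpha=0.6, correctors=None):
    """Shallow recurrence sketch: layers in repeat_idx run a second time,
    with RMSNorm before the repeat and an optional per-pass low-rank
    correction scaled by alpha. 11 physical layers -> 13 virtual layers."""
    for i, layer in enumerate(layers):
        x = layer(x)
        if i in repeat_idx:
            x = rms_norm(x)                       # stabilize before reuse
            x = layer(x)                          # second pass, shared weights
            if correctors is not None:
                x = x + alpha * correctors[i](x)  # rank-2 corrective term
    return x

# Count layer applications: 11 physical layers + 2 repeats = 13 passes.
calls = [0]
def identity_layer(x):
    calls[0] += 1
    return x
forward(np.ones((2, 8)), [identity_layer] * 11)
print(calls[0])  # 13
```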
Status
Implementation complete and validated.
Full training runs (3 seeds + ablations) queued pending compute.
Test plan