
Non-Record: Depth Recurrence Research — 20 Ablation Runs, 8 Techniques, 5 Series (best val_bpb=1.2624, 14 eff layers from 6 unique blocks) #855

Open

aazizyan wants to merge 10 commits into openai:main from aazizyan:research/RecurrenceFix_3Loop_Birkhoff_OutputLN_TimestepScale

Conversation


aazizyan commented Mar 26, 2026

Depth Recurrence in Parameter-Constrained Transformers: A Systematic Study

20 ablation runs across 5 series testing 8 techniques for stabilizing depth recurrence under 16MB int8+zlib quantization. Three novel stabilization techniques enable 3-loop recurrence for the first time in competition history. Five additional techniques tested with documented positive and negative results.

Best Results

| Config | Post-Q BPB | Q-gap | Artifact | Note |
| --- | --- | --- | --- | --- |
| 1+4×3+1 full share + FiLM + sinusoidal depth (Run T) | 1.2624 | +0.0073 | 10.7MB | Best practical config, ~4.8MB headroom (layout sketched below) |
| 1+4×2+1 shared attn + unique MLPs (Run L) | 1.2406 | +0.0073 | 14.7MB | Best absolute, but no room for SOTA |
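
For orientation, here is a minimal sketch of the 1+4×3+1 layout from the table above: one unique prelude block, four shared blocks replayed for three loops, and one unique coda, giving 14 effective layers from 6 unique blocks. Module and argument names are illustrative assumptions, not the identifiers used in train_gpt.py.

```python
import torch.nn as nn

class RecurrentStack(nn.Module):
    """1 prelude + (n_shared blocks x n_loops) + 1 coda; shared weights are reused across loops."""
    def __init__(self, block_factory, n_shared=4, n_loops=3):
        super().__init__()
        self.prelude = block_factory()                                    # 1 unique entry block
        self.shared = nn.ModuleList([block_factory() for _ in range(n_shared)])
        self.coda = block_factory()                                       # 1 unique exit block
        self.n_loops = n_loops

    def forward(self, x):
        x = self.prelude(x)
        for loop in range(self.n_loops):        # the same 4 blocks traversed 3 times
            for block in self.shared:
                x = block(x)                    # `loop` is where FiLM / depth encoding would hook in
        return self.coda(x)
```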

Techniques That Work

| Technique | Delta | Cost |
| --- | --- | --- |
| Output-LN | −0.007 BPB | Zero |
| Prelude-coda | −0.016 BPB | More unique params |
| Birkhoff mixing | Enables 3-loop stability (sketched below) | Zero |
| Timestep scaling (γ) | Q-gap −26–30% | ~8KB FP16 |
| FiLM bias (β) | −0.003 BPB | ~8KB FP16 |
| Sinusoidal depth encoding | Q-gap −0.0005 | Zero (non-persistent buffer) |
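
As a rough illustration of the two zero-cost stabilizers in the table above, here is a hedged sketch of Output-LN (normalize the block output before it re-enters the residual stream, so the MLP still sees unnormalized inputs) and Birkhoff mixing (a sigmoid-squashed logit forming a convex combination of the block's input and output, following the resid_mix_logit → sigmoid alpha description in the audit comment below). Exact placement and names are assumptions, not the PR's literal code.

```python
import torch
import torch.nn as nn

class StabilizedBlock(nn.Module):
    def __init__(self, dim, hidden_mult=4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden_mult * dim), nn.GELU(),
            nn.Linear(hidden_mult * dim, dim),
        )
        self.output_ln = nn.LayerNorm(dim)                    # Output-LN: bounds what re-enters the residual
        self.resid_mix_logit = nn.Parameter(torch.zeros(1))   # Birkhoff mixing logit

    def forward(self, x):
        y = self.output_ln(self.mlp(x))                       # MLP sees the raw residual; output is normalized
        alpha = torch.sigmoid(self.resid_mix_logit)           # alpha in (0, 1): convex combination
        return alpha * x + (1.0 - alpha) * y                  # keeps repeated application from blowing up
```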

Techniques That Don't Work (documented negative results)

| Technique | Result | Why |
| --- | --- | --- |
| Learned depth embeddings | +0.0014 BPB worse | Throughput overhead; values stayed near zero |
| Unique input norm gains | +0.0004 BPB worse | MLP gains didn't move from 1.0; redundant with Output-LN |
| Unique MLPs (attn-only sharing) | −0.026 BPB (best result) | Too expensive: 14.7MB artifact, no SOTA headroom |

Key Findings

  1. Timestep scaling helps quantization, not training — float16 passthrough params bypass int8, reducing Q-gap 26–30% with zero pre-quant BPB effect (see the sketch after this list)
  2. MLP needs weight-space differentiation, not input-space modulation — unique MLPs give −0.026 BPB, but cheap input controls (norms, depth embeddings) give nothing
  3. ALBERT's finding confirmed at 512d — attention sharing is nearly free, FFN sharing causes most degradation
  4. Q-gap scales with training duration — screening underestimates quantization problems 4-7×
  5. Sinusoidal > learned for depth encoding — zero cost, same Q-gap benefit, 0.0015 BPB better due to throughput savings
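
A hedged sketch of what finding 1 describes: tiny per-loop gains (γ) kept as float16 passthrough tensors that skip the int8+zlib path at export time. The gamma shape, the name-based exclusion, and the export helper below are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class TimestepScale(nn.Module):
    """Per-loop, per-channel gain applied to the recurrent state (gamma, ~8KB in fp16)."""
    def __init__(self, n_loops, dim, gamma_cap=4.0):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(n_loops, dim))
        self.gamma_cap = gamma_cap

    def forward(self, x, loop_idx):
        return x * self.gamma[loop_idx].clamp(0.0, self.gamma_cap)

def export_artifact(model, fp16_passthrough=("gamma",)):
    """Quantize most weights to int8, but keep the tiny passthrough tensors in fp16."""
    artifact = {}
    for name, p in model.named_parameters():
        t = p.detach()
        if any(key in name for key in fp16_passthrough):
            artifact[name] = t.half()                          # bypasses int8 entirely
        else:
            scale = t.abs().max().clamp_min(1e-8) / 127.0      # symmetric per-tensor int8
            artifact[name] = ((t / scale).round().to(torch.int8), scale)
    return artifact
```

Because the passthrough tensors never pass through int8, they can absorb per-depth scale drift that the quantized weights cannot represent, which is consistent with the benefit showing up only in the post-quant gap.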

Validated Stack for SOTA Integration

Output-LN + Birkhoff mixing + FiLM scale+shift + sinusoidal depth encoding. Total FP16 passthrough: ~50KB. Artifact: ~10.7MB. Headroom for SOTA features: ~4.8MB.
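
A sketch of how the last two items in this stack could fit together: a fixed sinusoidal encoding of the depth/loop index stored as a non-persistent buffer (zero artifact cost), projected to a FiLM scale and shift by a small layer that would sit in the FP16 passthrough budget. Whether the PR conditions FiLM on the sinusoidal code in exactly this way is an assumption; names and shapes are illustrative.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_depth_code(n_steps, dim):
    """Fixed sin/cos encoding of the effective depth index (transformer-style)."""
    pos = torch.arange(n_steps, dtype=torch.float32).unsqueeze(1)
    freqs = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32) * (-math.log(10000.0) / dim))
    code = torch.zeros(n_steps, dim)
    code[:, 0::2] = torch.sin(pos * freqs)
    code[:, 1::2] = torch.cos(pos * freqs)
    return code

class DepthFiLM(nn.Module):
    def __init__(self, n_steps, dim, code_dim=32):
        super().__init__()
        self.register_buffer("depth_code", sinusoidal_depth_code(n_steps, code_dim),
                             persistent=False)                 # not saved with the state dict: zero artifact cost
        self.to_scale_shift = nn.Linear(code_dim, 2 * dim)     # small; a candidate for fp16 passthrough

    def forward(self, x, step):
        scale, shift = self.to_scale_shift(self.depth_code[step]).chunk(2, dim=-1)
        return x * (1.0 + scale) + shift                       # FiLM: per-depth scale and bias
```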

20 Runs Across 5 Series

  • Series 1 (7 screening runs): technique isolation on 1×H100
  • Series 2 (5 full-scale runs): 8×H100 validation, Run K = first viable 3-loop (1.2659)
  • Series 3 (4 runs): FiLM bias (−0.003) + attention-only sharing (−0.026 but too expensive)
  • Series 4 (4 runs): learned depth embeddings + unique norms (negative result)
  • Series 5 (1 run): sinusoidal depth encoding (free, marginal Q-gap benefit)

See research_notes.md for theory, 14 citations, and detailed analysis.

Credits

Built on insights from:

@aazizyan (Author)

Some untested directions that might be worth exploring:

  • The three stabilization techniques from this PR (Output-LN, Birkhoff mixing, timestep scaling) on shallow recurrence (repeat 1-2 layers on the SOTA stack) — the Q-gap reduction from timestep scaling could be meaningful at the frontier
  • Int6/GPTQ interaction with Birkhoff mixing — sigmoid values in [0,1] should quantize cleanly at any bit width
  • Output-LN on non-recurrent models — may help even without weight sharing, since it lets MLP see unnormalized inputs while bounding output
  • Gamma cap ablation (2.0 vs 4.0 vs 8.0) — the cap value was chosen empirically, not optimized
  • QAT combined with Birkhoff + Output-LN + timestep scaling — QAT has been tried for recurrence before, but not with these stabilization techniques in place

aazizyan changed the title from "Non-Record: First Viable 3-Loop Recurrence — Birkhoff + Output-LN + Timestep Scaling (val_bpb=1.2659, 14 eff layers from 6 unique blocks)" to "Non-Record: Depth Recurrence Research — 20 Ablation Runs, 8 Techniques, 5 Series (best val_bpb=1.2624, 14 eff layers from 6 unique blocks)" on Apr 2, 2026

aazizyan commented Apr 2, 2026

Follow-up: PR #1204 (@msisovic, 1.1063 BPB) independently confirms two findings from this study — attention sharing is free while MLP needs unique weights (they use REPEAT_UNTIE_MLP=full), and shallow recurrence beats deep. Techniques from this PR not yet tested on their stack: Output-LN, Birkhoff mixing, FiLM scale+shift.

@MatoTeziTanka

Community Review — Non-Record: Depth Recurrence Research — 20 Ablation Runs, 8 Techniques, 5 Series (best val_bpb=1.2624, 14 eff layers from 6 unique blocks)

Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

PR #855 — Audit Summary
Author: Alexandr Azizyan (@aazizyan)
Submission: "First Viable 3-Loop Recurrence: Birkhoff + Output-LN + Timestep Scaling"
Track: non-record-16mb | val_bpb: 1.26586418 (pre-quant: 1.2583)
Head SHA: 5e31104

### Check 1: ILLEGAL n-gram family bug (target XOR'd into hash key / BigramHash)

CLEAR. No hash table, no BigramHash class, no XOR operations on token IDs, no n-gram lookup structures anywhere in train_gpt.py. The only lookup tables are the sentencepiece BPB accounting LUTs (base_bytes_lut, has_leading_space_lut, is_boundary_token_lut) built at lines 206–230 — these are read-only scoring utilities, not model components.

### Check 2: ILLEGAL Pre-Quant TTT (multi-epoch gradient update on val_tokens before scoring)

CLEAR. val_tokens is loaded once at line 1092 and used exclusively in eval_val() (lines 245–304), which runs under torch.inference_mode() and model.eval(). No backward pass, no optimizer step, and no weight mutation is ever performed on val_tokens. The post-quant roundtrip eval at lines 1430–1447 is also purely inference on the dequantized model.

### Check 3: LEGAL score-first TTT (PR #1413 pattern, is_last_chunk guard)

NOT PRESENT. No TTT mechanism of any kind exists — not score-first, not any variant. This submission is vanilla train-then-eval with no test-time adaptation.

### Check 4: HOLD scored-region SLOT

NOT APPLICABLE. No scored-region slot manipulation detected.

### Check 5: CLEAN pure neural

CONFIRMED. The submission is a standard pure-neural transformer with:

  • 3-loop recurrence (1 prelude + 4 shared × 3 loops + 1 coda = 14 effective layers, 6 unique blocks)
  • Birkhoff-constrained residual mixing (resid_mix_logit → sigmoid alpha, lines 718, ...)

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the audit — this looks like a clean pure-neural submission.


Reviewed by @MatoTeziTanka (The Agora). Compliance audit via LLM agent (Sonnet) reviewing full train_gpt.py source, cross-checked against deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually.

