
Non-Record: Depth Recurrence Research — 20 Ablation Runs, 8 Techniques, 5 Series (best val_bpb=1.2624, 14 eff layers from 6 unique blocks) #855

Open

aazizyan wants to merge 10 commits into openai:main from aazizyan:research/RecurrenceFix_3Loop_Birkhoff_OutputLN_TimestepScale

Conversation


aazizyan commented Mar 26, 2026

Depth Recurrence in Parameter-Constrained Transformers: A Systematic Study

20 ablation runs across 5 series testing 8 techniques for stabilizing depth recurrence under 16MB int8+zlib quantization. Three novel stabilization techniques enable 3-loop recurrence for the first time in competition history. Five additional techniques tested with documented positive and negative results.

Best Results

| Config | Post-Q BPB | Q-gap | Artifact | Note |
| --- | --- | --- | --- | --- |
| 1+4×3+1 full share + FiLM + sinusoidal depth (Run T) | 1.2624 | +0.0073 | 10.7MB | Best practical config, ~4.8MB headroom (layout sketched below) |
| 1+4×2+1 shared attn + unique MLPs (Run L) | 1.2406 | +0.0073 | 14.7MB | Best absolute, but no room for SOTA |
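
For orientation, here is a minimal sketch of the 1+4×3+1 layout from the table above: one unique prelude block, four shared blocks replayed for three loops, and one unique coda, giving 14 effective layers from 6 unique blocks. Module and argument names are illustrative assumptions, not the identifiers used in train_gpt.py.

```python
import torch.nn as nn

class RecurrentStack(nn.Module):
    """1 prelude + (n_shared blocks x n_loops) + 1 coda; shared weights are reused across loops."""
    def __init__(self, block_factory, n_shared=4, n_loops=3):
        super().__init__()
        self.prelude = block_factory()                                    # 1 unique entry block
        self.shared = nn.ModuleList([block_factory() for _ in range(n_shared)])
        self.coda = block_factory()                                       # 1 unique exit block
        self.n_loops = n_loops

    def forward(self, x):
        x = self.prelude(x)
        for loop in range(self.n_loops):        # the same 4 blocks traversed 3 times
            for block in self.shared:
                x = block(x)                    # `loop` is where FiLM / depth encoding would hook in
        return self.coda(x)
```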

Techniques That Work

| Technique | Delta | Cost |
| --- | --- | --- |
| Output-LN | −0.007 BPB | Zero |
| Prelude-coda | −0.016 BPB | More unique params |
| Birkhoff mixing | Enables 3-loop stability (sketched below) | Zero |
| Timestep scaling (γ) | Q-gap −26–30% | ~8KB FP16 |
| FiLM bias (β) | −0.003 BPB | ~8KB FP16 |
| Sinusoidal depth encoding | Q-gap −0.0005 | Zero (non-persistent buffer) |
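
As a rough illustration of the two zero-cost stabilizers in the table above, here is a hedged sketch of Output-LN (normalize the block output before it re-enters the residual stream, so the MLP still sees unnormalized inputs) and Birkhoff mixing (a sigmoid-squashed logit forming a convex combination of the block's input and output, following the resid_mix_logit → sigmoid alpha description in the audit comment below). Exact placement and names are assumptions, not the PR's literal code.

```python
import torch
import torch.nn as nn

class StabilizedBlock(nn.Module):
    def __init__(self, dim, hidden_mult=4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden_mult * dim), nn.GELU(),
            nn.Linear(hidden_mult * dim, dim),
        )
        self.output_ln = nn.LayerNorm(dim)                    # Output-LN: bounds what re-enters the residual
        self.resid_mix_logit = nn.Parameter(torch.zeros(1))   # Birkhoff mixing logit

    def forward(self, x):
        y = self.output_ln(self.mlp(x))                       # MLP sees the raw residual; output is normalized
        alpha = torch.sigmoid(self.resid_mix_logit)           # alpha in (0, 1): convex combination
        return alpha * x + (1.0 - alpha) * y                  # keeps repeated application from blowing up
```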

Techniques That Don't Work (documented negative results)

| Technique | Result | Why |
| --- | --- | --- |
| Learned depth embeddings | +0.0014 BPB worse | Throughput overhead; values stayed near zero |
| Unique input norm gains | +0.0004 BPB worse | MLP gains didn't move from 1.0; redundant with Output-LN |
| Unique MLPs (attn-only sharing) | −0.026 BPB (best result) | Too expensive: 14.7MB artifact, no SOTA headroom |

Key Findings

  1. Timestep scaling helps quantization, not training — float16 passthrough params bypass int8, reducing Q-gap 26–30% with zero pre-quant BPB effect (see the sketch after this list)
  2. MLP needs weight-space differentiation, not input-space modulation — unique MLPs give −0.026 BPB, but cheap input controls (norms, depth embeddings) give nothing
  3. ALBERT's finding confirmed at 512d — attention sharing is nearly free, FFN sharing causes most degradation
  4. Q-gap scales with training duration — screening underestimates quantization problems 4-7×
  5. Sinusoidal > learned for depth encoding — zero cost, same Q-gap benefit, 0.0015 BPB better due to throughput savings
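
A hedged sketch of what finding 1 describes: tiny per-loop gains (γ) kept as float16 passthrough tensors that skip the int8+zlib path at export time. The gamma shape, the name-based exclusion, and the export helper below are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class TimestepScale(nn.Module):
    """Per-loop, per-channel gain applied to the recurrent state (gamma, ~8KB in fp16)."""
    def __init__(self, n_loops, dim, gamma_cap=4.0):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(n_loops, dim))
        self.gamma_cap = gamma_cap

    def forward(self, x, loop_idx):
        return x * self.gamma[loop_idx].clamp(0.0, self.gamma_cap)

def export_artifact(model, fp16_passthrough=("gamma",)):
    """Quantize most weights to int8, but keep the tiny passthrough tensors in fp16."""
    artifact = {}
    for name, p in model.named_parameters():
        t = p.detach()
        if any(key in name for key in fp16_passthrough):
            artifact[name] = t.half()                          # bypasses int8 entirely
        else:
            scale = t.abs().max().clamp_min(1e-8) / 127.0      # symmetric per-tensor int8
            artifact[name] = ((t / scale).round().to(torch.int8), scale)
    return artifact
```

Because the passthrough tensors never pass through int8, they can absorb per-depth scale drift that the quantized weights cannot represent, which is consistent with the benefit showing up only in the post-quant gap.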

Validated Stack for SOTA Integration

Output-LN + Birkhoff mixing + FiLM scale+shift + sinusoidal depth encoding. Total FP16 passthrough: ~50KB. Artifact: ~10.7MB. Headroom for SOTA features: ~4.8MB.
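
A sketch of how the last two items in this stack could fit together: a fixed sinusoidal encoding of the depth/loop index stored as a non-persistent buffer (zero artifact cost), projected to a FiLM scale and shift by a small layer that would sit in the FP16 passthrough budget. Whether the PR conditions FiLM on the sinusoidal code in exactly this way is an assumption; names and shapes are illustrative.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_depth_code(n_steps, dim):
    """Fixed sin/cos encoding of the effective depth index (transformer-style)."""
    pos = torch.arange(n_steps, dtype=torch.float32).unsqueeze(1)
    freqs = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32) * (-math.log(10000.0) / dim))
    code = torch.zeros(n_steps, dim)
    code[:, 0::2] = torch.sin(pos * freqs)
    code[:, 1::2] = torch.cos(pos * freqs)
    return code

class DepthFiLM(nn.Module):
    def __init__(self, n_steps, dim, code_dim=32):
        super().__init__()
        self.register_buffer("depth_code", sinusoidal_depth_code(n_steps, code_dim),
                             persistent=False)                 # not saved with the state dict: zero artifact cost
        self.to_scale_shift = nn.Linear(code_dim, 2 * dim)     # small; a candidate for fp16 passthrough

    def forward(self, x, step):
        scale, shift = self.to_scale_shift(self.depth_code[step]).chunk(2, dim=-1)
        return x * (1.0 + scale) + shift                       # FiLM: per-depth scale and bias
```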

20 Runs Across 5 Series

  • Series 1 (7 screening runs): technique isolation on 1×H100
  • Series 2 (5 full-scale runs): 8×H100 validation, Run K = first viable 3-loop (1.2659)
  • Series 3 (4 runs): FiLM bias (−0.003) + attention-only sharing (−0.026 but too expensive)
  • Series 4 (4 runs): learned depth embeddings + unique norms (negative result)
  • Series 5 (1 run): sinusoidal depth encoding (free, marginal Q-gap benefit)

See research_notes.md for theory, 14 citations, and detailed analysis.

Credits

Built on insights from:

@aazizyan (Author)

Some untested directions that might be worth exploring:

  • The three stabilization techniques from this PR (Output-LN, Birkhoff mixing, timestep scaling) on shallow recurrence (repeat 1-2 layers on the SOTA stack) — the Q-gap reduction from timestep scaling could be meaningful at the frontier
  • Int6/GPTQ interaction with Birkhoff mixing — sigmoid values in [0,1] should quantize cleanly at any bit width
  • Output-LN on non-recurrent models — may help even without weight sharing, since it lets MLP see unnormalized inputs while bounding output
  • Gamma cap ablation (2.0 vs 4.0 vs 8.0) — the cap value was chosen empirically, not optimized
  • QAT combined with Birkhoff + Output-LN + timestep scaling — QAT has been tried for recurrence before, but not with these stabilization techniques in place

aazizyan changed the title from "Non-Record: First Viable 3-Loop Recurrence — Birkhoff + Output-LN + Timestep Scaling (val_bpb=1.2659, 14 eff layers from 6 unique blocks)" to "Non-Record: Depth Recurrence Research — 20 Ablation Runs, 8 Techniques, 5 Series (best val_bpb=1.2624, 14 eff layers from 6 unique blocks)" on Apr 2, 2026

aazizyan commented Apr 2, 2026

Follow-up: PR #1204 (@msisovic, 1.1063 BPB) independently confirms two findings from this study — attention sharing is free while MLP needs unique weights (they use REPEAT_UNTIE_MLP=full), and shallow recurrence beats deep. Techniques from this PR not yet tested on their stack: Output-LN, Birkhoff mixing, FiLM scale+shift.

@MatoTeziTanka

Community Review — Non-Record: Depth Recurrence Research — 20 Ablation Runs, 8 Techniques, 5 Series (best val_bpb=1.2624, 14 eff layers from 6 unique blocks)

Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

PR #855 — Audit Summary
Author: Alexandr Azizyan (@aazizyan)
Submission: "First Viable 3-Loop Recurrence: Birkhoff + Output-LN + Timestep Scaling"
Track: non-record-16mb | val_bpb: 1.26586418 (pre-quant: 1.2583)
Head SHA: 5e31104

### Check 1: ILLEGAL n-gram family bug (target XOR'd into hash key / BigramHash)

CLEAR. No hash table, no BigramHash class, no XOR operations on token IDs, no n-gram lookup structures anywhere in train_gpt.py. The only lookup tables are the sentencepiece BPB accounting LUTs (base_bytes_lut, has_leading_space_lut, is_boundary_token_lut) built at lines 206–230 — these are read-only scoring utilities, not model components.

### Check 2: ILLEGAL Pre-Quant TTT (multi-epoch gradient update on val_tokens before scoring)

CLEAR. val_tokens is loaded once at line 1092 and used exclusively in eval_val() (lines 245–304), which runs under torch.inference_mode() and model.eval(). No backward pass, no optimizer step, and no weight mutation is ever performed on val_tokens. The post-quant roundtrip eval at lines 1430–1447 is also purely inference on the dequantized model.

### Check 3: LEGAL score-first TTT (PR #1413 pattern, is_last_chunk guard)

NOT PRESENT. No TTT mechanism of any kind exists — not score-first, not any variant. This submission is vanilla train-then-eval with no test-time adaptation.

### Check 4: HOLD scored-region SLOT

NOT APPLICABLE. No scored-region slot manipulation detected.

### Check 5: CLEAN pure neural

CONFIRMED. The submission is a standard pure-neural transformer with:

  • 3-loop recurrence (1 prelude + 4 shared × 3 loops + 1 coda = 14 effective layers, 6 unique blocks)
  • Birkhoff-constrained residual mixing (resid_mix_logit → sigmoid alpha, lines 718, ...)

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the audit — this looks like a clean pure-neural submission.


Reviewed by @MatoTeziTanka (The Agora). Compliance audit via LLM agent (Sonnet) reviewing full train_gpt.py source, cross-checked against deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually.

