Non-Record: Depth Recurrence Research — 20 Ablation Runs, 8 Techniques, 5 Series (best val_bpb=1.2624, 14 eff layers from 6 unique blocks)#855
Some untested directions that might be worth exploring:
Follow-up: PR #1204 (@msisovic, 1.1063 BPB) independently confirms two findings from this study: attention sharing is free while the MLP needs unique weights (their run uses REPEAT_UNTIE_MLP=full), and shallow recurrence beats deep recurrence. Techniques from this PR not yet tested on their stack: Output-LN, Birkhoff mixing, and FiLM scale+shift.
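The "attention tied, MLP untied" finding can be illustrated with a minimal sketch. This is not the train_gpt.py implementation; the class and variable names here are illustrative assumptions. The key idea is that every block holds a reference to the same attention module while each block owns fresh MLP parameters:

```python
import torch
import torch.nn as nn

class SharedAttnBlock(nn.Module):
    """Transformer block whose attention module is shared across depth
    (passed in, same object everywhere), while the MLP is unique per block."""
    def __init__(self, shared_attn: nn.MultiheadAttention, dim: int):
        super().__init__()
        self.attn = shared_attn  # tied: one set of attention weights for all depths
        self.ln1 = nn.LayerNorm(dim)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(  # untied: fresh parameters in every block
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, need_weights=False)
        x = x + a
        return x + self.mlp(self.ln2(x))

dim, n_blocks = 64, 4
shared_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
blocks = nn.ModuleList(SharedAttnBlock(shared_attn, dim) for _ in range(n_blocks))

# Parameter accounting: attention weights are stored once in the artifact,
# MLP and LayerNorm weights n_blocks times.
unique_params = {id(p) for b in blocks for p in b.parameters()}
```

Under this scheme the artifact pays for attention once regardless of depth, which is why sharing it is "free" in the 16MB budget while untying the MLP is where the parameter spend goes.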
Community Review

Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

PR #855 — Audit Summary
Author: Alexandr Azizyan (@aazizyan)
Submission: "First Viable 3-Loop Recurrence: Birkhoff + Output-LN + Timestep Scaling"
Track: non-record-16mb | val_bpb: 1.26586418 (pre-quant: 1.2583)
Head SHA: 5e31104

---

### Check 1: ILLEGAL n-gram family bug (target XOR'd into hash key / BigramHash)

CLEAR. No hash table, no BigramHash class, no XOR operations on token IDs, and no n-gram lookup structures anywhere in train_gpt.py. The only lookup tables are the sentencepiece BPB accounting LUTs (
Depth Recurrence in Parameter-Constrained Transformers: A Systematic Study
20 ablation runs across 5 series testing 8 techniques for stabilizing depth recurrence under 16MB int8+zlib quantization. Three novel stabilization techniques enable 3-loop recurrence for the first time in competition history. Five additional techniques tested with documented positive and negative results.
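The 16MB budget is measured after int8 quantization plus zlib compression of the weights. A minimal sketch of that accounting follows; the function names and the symmetric per-tensor scheme are assumptions for illustration, not the competition's exact harness:

```python
import zlib
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: w is approximated by q * scale."""
    scale = float(np.abs(w).max()) / 127.0
    if scale == 0.0:
        scale = 1.0  # avoid division by zero for all-zero tensors
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def artifact_bytes(tensors):
    """Approximate int8+zlib artifact size for a list of float weight arrays."""
    payload = b"".join(quantize_int8(w)[0].tobytes() for w in tensors)
    return len(zlib.compress(payload, level=9))
```

The pre-quant vs. post-quant BPB gap reported above (1.2583 vs. 1.2659) is exactly the error this rounding step introduces: each weight moves by at most half a quantization step.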
Best Results
Techniques That Work
Techniques That Don't Work (documented negative results)
Key Findings
Validated Stack for SOTA Integration
Output-LN + Birkhoff mixing + FiLM scale+shift + sinusoidal depth encoding. Total FP16 passthrough: ~50KB. Artifact: ~10.7MB. Headroom for SOTA features: ~4.8MB.
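How the four named pieces might compose is sketched below. This is a hypothetical reading of the stack, not the submitted code: the module names, the FiLM conditioning on the depth code, and the simplified two-state Birkhoff mixing (a convex, doubly stochastic blend of the recurrent state and the block output) are all assumptions for illustration:

```python
import math
import torch
import torch.nn as nn

class RecurrentStack(nn.Module):
    """Sketch of the validated stack: one block looped n_loops times, with a
    sinusoidal depth code driving FiLM scale+shift, doubly stochastic (Birkhoff)
    mixing between the carried state and the block output, and a final
    output LayerNorm (Output-LN) as the stabilizer."""
    def __init__(self, block: nn.Module, dim: int, n_loops: int):
        super().__init__()
        self.block, self.dim, self.n_loops = block, dim, n_loops
        self.film = nn.Linear(dim, 2 * dim)                  # depth code -> (scale, shift)
        self.mix_logit = nn.Parameter(torch.zeros(n_loops))  # one mix coefficient per loop
        self.out_ln = nn.LayerNorm(dim)                      # Output-LN

    def depth_code(self, t: int) -> torch.Tensor:
        # Sinusoidal encoding of the loop index t, same form as position codes.
        i = torch.arange(self.dim // 2)
        freq = torch.exp(-math.log(10000.0) * 2 * i / self.dim)
        return torch.cat([torch.sin(t * freq), torch.cos(t * freq)])

    def forward(self, x):
        for t in range(self.n_loops):
            scale, shift = self.film(self.depth_code(t)).chunk(2)
            h = self.block(x) * (1 + scale) + shift  # FiLM modulation per depth step
            # 2x2 Birkhoff mixing: convex combination of identity and swap, i.e.
            # the doubly stochastic matrix [[a, 1-a], [1-a, a]] applied to (x, h).
            a = torch.sigmoid(self.mix_logit[t])
            x = a * x + (1 - a) * h
        return self.out_ln(x)
```

The FP16 passthrough cost quoted above (~50KB) is plausible for pieces of this shape: the FiLM projection, mixing coefficients, and final LayerNorm are all O(dim) or O(dim^2) with a small constant, tiny next to the untied MLPs.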
20 Runs Across 5 Series
See research_notes.md for theory, 14 citations, and detailed analysis.
Credits
Built on insights from: