
SP8192 + 4-Layer Depth Recurrence (loop_end=6)#1678

Open
tashapais wants to merge 2 commits into openai:main from
tashapais:submission/sp8192-4layer-recurrence

Conversation


@tashapais tashapais commented Apr 16, 2026

Summary

Extends the current SOTA (val_bpb=1.0810, PR #1509) by widening the depth recurrence from 3 looped layers to 4 (LOOP_END=6). All other hyperparameters and techniques are unchanged; QK_GAIN_INIT=5.25, previously passed as an env var, is simply absorbed as the script default.

The only code change from SOTA:

# Before (SOTA, PR #1509)
loop_end=int(os.environ.get('LOOP_END', 5))
qk_gain_init=float(os.environ.get('QK_GAIN_INIT', 5.))

# After (this PR)
loop_end=int(os.environ.get('LOOP_END', 6))
qk_gain_init=float(os.environ.get('QK_GAIN_INIT', 5.25))  # absorbs known-good default

Architecture

Virtual layer sequences

SOTA (loop_end=5) — 17 virtual layers, 8 U-Net skips:

Encoder [0,1,2,3,4,5,3,4]     Decoder [5,3,4,5,6,7,8,9,10]
         pre └──loop──┘ 2nd          2nd └──loop──┘ post (par)

This PR (loop_end=6) — 19 virtual layers, 9 U-Net skips:

Encoder [0,1,2,3,4,5,6,3,4]   Decoder [5,6,3,4,5,6,7,8,9,10]
         pre └────loop────┘ 2nd       2nd └────loop────┘ post (par)

Layer 6 is promoted from the non-recurring post-loop section into the recurrence core. It now executes 3 times (like layers 3, 4, 5) instead of once. The 4-layer loop also adds one new U-Net skip connection (9 vs 8 pairs).
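
For concreteness, here is a small Python sketch of how the virtual-layer schedule can be derived from LOOP_START/LOOP_END. The function name, the inclusive treatment of LOOP_END, and the midpoint split of the second pass are my own reconstruction from the sequences above, not the actual train_gpt.py code; the asserts check it reproduces both configs.

# Hypothetical reconstruction of the virtual-layer schedule (not train_gpt.py).
def virtual_schedule(n_layers=11, loop_start=3, loop_end=6):
    loop = list(range(loop_start, loop_end + 1))   # looped layers (LOOP_END inclusive)
    pre  = list(range(0, loop_start))              # run once, before the loop
    post = list(range(loop_end + 1, n_layers))     # run once, after the loop
    mid  = (len(loop) + 1) // 2                    # second pass splits at the U-Net midpoint
    encoder = pre + loop + loop[:mid]              # pass 1 + first half of pass 2
    decoder = loop[mid:] + loop + post             # rest of pass 2 + pass 3 + post
    return encoder, decoder

enc, dec = virtual_schedule(loop_end=5)            # SOTA
assert (enc, dec) == ([0,1,2,3,4,5,3,4], [5,3,4,5,6,7,8,9,10])
enc, dec = virtual_schedule(loop_end=6)            # this PR
assert (enc, dec) == ([0,1,2,3,4,5,6,3,4], [5,6,3,4,5,6,7,8,9,10])
assert len(enc) + len(dec) == 19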

Compute Budget Equivalence

The 4-layer loop is slower per step, but the total layer-step budget is identical:

Config     Virtual layers   Est. steps   Layer-steps
SOTA       17               ~4,550       ~77,350
This PR    19               ~4,071       ~77,349
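
The ~4,071 step estimate is just the layer-step identity solved for the new step count; a quick sanity check in Python (step counts taken from the table above):

# Equal-budget step count: hold total layer-steps fixed as virtual depth grows.
sota_layers, sota_steps = 17, 4550
pr_layers = 19
pr_steps = round(sota_layers * sota_steps / pr_layers)  # -> 4071
print(sota_layers * sota_steps, pr_layers * pr_steps)   # 77350 77349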

Prior depth-recurrence results show monotonic improvement with virtual depth at equal compute:

Submission                Looped layers   Virtual layers   val_bpb (no TTT)
PR #1260 (2-layer loop)   [4,5]           ~13              1.0979
PR #1394 (3-layer loop)   [3,4,5]         17               1.0856
This PR  (4-layer loop)   [3,4,5,6]       19               pending

Local Verification

test_architecture.py validates both configs on CPU (no CUDA, no flash_attn, standard PyTorch only):

Config: SOTA (loop_end=5)
  encoder: [0, 1, 2, 3, 4, 5, 3, 4]
  decoder: [5, 3, 4, 5, 6, 7, 8, 9, 10]
  virtual_layers=17, skips=8
  forward pass: OK  shape=torch.Size([2, 32, 256])
  gradient flow: OK  loss=5.5465
  all looped blocks have clean gradients: OK

Config: This PR (loop_end=6)
  encoder: [0, 1, 2, 3, 4, 5, 6, 3, 4]
  decoder: [5, 6, 3, 4, 5, 6, 7, 8, 9, 10]
  virtual_layers=19, skips=9
  forward pass: OK  shape=torch.Size([2, 32, 256])
  gradient flow: OK  loss=5.5435
  all looped blocks have clean gradients: OK

All architecture tests PASSED.
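
For reference, the style of check performed can be sketched in a few lines of standard PyTorch. TinyBlock and the flat schedule below are stand-ins (the real test drives the actual model blocks, including the U-Net skips), so this is illustrative only:

# Minimal sketch of the CPU verification style (stand-in model, no CUDA).
import torch
import torch.nn as nn

class TinyBlock(nn.Module):          # hypothetical stand-in for a transformer block
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
    def forward(self, x):
        return x + torch.relu(self.proj(x))

dim = 256
schedule = [0,1,2,3,4,5,6,3,4] + [5,6,3,4,5,6,7,8,9,10]   # 19 virtual layers
blocks = nn.ModuleList(TinyBlock(dim) for _ in range(11)) # 11 physical layers

h = torch.randn(2, 32, dim)
for i in schedule:
    h = blocks[i](h)                 # looped blocks 3..6 run three times each
h.pow(2).mean().backward()

# Gradient-flow check: every physical layer must receive a finite gradient;
# looped blocks accumulate gradients across all three passes.
for i, blk in enumerate(blocks):
    g = blk.proj.weight.grad
    assert g is not None and torch.isfinite(g).all(), f"layer {i} grad broken"
print("forward pass: OK", h.shape)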

Reproduction

pip install brotli sentencepiece
pip install flash_attn_3 --no-deps --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291/
MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf python3 data/cached_challenge_fineweb.py --variant sp8192

SEED=42 TTT_ENABLED=1 TTT_LR=0.005 TTT_EPOCHS=3 \
  torchrun --standalone --nproc_per_node=8 train_gpt.py

No extra env vars are needed: QK_GAIN_INIT=5.25 and LOOP_END=6 are now the script defaults.
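
Because both values remain env-overridable, the prior SOTA configuration can still be launched from this branch for an A/B comparison (assuming the same data prep as above):

# A/B sanity run: env overrides restore the prior SOTA config, no code changes.
LOOP_END=5 QK_GAIN_INIT=5.0 SEED=42 TTT_ENABLED=1 TTT_LR=0.005 TTT_EPOCHS=3 \
  torchrun --standalone --nproc_per_node=8 train_gpt.py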

Results pending on 8xA100 hardware.

Test plan

  • Run 3 seeds (42, 314, 999) on 8xA100s
  • Verify training completes under 600s
  • Verify artifact under 16,000,000 bytes
  • Verify sliding-window + TTT eval under 600s
  • Report val_bpb for each seed

Commits

Extends the SOTA 3-layer depth recurrence to 4 layers by setting
LOOP_END=6 (was 5), creating 19 virtual layers from 11 physical
instead of 17. Total layer-step compute budget is identical (~77,350).

Also absorbs QK_GAIN_INIT=5.25 as the script default (was passed
as an env var in the prior SOTA submission).

Results pending on 8xA100 hardware.
- test_architecture.py: CPU-only PyTorch test verifying forward pass,
  gradient flow, and virtual layer structure for both SOTA and 4-layer
  configs; all checks pass locally
- README: add virtual layer diagrams, skip-connection table, compute
  budget equivalence analysis, and verified test output
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
… candidates

User shared a deep timeline of all recurrence experiments in the
PG competition (openai#8 through openai#1739). Several of my previously-proposed
experiments have ALREADY BEEN TESTED ON THIS STACK and shown to fail:

KILLED:
- Timing sweep earlier: openai#1726 showed 0.15 is +0.050 worse; openai#1739
  showed step-0 catastrophic (1.3936 bpb)
- Progressive ramp: openai#1663 showed hard-onset = smooth, no difference
- Position shift: openai#1726 showed layer 2-7 +0.163 worse, layer 5-6 shift
  +0.006 worse — layer 3-5 IS the empirical sweet spot

Also corrected the baseline config: openai#1736 uses LOOP_START=3 LOOP_END=5
(three layers: 3, 4, 5 — "Loop345"), not Loop45 as directory name
suggests. 3 layers × 3 passes = 17 virtual layers.

VIABLE candidates:
- Recur-Alpha (openai#1714, Anakintano): learnable scalar per looped block,
  init 0 → identity. 6 params. Author's grant ran out before TTT eval
  so composition with openai#1736's phased TTT is genuinely open. NEW TOP PICK.
- Cross-pass XSA: still novel, untested in any PR
- Loop3-6 variant (openai#1678): tashapais running it; might wait for result

Recommendation updated: port Recur-Alpha onto openai#1736 as spec 015.
~$25, identity-at-init (safe), 30 LOC, direct recurrence question.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>