
SP8192 + 4-Layer Depth Recurrence (loop_end=6)#1678

Open
tashapais wants to merge 2 commits into openai:main from
tashapais:submission/sp8192-4layer-recurrence

Conversation


@tashapais tashapais commented Apr 16, 2026

Summary

Extends the current SOTA (val_bpb=1.0810, PR #1509) by widening the depth recurrence from 3 looped layers to 4 (LOOP_END=6). All other hyperparameters and techniques are unchanged; QK_GAIN_INIT=5.25, previously passed as an env var, is simply absorbed as the script default.

The only code change from SOTA:

# Before (SOTA, PR #1509)
loop_end=int(os.environ.get('LOOP_END', 5))
qk_gain_init=float(os.environ.get('QK_GAIN_INIT', 5.))

# After (this PR)
loop_end=int(os.environ.get('LOOP_END', 6))
qk_gain_init=float(os.environ.get('QK_GAIN_INIT', 5.25))  # absorbs known-good default

Architecture

Virtual layer sequences

SOTA (loop_end=5) — 17 virtual layers, 8 U-Net skips:

Encoder [0,1,2,3,4,5,3,4]     Decoder [5,3,4,5,6,7,8,9,10]
         pre └──loop──┘ 2nd          2nd └──loop──┘ post (par)

This PR (loop_end=6) — 19 virtual layers, 9 U-Net skips:

Encoder [0,1,2,3,4,5,6,3,4]   Decoder [5,6,3,4,5,6,7,8,9,10]
         pre └────loop────┘ 2nd       2nd └────loop────┘ post (par)

Layer 6 is promoted from the non-recurring post-loop section into the recurrence core. It now executes 3 times (like layers 3, 4, 5) instead of once. The 4-layer loop also adds one new U-Net skip connection (9 vs 8 pairs).
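
For concreteness, here is a small Python sketch of how the virtual-layer schedule can be derived from LOOP_START/LOOP_END. The function name, the inclusive treatment of LOOP_END, and the midpoint split of the second pass are my own reconstruction from the sequences above, not the actual train_gpt.py code; the asserts check it reproduces both configs.

# Hypothetical reconstruction of the virtual-layer schedule (not train_gpt.py).
def virtual_schedule(n_layers=11, loop_start=3, loop_end=6):
    loop = list(range(loop_start, loop_end + 1))   # looped layers (LOOP_END inclusive)
    pre  = list(range(0, loop_start))              # run once, before the loop
    post = list(range(loop_end + 1, n_layers))     # run once, after the loop
    mid  = (len(loop) + 1) // 2                    # second pass splits at the U-Net midpoint
    encoder = pre + loop + loop[:mid]              # pass 1 + first half of pass 2
    decoder = loop[mid:] + loop + post             # rest of pass 2 + pass 3 + post
    return encoder, decoder

enc, dec = virtual_schedule(loop_end=5)            # SOTA
assert (enc, dec) == ([0,1,2,3,4,5,3,4], [5,3,4,5,6,7,8,9,10])
enc, dec = virtual_schedule(loop_end=6)            # this PR
assert (enc, dec) == ([0,1,2,3,4,5,6,3,4], [5,6,3,4,5,6,7,8,9,10])
assert len(enc) + len(dec) == 19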

Compute Budget Equivalence

The 4-layer loop is slower per step, but the total layer-step budget is identical:

Config     Virtual layers   Est. steps   Layer-steps
SOTA       17               ~4,550       ~77,350
This PR    19               ~4,071       ~77,349
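
The ~4,071 step estimate is just the layer-step identity solved for the new step count; a quick sanity check in Python (step counts taken from the table above):

# Equal-budget step count: hold total layer-steps fixed as virtual depth grows.
sota_layers, sota_steps = 17, 4550
pr_layers = 19
pr_steps = round(sota_layers * sota_steps / pr_layers)  # -> 4071
print(sota_layers * sota_steps, pr_layers * pr_steps)   # 77350 77349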

Prior depth-recurrence results show monotonic improvement with virtual depth at equal compute:

Submission                Looped layers   Virtual layers   val_bpb (no TTT)
PR #1260 (2-layer loop)   [4,5]           ~13              1.0979
PR #1394 (3-layer loop)   [3,4,5]         17               1.0856
This PR  (4-layer loop)   [3,4,5,6]       19               pending

Local Verification

test_architecture.py validates both configs on CPU (no CUDA, no flash_attn, standard PyTorch only):

Config: SOTA (loop_end=5)
  encoder: [0, 1, 2, 3, 4, 5, 3, 4]
  decoder: [5, 3, 4, 5, 6, 7, 8, 9, 10]
  virtual_layers=17, skips=8
  forward pass: OK  shape=torch.Size([2, 32, 256])
  gradient flow: OK  loss=5.5465
  all looped blocks have clean gradients: OK

Config: This PR (loop_end=6)
  encoder: [0, 1, 2, 3, 4, 5, 6, 3, 4]
  decoder: [5, 6, 3, 4, 5, 6, 7, 8, 9, 10]
  virtual_layers=19, skips=9
  forward pass: OK  shape=torch.Size([2, 32, 256])
  gradient flow: OK  loss=5.5435
  all looped blocks have clean gradients: OK

All architecture tests PASSED.
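
For reference, the style of check performed can be sketched in a few lines of standard PyTorch. TinyBlock and the flat schedule below are stand-ins (the real test drives the actual model blocks, including the U-Net skips), so this is illustrative only:

# Minimal sketch of the CPU verification style (stand-in model, no CUDA).
import torch
import torch.nn as nn

class TinyBlock(nn.Module):          # hypothetical stand-in for a transformer block
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
    def forward(self, x):
        return x + torch.relu(self.proj(x))

dim = 256
schedule = [0,1,2,3,4,5,6,3,4] + [5,6,3,4,5,6,7,8,9,10]   # 19 virtual layers
blocks = nn.ModuleList(TinyBlock(dim) for _ in range(11)) # 11 physical layers

h = torch.randn(2, 32, dim)
for i in schedule:
    h = blocks[i](h)                 # looped blocks 3..6 run three times each
h.pow(2).mean().backward()

# Gradient-flow check: every physical layer must receive a finite gradient;
# looped blocks accumulate gradients across all three passes.
for i, blk in enumerate(blocks):
    g = blk.proj.weight.grad
    assert g is not None and torch.isfinite(g).all(), f"layer {i} grad broken"
print("forward pass: OK", h.shape)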

Reproduction

pip install brotli sentencepiece
pip install flash_attn_3 --no-deps --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291/
MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf python3 data/cached_challenge_fineweb.py --variant sp8192

SEED=42 TTT_ENABLED=1 TTT_LR=0.005 TTT_EPOCHS=3 \
  torchrun --standalone --nproc_per_node=8 train_gpt.py

No extra env vars are needed: QK_GAIN_INIT=5.25 and LOOP_END=6 are now the script defaults.
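
Because both values remain env-overridable, the prior SOTA configuration can still be launched from this branch for an A/B comparison (assuming the same data prep as above):

# A/B sanity run: env overrides restore the prior SOTA config, no code changes.
LOOP_END=5 QK_GAIN_INIT=5.0 SEED=42 TTT_ENABLED=1 TTT_LR=0.005 TTT_EPOCHS=3 \
  torchrun --standalone --nproc_per_node=8 train_gpt.py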

Results pending on 8xA100 hardware.

Test plan

  • Run 3 seeds (42, 314, 999) on 8xA100s
  • Verify training completes under 600s
  • Verify artifact under 16,000,000 bytes
  • Verify sliding-window + TTT eval under 600s
  • Report val_bpb for each seed

Commits

Extends the SOTA 3-layer depth recurrence to 4 layers by setting
LOOP_END=6 (was 5), creating 19 virtual layers from 11 physical
instead of 17. Total layer-step compute budget is identical (~77,350).

Also absorbs QK_GAIN_INIT=5.25 as the script default (was passed
as an env var in the prior SOTA submission).

Results pending on 8xA100 hardware.
- test_architecture.py: CPU-only PyTorch test verifying forward pass,
  gradient flow, and virtual layer structure for both SOTA and 4-layer
  configs; all checks pass locally
- README: add virtual layer diagrams, skip-connection table, compute
  budget equivalence analysis, and verified test output
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
… candidates

User shared a deep timeline of all recurrence experiments in the
PG competition (openai#8 through openai#1739). Several of my previously-proposed
experiments have ALREADY BEEN TESTED ON THIS STACK and shown to fail:

KILLED:
- Timing sweep earlier: openai#1726 showed 0.15 is +0.050 worse; openai#1739
  showed step-0 catastrophic (1.3936 bpb)
- Progressive ramp: openai#1663 showed hard-onset = smooth, no difference
- Position shift: openai#1726 showed layer 2-7 +0.163 worse, layer 5-6 shift
  +0.006 worse — layer 3-5 IS the empirical sweet spot

Also corrected the baseline config: openai#1736 uses LOOP_START=3 LOOP_END=5
(three layers: 3, 4, 5 — "Loop345"), not Loop45 as directory name
suggests. 3 layers × 3 passes = 17 virtual layers.

VIABLE candidates:
- Recur-Alpha (openai#1714, Anakintano): learnable scalar per looped block,
  init 0 → identity. 6 params. Author's grant ran out before TTT eval
  so composition with openai#1736's phased TTT is genuinely open. NEW TOP PICK.
- Cross-pass XSA: still novel, untested in any PR
- Loop3-6 variant (openai#1678): tashapais running it; might wait for result

Recommendation updated: port Recur-Alpha onto openai#1736 as spec 015.
~$25, identity-at-init (safe), 30 LOC, direct recurrence question.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>