SP8192 + 4-Layer Depth Recurrence (loop_end=6) #1

Open

tashapais wants to merge 2 commits into main from submission/sp8192-4layer-recurrence

Conversation

@tashapais (Owner)

Summary

Extends the current SOTA (val_bpb=1.0810, PR openai#1509) by expanding depth recurrence from 3 looped layers to 4 (LOOP_END=6). All other techniques are unchanged.

The two-line code diff from SOTA:

# Before
loop_end=int(os.environ.get('LOOP_END',5))
qk_gain_init=float(os.environ.get('QK_GAIN_INIT',5.))

# After
loop_end=int(os.environ.get('LOOP_END',6))
qk_gain_init=float(os.environ.get('QK_GAIN_INIT',5.25))
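
Both values are still read through os.environ.get, so the previous SOTA configuration remains reproducible by exporting LOOP_END=5 and QK_GAIN_INIT=5.0 at launch.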

Architecture Change

                      SOTA      This PR
Looped layers         [3,4,5]   [3,4,5,6]
Virtual layers        17        19
U-Net skips           8         9
Est. training steps   4550      ~4071

Virtual layer sequences (a derivation sketch follows the list):

  • SOTA enc: [0,1,2,3,4,5,3,4] dec: [5,3,4,5,6,7,8,9,10]
  • This PR enc: [0,1,2,3,4,5,6,3,4] dec: [5,6,3,4,5,6,7,8,9,10]
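
Both sequences follow the same pattern: the pre-loop layers, then three passes over the looped block, then the post-loop tail. A minimal sketch, assuming LOOP_START=3 and a fixed three passes (both inferred from the sequences above, not read from train_gpt.py):

# Sketch: derive the virtual layer order from loop_end (hypothetical helper).
LOOP_START, N_PHYSICAL, N_PASSES = 3, 11, 3

def virtual_layers(loop_end):
    loop = list(range(LOOP_START, loop_end + 1))        # looped block, e.g. [3,4,5,6]
    return (list(range(LOOP_START))                     # pre-loop: [0,1,2]
            + loop * N_PASSES                           # three passes over the block
            + list(range(loop_end + 1, N_PHYSICAL)))    # post-loop tail

assert len(virtual_layers(5)) == 17   # SOTA
assert len(virtual_layers(6)) == 19   # this PR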

Motivation

The total layer-step compute budget is essentially identical, differing by a single layer-step (a one-line arithmetic check follows the list):

  • SOTA: 4550 × 17 = 77,350 layer-steps
  • This PR: ~4071 × 19 = 77,349 layer-steps
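
The step count for this PR falls out of holding the SOTA budget fixed:

budget = 4550 * 17       # 77,350 layer-steps at 17 virtual layers
steps = budget // 19     # 4071 steps at 19 virtual layers
print(steps * 19)        # 77,349 layer-steps, one short of the SOTA budget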

Prior submissions show that each incremental depth expansion has improved BPB. This PR tests whether organizing the same compute as a deeper (19-layer) rather than a shallower (17-layer) virtual network continues that trend.

The additional looped layer also adds one more U-Net skip connection, enriching encoder-to-decoder cross-layer information flow.

Reproduction

pip install brotli sentencepiece
pip install flash_attn_3 --no-deps --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291/
MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf python3 data/cached_challenge_fineweb.py --variant sp8192

SEED=42 TTT_ENABLED=1 TTT_LR=0.005 TTT_EPOCHS=3 \
  torchrun --standalone --nproc_per_node=8 train_gpt.py

No extra env vars are needed: LOOP_END=6 and QK_GAIN_INIT=5.25 are now the script defaults.

Results pending on 8xA100 hardware.

Test plan

  • Run 3 seeds (42, 314, 999) on 8xA100 hardware
  • Verify training completes in under 600 s
  • Verify the artifact is under 16,000,000 bytes
  • Verify the sliding-window + TTT eval completes in under 600 s
  • Report val_bpb for each seed (a minimal acceptance-check sketch follows this list)
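
A minimal sketch of the time and size gates, assuming a hypothetical artifact path (the real harness and file names may differ):

import os, time

ARTIFACT = "model.bin"   # hypothetical path; substitute the real artifact
start = time.time()
# ... run training (the torchrun command from the Reproduction section) ...
elapsed = time.time() - start
assert elapsed < 600, f"training took {elapsed:.0f} s, over the 600 s budget"
assert os.path.getsize(ARTIFACT) < 16_000_000, "artifact exceeds 16,000,000 bytes"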

Commits

Extends the SOTA 3-layer depth recurrence to 4 layers by setting
LOOP_END=6 (was 5), creating 19 virtual layers from the 11 physical
layers instead of 17. The total layer-step compute budget is essentially
identical (~77,350 layer-steps).

Also absorbs QK_GAIN_INIT=5.25 as the script default (was passed
as an env var in the prior SOTA submission).

Results pending on 8xA100 hardware.
- test_architecture.py: CPU-only PyTorch test verifying forward pass,
  gradient flow, and virtual layer structure for both the SOTA and 4-layer
  configs; all checks pass locally (an illustrative standalone sketch
  follows below)
- README: add virtual layer diagrams, skip-connection table, compute
  budget equivalence analysis, and verified test output
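
The gradient-flow check can be illustrated with toy layers standing in for the real blocks (this sketch is not the PR's test_architecture.py; module sizes and names are made up):

import torch
import torch.nn as nn

# 11 tiny stand-in "physical" layers, applied in this PR's 19-step virtual order.
phys = nn.ModuleList([nn.Linear(8, 8) for _ in range(11)])
order = [0,1,2,3,4,5,6,3,4] + [5,6,3,4,5,6,7,8,9,10]   # enc + dec

x = torch.randn(2, 8)
for i in order:
    x = torch.relu(phys[i](x))
x.sum().backward()

# Every physical layer is visited, so every weight should hold a gradient;
# the looped layers (3-6) accumulate gradient across their three passes.
assert all(layer.weight.grad is not None for layer in phys)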