SP8192 + 4-Layer Depth Recurrence (loop_end=6) #1

Open

tashapais wants to merge 2 commits into main from submission/sp8192-4layer-recurrence

Conversation

@tashapais (Owner)

Summary

Extends the current SOTA (val_bpb=1.0810, PR openai#1509) by expanding depth recurrence from 3 looped layers to 4 (LOOP_END=6). All other techniques are unchanged.

The two-line code diff from SOTA:

# Before
loop_end=int(os.environ.get('LOOP_END',5))
qk_gain_init=float(os.environ.get('QK_GAIN_INIT',5.))

# After
loop_end=int(os.environ.get('LOOP_END',6))
qk_gain_init=float(os.environ.get('QK_GAIN_INIT',5.25))
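
Both values are still read through os.environ.get, so the previous SOTA configuration remains reproducible by exporting LOOP_END=5 and QK_GAIN_INIT=5.0 at launch.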

Architecture Change

                      SOTA      This PR
Looped layers         [3,4,5]   [3,4,5,6]
Virtual layers        17        19
U-Net skips           8         9
Est. training steps   4550      ~4071

Virtual layer sequences (a derivation sketch follows the list):

  • SOTA enc: [0,1,2,3,4,5,3,4] dec: [5,3,4,5,6,7,8,9,10]
  • This PR enc: [0,1,2,3,4,5,6,3,4] dec: [5,6,3,4,5,6,7,8,9,10]
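
Both sequences follow the same pattern: the pre-loop layers, then three passes over the looped block, then the post-loop tail. A minimal sketch, assuming LOOP_START=3 and a fixed three passes (both inferred from the sequences above, not read from train_gpt.py):

# Sketch: derive the virtual layer order from loop_end (hypothetical helper).
LOOP_START, N_PHYSICAL, N_PASSES = 3, 11, 3

def virtual_layers(loop_end):
    loop = list(range(LOOP_START, loop_end + 1))        # looped block, e.g. [3,4,5,6]
    return (list(range(LOOP_START))                     # pre-loop: [0,1,2]
            + loop * N_PASSES                           # three passes over the block
            + list(range(loop_end + 1, N_PHYSICAL)))    # post-loop tail

assert len(virtual_layers(5)) == 17   # SOTA
assert len(virtual_layers(6)) == 19   # this PR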

Motivation

The total layer-step compute budget is essentially identical, differing by a single layer-step (a one-line arithmetic check follows the list):

  • SOTA: 4550 × 17 = 77,350 layer-steps
  • This PR: ~4071 × 19 = 77,349 layer-steps
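
The step count for this PR falls out of holding the SOTA budget fixed:

budget = 4550 * 17       # 77,350 layer-steps at 17 virtual layers
steps = budget // 19     # 4071 steps at 19 virtual layers
print(steps * 19)        # 77,349 layer-steps, one short of the SOTA budget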

Prior submissions show that each incremental depth expansion has improved BPB. This PR tests whether organizing the same compute as a deeper (19-layer) rather than a shallower (17-layer) virtual network continues that trend.

The additional looped layer also adds one more U-Net skip connection, enriching encoder-to-decoder cross-layer information flow.

Reproduction

pip install brotli sentencepiece
pip install flash_attn_3 --no-deps --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291/
MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf python3 data/cached_challenge_fineweb.py --variant sp8192

SEED=42 TTT_ENABLED=1 TTT_LR=0.005 TTT_EPOCHS=3 \
  torchrun --standalone --nproc_per_node=8 train_gpt.py

No extra env vars are needed: LOOP_END=6 and QK_GAIN_INIT=5.25 are now the script defaults.

Results pending on 8xA100 hardware.

Test plan

  • Run 3 seeds (42, 314, 999) on 8xA100 hardware
  • Verify training completes in under 600 s
  • Verify the artifact is under 16,000,000 bytes
  • Verify the sliding-window + TTT eval completes in under 600 s
  • Report val_bpb for each seed (a minimal acceptance-check sketch follows this list)
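
A minimal sketch of the time and size gates, assuming a hypothetical artifact path (the real harness and file names may differ):

import os, time

ARTIFACT = "model.bin"   # hypothetical path; substitute the real artifact
start = time.time()
# ... run training (the torchrun command from the Reproduction section) ...
elapsed = time.time() - start
assert elapsed < 600, f"training took {elapsed:.0f} s, over the 600 s budget"
assert os.path.getsize(ARTIFACT) < 16_000_000, "artifact exceeds 16,000,000 bytes"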

Commits

Extends the SOTA 3-layer depth recurrence to 4 layers by setting
LOOP_END=6 (was 5), creating 19 virtual layers from the 11 physical
layers instead of 17. The total layer-step compute budget is essentially
identical (~77,350 layer-steps).

Also absorbs QK_GAIN_INIT=5.25 as the script default (was passed
as an env var in the prior SOTA submission).

Results pending on 8xA100 hardware.
- test_architecture.py: CPU-only PyTorch test verifying forward pass,
  gradient flow, and virtual layer structure for both the SOTA and 4-layer
  configs; all checks pass locally (an illustrative standalone sketch
  follows below)
- README: add virtual layer diagrams, skip-connection table, compute
  budget equivalence analysis, and verified test output
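
The gradient-flow check can be illustrated with toy layers standing in for the real blocks (this sketch is not the PR's test_architecture.py; module sizes and names are made up):

import torch
import torch.nn as nn

# 11 tiny stand-in "physical" layers, applied in this PR's 19-step virtual order.
phys = nn.ModuleList([nn.Linear(8, 8) for _ in range(11)])
order = [0,1,2,3,4,5,6,3,4] + [5,6,3,4,5,6,7,8,9,10]   # enc + dec

x = torch.randn(2, 8)
for i in order:
    x = torch.relu(phys[i](x))
x.sum().backward()

# Every physical layer is visited, so every weight should hold a gradient;
# the looped layers (3-6) accumulate gradient across their three passes.
assert all(layer.weight.grad is not None for layer in phys)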