
Non-record: Depth Recurrence + XSA + LeakyReLU² (val_bpb 1.2065)#784

Closed
iverbovoy wants to merge 7 commits into openai:main from iverbovoy:depth-recurrence-v3

Conversation


@iverbovoy iverbovoy commented Mar 25, 2026

Summary

Depth recurrence with Cross-Repeat Skip — turns stateless weight sharing into stateful depth recurrence. Each block retains and mixes its output from the previous repeat via learned per-repeat scales. This is the core novel contribution from our previous submission (#148), now improved by a further 0.013 bpb with three zero-parameter additions.

  • 3 blocks × 4 repeats (12 effective layers), dim=832, Cross-Repeat Skip, Value Embeddings
  • XSA (Exclusive Self-Attention) on last 4 effective layers: -0.010 bpb
  • LeakyReLU(0.5)² instead of relu²: -0.004 bpb
  • GPTQ-lite (best-of-5 clip percentiles) for post-training quantization
  • zstd-22 compression, SWA, Muon WD=0.04
  • 17.14M params, 15.87MB artifact, 4300 steps @ 140ms/step on 8xH100
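A minimal sketch of the cross-repeat skip, assuming plain per-(block, repeat) scalar gates; the block internals are stubbed with `nn.Linear` and all names here are hypothetical, not the PR's actual code:

```python
import torch
import torch.nn as nn

class RecurrentStack(nn.Module):
    """3 weight-shared blocks x 4 repeats with a cross-repeat skip (sketch)."""
    def __init__(self, dim=832, n_blocks=3, n_repeats=4):
        super().__init__()
        self.n_repeats = n_repeats
        # stand-ins for full transformer blocks; weights are reused every repeat
        self.blocks = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_blocks))
        # one learned scale per (block, repeat); init 0 recovers pure weight sharing
        self.skip_scale = nn.Parameter(torch.zeros(n_blocks, n_repeats))

    def forward(self, x):
        prev = [None] * len(self.blocks)  # each block's output from the previous repeat
        for r in range(self.n_repeats):
            for b, block in enumerate(self.blocks):
                h = block(x)
                if prev[b] is not None:
                    # stateful recurrence: mix in this block's output from repeat r-1
                    h = h + self.skip_scale[b, r] * prev[b]
                prev[b] = h
                x = h
        return x

out = RecurrentStack(dim=8)(torch.randn(2, 5, 8))
```

With `skip_scale` initialized to zero, training starts from plain stateless weight sharing and learns how much previous-repeat state to carry forward.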

Results

| Metric | Previous submission (#148) | This submission |
| --- | --- | --- |
| val_bpb (sliding) | 1.2196 | 1.2065 |
| roundtrip_val_bpb | 1.2533 | 1.2398 |

Ablations (8xH100, 80 shards, sliding window bpb)

| Change | Sliding bpb | Delta |
| --- | --- | --- |
| Baseline (previous submission repro) | 1.2213 | |
| + XSA last 4 layers | 1.2110 | -0.0103 |
| + LeakyReLU(0.5)² | 1.2065 | -0.0045 |
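XSA isn't spelled out in this excerpt; one plausible reading of "Exclusive Self-Attention" is causal attention whose mask also excludes each token's own position, so a token attends only to strictly earlier positions. A sketch under that assumption (names hypothetical):

```python
import torch

def xsa_mask(T, device=None):
    """Causal mask that also excludes the diagonal: position t may attend to
    positions < t but not to itself (the assumed 'exclusive' part)."""
    return torch.tril(torch.ones(T, T, dtype=torch.bool, device=device), diagonal=-1)

def xsa(q, k, v):
    """Single-head exclusive self-attention over (B, T, d) tensors."""
    T, d = q.shape[-2], q.shape[-1]
    att = (q @ k.transpose(-2, -1)) / d ** 0.5
    att = att.masked_fill(~xsa_mask(T, q.device), float("-inf"))
    # position 0 has no allowed keys; its softmax row is all NaN, so zero it out
    w = torch.nan_to_num(torch.softmax(att, dim=-1), nan=0.0)
    return w @ v
```

Under this reading, position 1 can only attend to position 0, so its output is exactly `v[0]`; position 0 attends to nothing and outputs zeros.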

GPTQ-lite and zstd-22 are post-training optimizations that reduce roundtrip quant degradation (+0.003 bpb) but do not affect pre-quant sliding window eval.
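A hedged sketch of the "best-of-5 clip percentiles" idea: for each weight row, try a handful of clip quantiles for the symmetric int8 scale and keep whichever minimizes round-trip error. The candidate list and names are illustrative, not the PR's code:

```python
import torch

def quantize_row_best_clip(w, levels=127,
                           candidates=(1.0, 0.999, 0.997, 0.995, 0.99)):
    """Symmetric int8 quantization of one weight row; search a few clip
    percentiles and keep the scale with the lowest round-trip MSE (sketch)."""
    best = None
    for pct in candidates:
        # clamp_min guards against all-zero rows, echoing the PR's
        # clamp_min(1e-12) quantization fix
        clip = torch.quantile(w.abs(), pct).clamp_min(1e-12)
        scale = clip / levels
        q = (w / scale).round().clamp(-levels, levels)
        err = ((q * scale - w) ** 2).mean()
        if best is None or err < best[0]:
            best = (err, q.to(torch.int8), scale)
    return best[1], best[2]
```

Since the full-range percentile 1.0 is always a candidate, the search can only match or beat plain max-abs scaling on round-trip MSE.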

Command

```shell
XSA_LAST_N=4 QUANT_LEVELS=127 EVAL_SEQ_LEN=1024 EVAL_STRIDE=256 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```
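The EVAL_SEQ_LEN=1024 / EVAL_STRIDE=256 sliding-window eval scores every token with up to ~1K tokens of left context, counting each target exactly once; the commits expose this via forward_logits(). A rough sketch, assuming a byte-level vocabulary (so bits per token equals bpb) and with `model` standing in for the real forward pass:

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_bpb(model, tokens, window=1024, stride=256):
    """Sliding-window eval: after the first window, only the newest `stride`
    targets of each window are scored, so every target counts once."""
    nll, count, prev_end = 0.0, 0, 0
    for start in range(0, max(len(tokens) - 1, 1), stride):
        end = min(start + window, len(tokens))
        x = tokens[start:end - 1].unsqueeze(0)      # inputs for this window
        y = tokens[start + 1:end]                   # shifted targets
        logits = model(x)[0]                        # (T, vocab)
        new = end - 1 - prev_end                    # not-yet-scored targets
        nll += F.cross_entropy(logits[-new:], y[-new:], reduction="sum").item()
        count += new
        prev_end = end - 1
        if end == len(tokens):
            break
    return nll / count / math.log(2)                # nats -> bits per token
```

With uniform logits over a 16-symbol vocabulary this returns exactly log2(16) = 4 bits per token, a quick sanity check.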

Commits

- Replace 9 unique blocks with 3 blocks × 4 repeats (12 effective layers)
- Increase dim from 512 to 832, remove U-Net skips
- Add loop_embed for timestep encoding per effective layer
- Add cross-repeat skip: each block mixes in its output from previous repeat
  with per-repeat learned scales (stateful recurrence)
- Add 2 value embedding tables mixed into each layer with learned scales
- 17.14M params, best result: 1.6780 bpb (int8+zlib) on 2000 steps batch 8K
- Add eval_val_ttt: adapts model on each val batch before evaluating
- For each batch: save weights → K gradient steps → evaluate → restore
- Controlled by TTT_STEPS (default 0 = disabled) and TTT_LR (default 1e-4)
- Result: -0.010 bpb improvement on 200-step test (2.4124 → 2.4027)
- TTT eval runs after normal roundtrip eval, reports both scores
- Sliding window eval: window=1024, stride=256, ~-0.034 bpb
- forward_logits() method for sliding window support
- LR x0.3: matrix=0.012, embed=0.015, scalar=0.012 (sweep winner)
- GRAD_CLIP_NORM=0.3 for recurrence stability
- WARMDOWN_ITERS=3000
- train@1024 (not 2048) — better for recurrence (160ms vs 253ms/step)
- Fix grad_accum for non-power-of-2 GPU counts
- Best result: 1.2308 bpb sliding window on 6xH100 (3726 steps)
- Fix quantization clamp_min(1/ql) -> clamp_min(1e-12) preventing
  broken roundtrip on undertrained models
- Add Muon weight decay (0.04) for training stability
- Add SWA with float32 accumulation and final snapshot inclusion
- Remove sweep.sh
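The eval_val_ttt loop described in the commits above (save weights → K gradient steps → evaluate → restore) can be sketched roughly as follows; the optimizer choice and the loss_fn signature are assumptions, not the PR's actual code:

```python
import copy
import torch

def eval_val_ttt(model, loss_fn, val_batches, ttt_steps=1, ttt_lr=1e-4):
    """For each val batch: snapshot weights, take K adaptation steps on that
    batch, evaluate on it, then restore the snapshot before the next batch."""
    losses = []
    for batch in val_batches:
        snapshot = copy.deepcopy(model.state_dict())
        opt = torch.optim.SGD(model.parameters(), lr=ttt_lr)
        for _ in range(ttt_steps):
            opt.zero_grad()
            loss_fn(model, batch).backward()
            opt.step()
        with torch.no_grad():
            losses.append(loss_fn(model, batch).item())
        model.load_state_dict(snapshot)  # restore: no leakage across batches
    return sum(losses) / len(losses)
```

Restoring from the snapshot after every batch keeps the adaptation strictly per-batch, matching the "save → adapt → evaluate → restore" description.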
Improvements over previous submission (1.2196 → 1.2070, -0.013 bpb):
- XSA (Exclusive Self-Attention) on last 4 effective layers: -0.010 bpb
- LeakyReLU(0.5)² instead of relu²: -0.004 bpb
- GPTQ-lite: per-row best-of-5 clip percentiles for quantization
- zstd-22 compression instead of zlib (saves ~1.85MB artifact)
- SWA tuned to frac=0.4, every=50

Tested on 8xH100, 80 train shards, PyTorch 2.5, 4290 steps.
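For reference, a literal reading of the LeakyReLU(0.5)² activation; the exact squaring convention isn't shown in this excerpt, so treat this as an assumption:

```python
import torch
import torch.nn.functional as F

def leaky_relu_sq(x, negative_slope=0.5):
    """LeakyReLU(0.5) squared, read literally as leaky_relu(x, 0.5) ** 2.
    Unlike relu(x) ** 2, negative inputs contribute (0.5 * x) ** 2 instead of
    exactly 0, so gradients also flow on the negative side of zero."""
    return F.leaky_relu(x, negative_slope) ** 2
```

For example, an input of -2 maps to (-1)² = 1 rather than 0, and +3 maps to 9 just as with relu².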
Improvements over previous submission (1.2196 → 1.2065, -0.013 bpb):
- XSA (Exclusive Self-Attention) on last 4 effective layers: -0.010 bpb
- LeakyReLU(0.5)² instead of relu²: -0.004 bpb
- GPTQ-lite: per-row best-of-5 clip percentiles
- zstd-22 compression instead of zlib
- SWA tuned to frac=0.4, every=50

8xH100, 80 train shards, 4300 steps, 140ms/step, 15.87MB artifact.
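The SWA described above (float32 accumulation, snapshots every 50 steps over the last 40% of training, final snapshot included) could look roughly like this running average; the names and update form are illustrative, not the PR's code:

```python
import torch

@torch.no_grad()
def update_swa(swa_state, model, n_averaged):
    """Fold the model's current weights into a float32 running average.
    Accumulating in float32 avoids the drift a bf16 average would pick up
    over many snapshots."""
    for name, p in model.state_dict().items():
        if name not in swa_state:
            swa_state[name] = p.detach().float().clone()
        else:
            # incremental mean: avg += (x - avg) / (n + 1)
            swa_state[name] += (p.detach().float() - swa_state[name]) / (n_averaged + 1)
    return n_averaged + 1
```

The caller would invoke this every 50 steps during the final 40% of training and once more on the final weights, then load `swa_state` back into the model before quantization.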
@iverbovoy (Author)

Superseded by #1384 — clean submission with 3-seed validation.
