
Non-record: Depth Recurrence + XSA + LeakyReLU² (val_bpb 1.2065)#784

Closed
iverbovoy wants to merge 7 commits into openai:main from iverbovoy:depth-recurrence-v3

Conversation


@iverbovoy iverbovoy commented Mar 25, 2026

Summary

Depth recurrence with Cross-Repeat Skip — turns stateless weight sharing into stateful depth recurrence. Each block retains and mixes its output from the previous repeat via learned per-repeat scales. This is the core novel contribution from our previous submission (#148), now improved by a further 0.013 bpb with three zero-parameter additions.

  • 3 blocks × 4 repeats (12 effective layers), dim=832, Cross-Repeat Skip, Value Embeddings
  • XSA (Exclusive Self-Attention) on last 4 effective layers: -0.010 bpb
  • LeakyReLU(0.5)² instead of relu²: -0.004 bpb
  • GPTQ-lite (best-of-5 clip percentiles) for post-training quantization
  • zstd-22 compression, SWA, Muon WD=0.04
  • 17.14M params, 15.87MB artifact, 4300 steps @ 140ms/step on 8xH100
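A minimal sketch of the cross-repeat skip, assuming plain per-(block, repeat) scalar gates; the block internals are stubbed with `nn.Linear` and all names here are hypothetical, not the PR's actual code:

```python
import torch
import torch.nn as nn

class RecurrentStack(nn.Module):
    """3 weight-shared blocks x 4 repeats with a cross-repeat skip (sketch)."""
    def __init__(self, dim=832, n_blocks=3, n_repeats=4):
        super().__init__()
        self.n_repeats = n_repeats
        # stand-ins for full transformer blocks; weights are reused every repeat
        self.blocks = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_blocks))
        # one learned scale per (block, repeat); init 0 recovers pure weight sharing
        self.skip_scale = nn.Parameter(torch.zeros(n_blocks, n_repeats))

    def forward(self, x):
        prev = [None] * len(self.blocks)  # each block's output from the previous repeat
        for r in range(self.n_repeats):
            for b, block in enumerate(self.blocks):
                h = block(x)
                if prev[b] is not None:
                    # stateful recurrence: mix in this block's output from repeat r-1
                    h = h + self.skip_scale[b, r] * prev[b]
                prev[b] = h
                x = h
        return x

out = RecurrentStack(dim=8)(torch.randn(2, 5, 8))
```

With `skip_scale` initialized to zero, training starts from plain stateless weight sharing and learns how much previous-repeat state to carry forward.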

Results

| Metric | Previous submission (#148) | This submission |
| --- | --- | --- |
| val_bpb (sliding) | 1.2196 | 1.2065 |
| roundtrip_val_bpb | 1.2533 | 1.2398 |

Ablations (8xH100, 80 shards, sliding window bpb)

| Change | Sliding bpb | Delta |
| --- | --- | --- |
| Baseline (previous submission repro) | 1.2213 | |
| + XSA last 4 layers | 1.2110 | -0.0103 |
| + LeakyReLU(0.5)² | 1.2065 | -0.0045 |
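XSA isn't spelled out in this excerpt; one plausible reading of "Exclusive Self-Attention" is causal attention whose mask also excludes each token's own position, so a token attends only to strictly earlier positions. A sketch under that assumption (names hypothetical):

```python
import torch

def xsa_mask(T, device=None):
    """Causal mask that also excludes the diagonal: position t may attend to
    positions < t but not to itself (the assumed 'exclusive' part)."""
    return torch.tril(torch.ones(T, T, dtype=torch.bool, device=device), diagonal=-1)

def xsa(q, k, v):
    """Single-head exclusive self-attention over (B, T, d) tensors."""
    T, d = q.shape[-2], q.shape[-1]
    att = (q @ k.transpose(-2, -1)) / d ** 0.5
    att = att.masked_fill(~xsa_mask(T, q.device), float("-inf"))
    # position 0 has no allowed keys; its softmax row is all NaN, so zero it out
    w = torch.nan_to_num(torch.softmax(att, dim=-1), nan=0.0)
    return w @ v
```

Under this reading, position 1 can only attend to position 0, so its output is exactly `v[0]`; position 0 attends to nothing and outputs zeros.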

GPTQ-lite and zstd-22 are post-training optimizations that reduce roundtrip quant degradation (+0.003 bpb) but do not affect pre-quant sliding window eval.
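A hedged sketch of the "best-of-5 clip percentiles" idea: for each weight row, try a handful of clip quantiles for the symmetric int8 scale and keep whichever minimizes round-trip error. The candidate list and names are illustrative, not the PR's code:

```python
import torch

def quantize_row_best_clip(w, levels=127,
                           candidates=(1.0, 0.999, 0.997, 0.995, 0.99)):
    """Symmetric int8 quantization of one weight row; search a few clip
    percentiles and keep the scale with the lowest round-trip MSE (sketch)."""
    best = None
    for pct in candidates:
        # clamp_min guards against all-zero rows, echoing the PR's
        # clamp_min(1e-12) quantization fix
        clip = torch.quantile(w.abs(), pct).clamp_min(1e-12)
        scale = clip / levels
        q = (w / scale).round().clamp(-levels, levels)
        err = ((q * scale - w) ** 2).mean()
        if best is None or err < best[0]:
            best = (err, q.to(torch.int8), scale)
    return best[1], best[2]
```

Since the full-range percentile 1.0 is always a candidate, the search can only match or beat plain max-abs scaling on round-trip MSE.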

Command

```shell
XSA_LAST_N=4 QUANT_LEVELS=127 EVAL_SEQ_LEN=1024 EVAL_STRIDE=256 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```
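The EVAL_SEQ_LEN=1024 / EVAL_STRIDE=256 sliding-window eval scores every token with up to ~1K tokens of left context, counting each target exactly once; the commits expose this via forward_logits(). A rough sketch, assuming a byte-level vocabulary (so bits per token equals bpb) and with `model` standing in for the real forward pass:

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_bpb(model, tokens, window=1024, stride=256):
    """Sliding-window eval: after the first window, only the newest `stride`
    targets of each window are scored, so every target counts once."""
    nll, count, prev_end = 0.0, 0, 0
    for start in range(0, max(len(tokens) - 1, 1), stride):
        end = min(start + window, len(tokens))
        x = tokens[start:end - 1].unsqueeze(0)      # inputs for this window
        y = tokens[start + 1:end]                   # shifted targets
        logits = model(x)[0]                        # (T, vocab)
        new = end - 1 - prev_end                    # not-yet-scored targets
        nll += F.cross_entropy(logits[-new:], y[-new:], reduction="sum").item()
        count += new
        prev_end = end - 1
        if end == len(tokens):
            break
    return nll / count / math.log(2)                # nats -> bits per token
```

With uniform logits over a 16-symbol vocabulary this returns exactly log2(16) = 4 bits per token, a quick sanity check.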

Commits

- Replace 9 unique blocks with 3 blocks × 4 repeats (12 effective layers)
- Increase dim from 512 to 832, remove U-Net skips
- Add loop_embed for timestep encoding per effective layer
- Add cross-repeat skip: each block mixes in its output from previous repeat
  with per-repeat learned scales (stateful recurrence)
- Add 2 value embedding tables mixed into each layer with learned scales
- 17.14M params, best result: 1.6780 bpb (int8+zlib) on 2000 steps batch 8K
- Add eval_val_ttt: adapts model on each val batch before evaluating
- For each batch: save weights → K gradient steps → evaluate → restore
- Controlled by TTT_STEPS (default 0 = disabled) and TTT_LR (default 1e-4)
- Result: -0.010 bpb improvement on 200-step test (2.4124 → 2.4027)
- TTT eval runs after normal roundtrip eval, reports both scores
- Sliding window eval: window=1024, stride=256, ~-0.034 bpb
- forward_logits() method for sliding window support
- LR x0.3: matrix=0.012, embed=0.015, scalar=0.012 (sweep winner)
- GRAD_CLIP_NORM=0.3 for recurrence stability
- WARMDOWN_ITERS=3000
- train@1024 (not 2048) — better for recurrence (160ms vs 253ms/step)
- Fix grad_accum for non-power-of-2 GPU counts
- Best result: 1.2308 bpb sliding window on 6xH100 (3726 steps)
- Fix quantization clamp_min(1/ql) -> clamp_min(1e-12) preventing
  broken roundtrip on undertrained models
- Add Muon weight decay (0.04) for training stability
- Add SWA with float32 accumulation and final snapshot inclusion
- Remove sweep.sh
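The eval_val_ttt loop described in the commits above (save weights → K gradient steps → evaluate → restore) can be sketched roughly as follows; the optimizer choice and the loss_fn signature are assumptions, not the PR's actual code:

```python
import copy
import torch

def eval_val_ttt(model, loss_fn, val_batches, ttt_steps=1, ttt_lr=1e-4):
    """For each val batch: snapshot weights, take K adaptation steps on that
    batch, evaluate on it, then restore the snapshot before the next batch."""
    losses = []
    for batch in val_batches:
        snapshot = copy.deepcopy(model.state_dict())
        opt = torch.optim.SGD(model.parameters(), lr=ttt_lr)
        for _ in range(ttt_steps):
            opt.zero_grad()
            loss_fn(model, batch).backward()
            opt.step()
        with torch.no_grad():
            losses.append(loss_fn(model, batch).item())
        model.load_state_dict(snapshot)  # restore: no leakage across batches
    return sum(losses) / len(losses)
```

Restoring from the snapshot after every batch keeps the adaptation strictly per-batch, matching the "save → adapt → evaluate → restore" description.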
Improvements over previous submission (1.2196 → 1.2070, -0.013 bpb):
- XSA (Exclusive Self-Attention) on last 4 effective layers: -0.010 bpb
- LeakyReLU(0.5)² instead of relu²: -0.004 bpb
- GPTQ-lite: per-row best-of-5 clip percentiles for quantization
- zstd-22 compression instead of zlib (saves ~1.85MB artifact)
- SWA tuned to frac=0.4, every=50

Tested on 8xH100, 80 train shards, PyTorch 2.5, 4290 steps.
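For reference, a literal reading of the LeakyReLU(0.5)² activation; the exact squaring convention isn't shown in this excerpt, so treat this as an assumption:

```python
import torch
import torch.nn.functional as F

def leaky_relu_sq(x, negative_slope=0.5):
    """LeakyReLU(0.5) squared, read literally as leaky_relu(x, 0.5) ** 2.
    Unlike relu(x) ** 2, negative inputs contribute (0.5 * x) ** 2 instead of
    exactly 0, so gradients also flow on the negative side of zero."""
    return F.leaky_relu(x, negative_slope) ** 2
```

For example, an input of -2 maps to (-1)² = 1 rather than 0, and +3 maps to 9 just as with relu².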
Improvements over previous submission (1.2196 → 1.2065, -0.013 bpb):
- XSA (Exclusive Self-Attention) on last 4 effective layers: -0.010 bpb
- LeakyReLU(0.5)² instead of relu²: -0.004 bpb
- GPTQ-lite: per-row best-of-5 clip percentiles
- zstd-22 compression instead of zlib
- SWA tuned to frac=0.4, every=50

8xH100, 80 train shards, 4300 steps, 140ms/step, 15.87MB artifact.
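The SWA described above (float32 accumulation, snapshots every 50 steps over the last 40% of training, final snapshot included) could look roughly like this running average; the names and update form are illustrative, not the PR's code:

```python
import torch

@torch.no_grad()
def update_swa(swa_state, model, n_averaged):
    """Fold the model's current weights into a float32 running average.
    Accumulating in float32 avoids the drift a bf16 average would pick up
    over many snapshots."""
    for name, p in model.state_dict().items():
        if name not in swa_state:
            swa_state[name] = p.detach().float().clone()
        else:
            # incremental mean: avg += (x - avg) / (n + 1)
            swa_state[name] += (p.detach().float() - swa_state[name]) / (n_averaged + 1)
    return n_averaged + 1
```

The caller would invoke this every 50 steps during the final 40% of training and once more on the final weights, then load `swa_state` back into the model before quantization.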
@iverbovoy (Author)

Superseded by #1384 — clean submission with 3-seed validation.
