# Depth Recurrence in Parameter Golf — Research Summary

Ivan Verbovoy (@iverbovoy) · 20.03.2026 → 20.04.2026

## TL;DR

A single-person submission exploring **depth recurrence** (3 shared transformer blocks × 4 repeats = 12 effective layers) as an alternative to the flat 10–11-layer architectures that dominate the leaderboard. Best result: **val_bpb 1.1324 (3-seed mean)** on the 10-min track (PR [#1453](https://github.com/openai/parameter-golf/pull/1453)). An additional non-record 4-hour run reached **1.0889** (PR [#895](https://github.com/openai/parameter-golf/pull/895)). OpenAI acknowledged the approach as novel and published a dedicated non-record PR [#363](https://github.com/openai/parameter-golf/pull/363) inspired by similar exploration.

## Architecture

```
tok_emb (+ optional BigramHash) + value_embeds × 2
for repeat in {0..3}:
for block in {A, B, C}: # 3 shared blocks
x += loop_embed[layer_idx] # per effective layer
x += Σ value_scales[l,e] * ve_e # per effective layer
x += cross_repeat_scale * block_out_prev_repeat # stateful recurrence
x = block(x, x0, use_xsa=(layer_idx ≥ xsa_start))
final_norm + tied LM head + softcap
```

Key weight-sharing components (a runnable sketch follows the list):
- **loop_embed** `(effective_depth, model_dim)` — positional signal per effective layer
- **cross_repeat_scales** `(num_blocks, num_repeats-1, dim)` — stateful residual from prev repeat
- **resid_mix** — learned per-dim mix between current and block-0 residual
- **XSA** — last 4 effective layers subtract self-value projection
- **Hedge Mixer** — eval-time online mixture of Neural + Unigram + Bigram + Trigram(hash 65K) + Entropy experts
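
A minimal PyTorch sketch of the recurrent forward pass described above. `Block` is a stand-in, and the module/parameter names (`loop_embed`, `value_scales`, `cross_repeat_scales`) mirror this summary rather than the submission's actual code:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Stand-in transformer block; the real one has GQA attention, an MLP,
    LeakyReLU(0.5)^2, and an XSA switch on the value projection."""
    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 3 * dim), nn.GELU(), nn.Linear(3 * dim, dim))

    def forward(self, x, x0, use_xsa: bool = False):
        a, _ = self.attn(x, x, x, need_weights=False)
        return x + a + self.mlp(x)

class DepthRecurrentCore(nn.Module):
    """3 shared blocks x 4 repeats = 12 effective layers."""
    def __init__(self, dim=880, num_blocks=3, num_repeats=4,
                 num_value_embeds=2, xsa_last_n=4):
        super().__init__()
        depth = num_blocks * num_repeats
        self.blocks = nn.ModuleList([Block(dim) for _ in range(num_blocks)])
        self.loop_embed = nn.Parameter(torch.zeros(depth, dim))
        self.value_scales = nn.Parameter(torch.zeros(depth, num_value_embeds))
        # Weighted residual from each block's own output in the previous repeat.
        self.cross_repeat_scales = nn.Parameter(torch.zeros(num_blocks, num_repeats - 1, dim))
        self.num_repeats, self.xsa_start = num_repeats, depth - xsa_last_n

    def forward(self, x, value_embeds):  # value_embeds: list of (B, T, dim) token lookups
        x0, prev_out = x, [None] * len(self.blocks)
        for r in range(self.num_repeats):
            for b, block in enumerate(self.blocks):
                layer = r * len(self.blocks) + b          # effective layer index 0..11
                h = x + self.loop_embed[layer]            # depth-positional signal
                for e, ve in enumerate(value_embeds):     # per-layer value-embed mix
                    h = h + self.value_scales[layer, e] * ve
                if prev_out[b] is not None:               # stateful cross-repeat skip
                    h = h + self.cross_repeat_scales[b, r - 1] * prev_out[b]
                x = block(h, x0, use_xsa=(layer >= self.xsa_start))
                prev_out[b] = x
        return x
```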

## Progression

| Date | PR | Track | Key idea | val_bpb |
|:----:|:--:|:-----:|:---------|--------:|
| 20.03 | [#148](https://github.com/openai/parameter-golf/pull/148) | 10min | Depth Recurrence + Cross-Repeat Skip | 1.2196 |
| 25.03 | [#784](https://github.com/openai/parameter-golf/pull/784) | 10min | + XSA(4) + LeakyReLU²(0.5) | 1.2065 |
| 26.03 | [#835](https://github.com/openai/parameter-golf/pull/835) | 10min | + Progressive Depth (2→3→4 repeats) | 1.1980 |
| 26.03 | [#856](https://github.com/openai/parameter-golf/pull/856) | 10min | + Hedge Mixer | 1.1454 |
| 26.03 | **[#895](https://github.com/openai/parameter-golf/pull/895)** | 4h | 4-hour Progressive Depth | **1.0889** |
| 05.04 | [#1384](https://github.com/openai/parameter-golf/pull/1384) | 10min | + tuned schedule + WD + SWA (3-seed) | 1.1441 |
| 07.04 | **[#1453](https://github.com/openai/parameter-golf/pull/1453)** | 10min | + **Int7 attn + Int5 MLP mixed quant** (3-seed) | **1.1324** |

## Experiments catalog

### What worked (baseline 1.1324)

| Technique | Δ bpb | Notes |
|-----------|-------:|:------|
| Depth Recurrence 3×4 | — | Core architecture, enables 23.7M params in 16MB |
| Cross-Repeat Skip | −0.03 | Prev-repeat residual makes recurrence stateful |
| Value embeds (2 tables) | −0.07 | Critical. Adds per-layer token lookup |
| XSA last 4 | −0.01 | Self-value bias removal at top layers |
| Progressive Depth (0.30:2, 0.50:3, 1.0:4) | −0.005 | Ramp repeats during training |
| SWA (start 0.6, every 30) | −0.01 | ~44 checkpoints averaged |
| Hedge Mixer (5 experts) | −0.05 | Eval-time mixture, but stochastic (std 0.013) |
| **Int7 attn + Int5 MLP mixed quant** | −0.012 | Frees 2MB for d=880 mlp×3 vs d=832 mlp×2 |
| Muon optimizer + WD=0.04 | — | Standard for challenge |

### What did NOT improve on the 1.1324 mean

Each tested on 1–3 seeds; in every case neither the sliding nor the hedge-mean metric improved:

| Technique | Result | Why |
|-----------|:------:|:----|
| BigramHash 2048×112 | −0.005 ❌ | Too few buckets, hash collisions dominate |
| BigramHash 3072×112 | +0.005 ❌ | Single-seed −0.003 but 3-seed mean worse: stabilizes hedge but cuts peaks (seed 7 went 1.1193→1.1444) |
| BigramHash 4096×112 | +0.004 ❌ | Past sweet spot, sparse buckets degrade |
| Noisy QAT (default) | +0.011 ❌ | Noise on int5 MLP too large (~amax/15), SWA collects pre-QAT checkpoints |
| LoRA rank-2 per-repeat (attn.proj, mlp.proj) | +0.013 ❌ | Per-repeat signal already saturated by loop_embed + cross_repeat_scales |
| XSA-all (12 layers) | worse | Optimum is last 4, early XSA hurts |
| Inter-repeat RMSNorm | worse | Breaks scaling balance |
| EMA (τ=0.997) | +22ms/step | CPU overhead > benefit at our scale |
| Partial RoPE + VRL + LN Scale (combined) | worse | Too many interacting changes |
| MuonEq-R optimizer | diverged | Incompatible with our Muon setup |
| Auxiliary losses (edge-of-chaos regularization) | neutral | χ stabilized but bpb unchanged at 5 repeats |
| 3×6 d=960 | worse | Fewer steps dominates |
| 6×2 d=640/736 | worse | Too narrow |
| 4L × 3rep | worse | Fewer unique blocks in limited compute |
| TTT (LoRA-based) | −0.002 | Positive but 410s eval; dropped for budget |
| SD-clip k=3.5, k=10 | worse | Percentile-search already near optimum for int8 |

### GPTQ with Hessian error compensation (3-seed validated)

Implemented column-wise GPTQ with training-data calibration (no access to val). It collects `X^T X` per `nn.Linear` over 5 training batches, then quantizes column by column with Cholesky(H_inv) error compensation. ~100 lines added to the 1,496-line submission.
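
A condensed sketch of the column-wise update, following the standard GPTQ algorithm; the per-row symmetric quantizer and all names here are illustrative assumptions, not the submission's exact code:

```python
import torch

@torch.no_grad()
def gptq_quantize(W: torch.Tensor, H: torch.Tensor, levels: int = 63, damp: float = 0.01):
    """W: (rows, cols) weight of an nn.Linear. H: (cols, cols) Hessian proxy,
    accumulated as X^T X over ~5 calibration batches (e.g. via forward hooks).
    Quantizes column j, then spreads the rounding error over the remaining
    columns using the upper Cholesky factor of H^{-1}."""
    W, H = W.clone().float(), H.clone().float()
    cols = W.shape[1]
    H += damp * H.diagonal().mean() * torch.eye(cols, device=H.device)   # dampening
    Hinv = torch.cholesky_inverse(torch.linalg.cholesky(H))
    U = torch.linalg.cholesky(Hinv, upper=True)
    half = levels // 2
    scale = (W.abs().amax(dim=1) / half).clamp_min(1e-12)                # per-row scale
    Q = torch.zeros_like(W)
    for j in range(cols):
        w = W[:, j]
        q = (w / scale).round().clamp(-half, half)
        Q[:, j] = q
        err = (w - q * scale) / U[j, j]                                  # normalized error
        W[:, j:] -= err.unsqueeze(1) * U[j, j:].unsqueeze(0)             # compensate rest
    return Q * scale.unsqueeze(1)                                        # dequantized weight
```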

| Seed | roundtrip Δ | sliding Δ | hedge Δ |
|------|------------:|----------:|--------:|
| 1337 | −0.0034 | −0.0033 | +0.008 |
| 42 | −0.0007 | −0.0008 | −0.0006 |
| 7 | −0.0013 | −0.0013 | +0.023 |
| **3-seed mean** | **−0.0018** | **−0.0018** | **+0.010** |

**Deterministic improvement** on sliding/roundtrip (both −0.002). **Hedge mean worse by +0.010** — submission #1453's seed 7 hedge was unusually low (1.1193) and we couldn't reproduce that luck in our session.

Implication: GPTQ makes the model genuinely better (sliding/roundtrip = deterministic metric of model quality), but `val_bpb` is scored on hedge which has ±0.013 seed variance + ±0.008 session variance. The model-level gain gets dominated by hedge stochasticity.

We are not submitting GPTQ as a replacement — #1453 remains the best hedge-mean result. The GPTQ-enhanced code is kept as a reference.

## Key insights

### 1. Depth recurrence is viable but not SOTA for this challenge

Our 1.1324 (3-seed) vs SOTA 1.1147 (abaybektursun's flat 11×512 + AR Self-Gen GPTQ + BigramHash 3072×112) — a gap of ~0.018. Evangelinehelsinki's separate exploration found that a flat 11-layer model beats 3×3 recurrence by ~0.025 with the same trick stack. **Recurrence trades unique parameters for effective depth**, which helps fit 23.7M params in 16MB but underperforms a flat architecture per layer.

### 2. Hedge Mixer dominates and destabilizes

Hedge gives ~−0.05 bpb lift over sliding but has huge variance:
- **±0.013 bpb between seeds** (same config)
- **±0.008 bpb between sessions** at identical model weights (a sanity run confirmed roundtrip/sliding match to within 0.0002 while hedge diverged by 0.008)

Most architectural gains get absorbed by hedge noise. Deterministic metrics (sliding, roundtrip) are the reliable signal.

### 3. Weight-sharing saturates quickly

On 3×4 recurrence:
- loop_embed + cross_repeat_scales + value_scales already provide per-repeat variance
- LoRA per-repeat on top **hurt** (+0.006 sliding) — the model was already using available capacity
- Inter-repeat RMSNorm also hurt

Additional per-repeat degrees of freedom have diminishing/negative returns.

### 4. Progressive Depth schedule matters

Shifting the schedule from (0.40:2, 0.65:3, 1.0:4) to **(0.30:2, 0.50:3, 1.0:4)** gave −0.004 bpb — 55% more full-depth training steps. This was combined with a longer warmdown (3000 vs 2000 iterations) and denser SWA (every 30 steps vs 50) at a higher start fraction (0.6 vs 0.4), averaging ~44 checkpoints.
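
A minimal sketch of how a `PROG_DEPTH` string like the one above can be interpreted; the helper names are hypothetical:

```python
def parse_prog_depth(spec: str) -> list[tuple[float, int]]:
    """Parse "0.30:2,0.50:3,1.0:4" into (wallclock_frac_boundary, repeats) pairs."""
    return [(float(f), int(r)) for f, r in
            (item.split(":") for item in spec.split(","))]

def repeats_at(frac: float, schedule: list[tuple[float, int]]) -> int:
    """Number of repeats at training fraction `frac` (phases recompile on change)."""
    for boundary, repeats in schedule:
        if frac <= boundary:
            return repeats
    return schedule[-1][1]

schedule = parse_prog_depth("0.30:2,0.50:3,1.0:4")
assert repeats_at(0.10, schedule) == 2   # 0-30%:   6 effective layers
assert repeats_at(0.40, schedule) == 3   # 30-50%:  9 effective layers
assert repeats_at(0.75, schedule) == 4   # 50-100%: full 12 layers
```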

### 5. Mixed quantization > uniform

Separating attn (int7, 63 levels) from MLP (int5, 16 levels):
- Attention quality drop dominates total loss at low precision → keep attn higher
- MLP tolerates aggressive quantization → allows 2MB saving
- The 2MB saved lets the model grow from d=832 with MLP 2× to d=880 with MLP 3×

Gain: −0.012 bpb.
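
A sketch of the mixed scheme under these assumptions: per-row symmetric uniform quantization, and routing by parameter name (the real submission's routing and quantizer details may differ):

```python
import torch

def quantize_uniform(w: torch.Tensor, levels: int) -> torch.Tensor:
    """Per-row uniform quantization to `levels` levels (63 -> [-31, 31], 16 -> [-8, 7])."""
    lo, hi = -(levels // 2), (levels - 1) // 2
    scale = (w.abs().amax(dim=-1, keepdim=True) / max(-lo, hi)).clamp_min(1e-12)
    return (w / scale).round().clamp(lo, hi) * scale

@torch.no_grad()
def quantize_mixed(model: torch.nn.Module, attn_levels: int = 63, mlp_levels: int = 16):
    """Int7-equivalent for attention matrices, int5-equivalent for MLP matrices."""
    for name, p in model.named_parameters():
        if p.ndim < 2:
            continue                         # keep scales/norms/embedding vectors intact
        levels = mlp_levels if "mlp" in name else attn_levels
        p.copy_(quantize_uniform(p, levels))
```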

### 6. Calibration data makes GPTQ work

The original percentile-search GPTQ ("GPTQ-lite" in our code) only optimizes the per-row clip point via MSE. Full GPTQ with column-wise Hessian error compensation gave a deterministic −0.002 to −0.003 on sliding. Training-data calibration worked; AR self-gen calibration would likely stabilize it further.

## Files

- Main submission: `records/track_non_record_16mb/2026-04-08_DepthRecurrence_Int7MixedQuant_HedgeMixer/` (PR #1453 backing)
- 4-hour submission: PR #895
- Experimental code variants in repo root: `train_gpt_refactored.py`, `train_gpt_exp1.py`, etc.

## Reproduction

Config used for PR #1453 (submitted):
```
MODEL_DIM=880 NUM_HEADS=8 NUM_KV_HEADS=4 MLP_MULT=3
NUM_LAYERS=3 NUM_REPEATS=4
QUANT_LEVELS=63 MLP_QUANT_LEVELS=16
PROG_DEPTH="0.30:2,0.50:3,1.0:4"
WARMDOWN_ITERS=3000
SWA_START_FRAC=0.6 SWA_EVERY=30
MATRIX_LR=0.018 MUON_WD=0.04
XSA_LAST_N=4 QK_GAIN_INIT=1.5
USE_HEDGE=1 HEDGE_ETA=0.1
MAX_WALLCLOCK_SECONDS=600
```

3 seeds tested (1337, 42, 7) on 8× H100 SXM 80GB, PyTorch 2.5.1.

## Resource footprint

- RunPod compute grant: ~$950 of $1000 used
- ~25 full training runs + calibration experiments
- 1 person, 32 days

## Acknowledgments

Thanks to OpenAI for running this challenge and sponsoring the compute grant. Thanks to **abaybektursun**, **thwu1**, **Raahil Shah**, **Evangelinehelsinki** for publishing detailed submissions that informed several of my experiments (particularly GPTQ calibration, BigramHash sizing, and the noisy-QAT analysis for recurrent architectures).
# Non-record: Depth Recurrence + Int7 Mixed Quant + Parallel Hedge Mixer

**val_bpb: 1.1324** (3-seed mean, std 0.0131) | **~15.40 MB** | 8×H100 SXM, 600s

Improves on [PR #1384](https://github.com/openai/parameter-golf/pull/1384) (1.1441 bpb) by **−0.012 bpb** through mixed int7/int5 quantization, which frees room for a wider MLP-3× model, plus a parallelized hedge-mixer eval.

## Results (8×H100 80GB SXM, PyTorch 2.5.1)

| Seed | Steps | ms/step | Roundtrip | Sliding | **Hedge** | Artifact | Eval time |
|------|-------|---------|-----------|---------|-----------|----------|-----------|
| 1337 | 4,247 | 141.3ms | 1.2168 | 1.1832 | **1.1324** | 15.40 MB | 167s |
| 42 | 4,389 | 136.7ms | 1.2172 | 1.1840 | **1.1454** | 15.28 MB | 164s |
| 7 | 4,391 | 136.7ms | 1.2163 | 1.1828 | **1.1193** | 15.29 MB | 163s |
| **Mean** | **4,342** | **138.2ms** | **1.2168** | **1.1834** | **1.1324** | | **~164s** |

Additional seeds for variance analysis: seed 2024 → 1.1431, seed 99 → 1.1405. 5-seed mean: **1.1361** (std 0.0095).

## Changes vs PR #1384 (1.1441 bpb)

| Change | Effect | Impact |
|--------|--------|--------|
| MLP 2× → 3× (d=832→880) | +38% parameters, wider model | −0.013 sliding bpb |
| Int8 → **Int7 attn** + Int5 MLP | Fits larger model in 16MB budget | enables above |
| Earlier progressive depth (30/50 vs 40/65) | +55% full-depth training steps | −0.004 bpb |
| More SWA (every 30, start 0.6) | 43 checkpoints vs 13 | smoother average |
| Parallel hedge eval (8 GPU) | 580s → 164s eval time | fits 10 min budget |

## Key Finding: Int7 Attention is the Sweet Spot

Standard approaches use uniform quantization (all int8 or all int6). Experiments show that **attention and MLP weights have very different sensitivity to quantization**:

- **Attention weights** directly affect the neural expert in hedge mixer. Int6 (31 levels) causes hedge boost to drop from −0.052 to −0.039 — a significant quality loss.
- **MLP weights** tolerate aggressive quantization. Int5 (16 levels) compresses well with minimal quality impact.
- **Int7 (63 levels)** for attention recovers hedge boost to −0.051, nearly matching int8's −0.052.

The 2MB saved by using int5 MLP instead of int8 is reinvested into a wider model (d=880 with MLP 3× vs d=832 with MLP 2×).

| Quant config | Model | Sliding | Hedge | Hedge boost | Size | Fits? |
|-------------|-------|---------|-------|-------------|------|-------|
| Int8 attn + Int5 MLP | d=896 | 1.1760 | 1.1349 | −0.041 | 17.4 MB | ✗ |
| **Int7 attn + Int5 MLP** | **d=880** | **1.1832** | **1.1324** | **−0.051** | **15.4 MB** | **✓** |
| Int6 attn + Int5 MLP | d=896 | 1.1870 | 1.1480 | −0.039 | 15.4 MB | ✓ |

## Architecture: Depth Recurrence

Instead of 9–11 unique transformer blocks, **3 shared blocks are repeated 4 times** (12 effective layers). This trades unique parameters for effective depth, fitting 23.7M parameters into ~15.4 MB.

| Parameter | Value |
|-----------|-------|
| Layers × Repeats | 3 × 4 (12 effective) |
| Model dim | 880 |
| Heads / KV heads | 8 / 4 (head_dim=110) |
| MLP multiplier | 3× (hidden=2640) |
| Vocab size | 1024 (SP BPE) |
| Parameters | 23.7M |
| Logit softcap | 30.0 |

### Recurrence components

- **Cross-Repeat Skip**: Each block receives a weighted residual from its own output in the previous repeat — turns stateless recurrence into stateful
- **Loop Embedding**: Learned per-layer vector added before each block — depth-wise positional encoding for shared weights
- **Value Embeddings**: 2 extra embedding tables mixed into the residual stream at each effective layer with learned scales
- **XSA (Exclusive Self-Attention)**: On last 4 effective layers — prevents attention collapse in deep recurrent models
- **LeakyReLU(0.5)²**: Better gradient flow than ReLU² for deep/recurrent models

## Progressive Depth Training

Training uses increasing recurrence depth, recompiling at phase boundaries:

| Phase | Wallclock | Repeats | Effective layers | Step speed |
|-------|-----------|---------|-----------------|------------|
| 0–30% | 0–180s | 2 | 6 | ~90ms |
| 30–50% | 180–300s | 3 | 9 | ~105ms |
| 50–100% | 300–600s | 4 | 12 | ~130ms |

Schedule tuned for the MLP 3× config: earlier transitions (30/50% vs 40/65% in PR #1384) give +55% more steps at full depth. Warmdown 3000 iterations, SWA every 30 steps from LR scale < 0.6 (~43 checkpoints).
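
A minimal sketch of the checkpoint averaging; the trigger condition mirrors the config (`SWA_START_FRAC=0.6`, `SWA_EVERY=30`) but the class itself is hypothetical:

```python
import torch

class SWA:
    """Equal-weight running average of model checkpoints."""
    def __init__(self):
        self.avg, self.n = None, 0

    @torch.no_grad()
    def update(self, model: torch.nn.Module):
        sd = {k: v.detach().float().clone() for k, v in model.state_dict().items()}
        if self.avg is None:
            self.avg, self.n = sd, 1
            return
        self.n += 1
        for k in self.avg:                       # running mean: a += (x - a) / n
            self.avg[k] += (sd[k] - self.avg[k]) / self.n

# inside the training loop (hypothetical wiring):
#   if lr_scale < 0.6 and step % 30 == 0:
#       swa.update(model)                        # ~43 checkpoints by end of run
#   ...and before final eval: model.load_state_dict(swa.avg)
```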

## Eval: Parallel Hedge Mixer

5-expert online ensemble with **8-GPU parallelized forward pass**:

| Expert | Description |
|--------|-------------|
| Neural | Model's own logits (log-softmax) |
| Unigram | Global token frequency with Laplace smoothing |
| Bigram | Conditional P(token \| prev_token) |
| Trigram | Hashed trigram context (65K buckets) |
| Entropy | Model's entropy as calibration signal |
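
For concreteness, a sketch of the hashed trigram expert; the hash function, Laplace constant, and dense count table are assumptions (a real implementation would use a compact dtype):

```python
import torch

class HashedTrigramExpert:
    """Counts P(next | prev2, prev1) over 65,536 hashed two-token contexts."""
    def __init__(self, vocab_size: int = 1024, buckets: int = 65536, alpha: float = 0.1):
        self.counts = torch.zeros(buckets, vocab_size)   # dense for clarity only
        self.buckets, self.alpha = buckets, alpha

    def _bucket(self, t1: int, t2: int) -> int:
        return (t1 * 1_000_003 + t2) % self.buckets      # simple multiplicative hash

    def update(self, t1: int, t2: int, nxt: int):        # online counting during eval
        self.counts[self._bucket(t1, t2), nxt] += 1

    def log_prob(self, t1: int, t2: int) -> torch.Tensor:
        row = self.counts[self._bucket(t1, t2)] + self.alpha   # Laplace smoothing
        return (row / row.sum()).log()                   # (V,) log P(next | context)
```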

**Parallelization**: Each batch of eval windows is split across 8 GPUs for the forward pass, with logits gathered via `all_gather` to rank 0 for sequential mixer scoring. This reduces hedge eval from 580s (single GPU) to **164s**, fitting within the 10-minute eval budget.
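
A sketch of that split, assuming the window count divides evenly across ranks (padding elided); the names are illustrative:

```python
import torch
import torch.distributed as dist

@torch.no_grad()
def parallel_forward_logits(model, windows: torch.Tensor, rank: int, world_size: int):
    """Each rank runs the neural forward pass on its shard of eval windows;
    full logits are gathered so rank 0 can run the sequential mixer scoring."""
    shard = windows.chunk(world_size)[rank].cuda()       # contiguous per-rank shard
    logits = model(shard)                                # (n/world_size, T, V)
    gathered = [torch.empty_like(logits) for _ in range(world_size)]
    dist.all_gather(gathered, logits)                    # same shape on every rank
    return torch.cat(gathered) if rank == 0 else None
```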

Hedge provides **−0.051 bpb improvement** over sliding window (1.1834 → 1.1324 mean).
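
The mixer itself is an online Hedge (multiplicative-weights) update over the experts' per-token predictions. A minimal sketch, assuming `HEDGE_ETA=0.1` is the multiplicative-weights learning rate and that experts expose log-probabilities (the metric below is bits per token; the challenge metric normalizes per byte):

```python
import math
import torch

def hedge_mix(expert_log_probs: torch.Tensor, targets: torch.Tensor, eta: float = 0.1):
    """expert_log_probs: (T, E, V) per-token log-probs from E experts.
    targets: (T,) next-token ids. Returns mean bits per token of the mixture."""
    T, E, V = expert_log_probs.shape
    log_w = torch.zeros(E)                                # uniform prior over experts
    nats = 0.0
    for t in range(T):
        lp = expert_log_probs[t]                          # (E, V)
        log_mix = torch.logsumexp(log_w.log_softmax(0).unsqueeze(1) + lp, dim=0)
        nats += -log_mix[targets[t]].item()               # mixture's log-loss
        log_w = log_w + eta * lp[:, targets[t]]           # upweight accurate experts
    return nats / (T * math.log(2))
```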

## Training Details

| Parameter | Value |
|-----------|-------|
| Optimizer | Muon (matrices) + Adam (scalars, embeddings) |
| Matrix / Scalar LR | 0.018 / 0.018 |
| Tied embed LR | 0.021 |
| Muon WD | 0.04 |
| Muon momentum | 0.95 (warmup 0.85→0.95 over 500 steps) |
| Grad clip | 0.3 |
| Batch tokens | 524,288 |
| Quantization | Int7 attn (63 levels) + Int5 MLP (16 levels) + zstd-22 |

## Evolution

| PR | Date | Score | What changed |
|----|------|-------|-------------|
| [#148](https://github.com/openai/parameter-golf/pull/148) | Mar 20 | 1.2196 (sliding) | Depth recurrence (3×4), cross-repeat skip, value embeddings |
| [#784](https://github.com/openai/parameter-golf/pull/784) | Mar 25 | 1.2065 (sliding) | + XSA(4), LeakyReLU², GPTQ-lite, zstd-22 |
| [#835](https://github.com/openai/parameter-golf/pull/835) | Mar 26 | 1.1980 (sliding) | + Progressive depth training (+30% steps) |
| [#1384](https://github.com/openai/parameter-golf/pull/1384) | Apr 5 | 1.1441 (hedge) | + Hedge Mixer (5-expert eval-time ensemble) |
| **This PR** | Apr 8 | **1.1324** (hedge) | + Int7 mixed quant, MLP 3×, d=880, parallel hedge |

## Lineage

- Depth recurrence architecture — original to this submission line
- XSA from [PR #198](https://github.com/openai/parameter-golf/pull/198) (unnir)
- LeakyReLU² from [PR #493](https://github.com/openai/parameter-golf/pull/493) (parinzee)
- Mixed int5/int6 quantization concept from [PR #549](https://github.com/openai/parameter-golf/pull/549) (thwu1), extended here to int7
- SWA, Muon WD from modded-nanogpt community

## Reproducing

```bash
SEED=1337 QUANT_LEVELS=63 MLP_QUANT_LEVELS=16 \
MODEL_DIM=880 NUM_HEADS=8 NUM_KV_HEADS=4 MLP_MULT=3 \
NUM_LAYERS=3 NUM_REPEATS=4 XSA_LAST_N=4 NUM_VALUE_EMBEDS=2 \
PROG_DEPTH="0.30:2,0.50:3,1.0:4" \
WARMDOWN_ITERS=3000 SWA_START_FRAC=0.6 SWA_EVERY=30 \
VOCAB_SIZE=1024 TRAIN_SEQ_LEN=1024 TRAIN_BATCH_TOKENS=524288 \
torchrun --nproc_per_node=8 train_gpt.py
```
Submission metadata:

```json
{
  "author": "Ivan Verbovoy",
  "github_id": "iverbovoy",
  "name": "Depth Recurrence + Int7 Mixed Quantization + Parallel Hedge Mixer",
  "blurb": "3 shared blocks x 4 repeats (12 effective layers) with MLP 3x (d=880), progressive depth (2->3->4 repeats), int7 attention (63 levels) + int5 MLP (16 levels) mixed quantization, 8-GPU parallel Hedge Mixer eval. Key finding: int7 is the sweet spot for attention quantization — recovers 98% of int8 hedge quality while saving 2MB for a wider model. 5 seeds tested, 3-seed mean reported.",
  "date": "2026-04-08T00:00:00Z",
  "val_loss": 1.91197327,
  "val_bpb": 1.13237601,
  "roundtrip_val_bpb": 1.21676461,
  "sliding_val_bpb": 1.18335612,
  "seeds": [1337, 42, 7],
  "mean_steps": 4342,
  "wallclock_seconds": 600,
  "eval_seconds": 164,
  "bytes_model_int8_zstd22": 15403955
}
```