# Depth Recurrence in Parameter Golf — Research Summary

Ivan Verbovoy (@iverbovoy) · 20.03.2026 → 20.04.2026

## TL;DR

A single-person submission exploring **depth recurrence** (3 shared transformer blocks × 4 repeats = 12 effective layers) as an alternative to the flat 10–11-layer architectures that dominate the leaderboard. Best result: **val_bpb 1.1324 (3-seed mean)** on the 10-min track (PR [#1453](https://github.com/openai/parameter-golf/pull/1453)). An additional non-record 4-hour run reached **1.0889** (PR [#895](https://github.com/openai/parameter-golf/pull/895)). OpenAI acknowledged the approach as novel and published a dedicated non-record PR [#363](https://github.com/openai/parameter-golf/pull/363) inspired by similar exploration.

## Architecture

```
tok_emb (+ optional BigramHash) + value_embeds × 2
for repeat in {0..3}:
for block in {A, B, C}: # 3 shared blocks
x += loop_embed[layer_idx] # per effective layer
x += Σ value_scales[l,e] * ve_e # per effective layer
x += cross_repeat_scale * block_out_prev_repeat # stateful recurrence
x = block(x, x0, use_xsa=(layer_idx ≥ xsa_start))
final_norm + tied LM head + softcap
```

Key weight-sharing components (a runnable sketch follows the list):
- **loop_embed** `(effective_depth, model_dim)` — positional signal per effective layer
- **cross_repeat_scales** `(num_blocks, num_repeats-1, dim)` — stateful residual from prev repeat
- **resid_mix** — learned per-dim mix between current and block-0 residual
- **XSA** — last 4 effective layers subtract self-value projection
- **Hedge Mixer** — eval-time online mixture of Neural + Unigram + Bigram + Trigram(hash 65K) + Entropy experts
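
A minimal PyTorch sketch of the recurrent forward pass described above. `Block` is a stand-in, and the module/parameter names (`loop_embed`, `value_scales`, `cross_repeat_scales`) mirror this summary rather than the submission's actual code:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Stand-in transformer block; the real one has GQA attention, an MLP,
    LeakyReLU(0.5)^2, and an XSA switch on the value projection."""
    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 3 * dim), nn.GELU(), nn.Linear(3 * dim, dim))

    def forward(self, x, x0, use_xsa: bool = False):
        a, _ = self.attn(x, x, x, need_weights=False)
        return x + a + self.mlp(x)

class DepthRecurrentCore(nn.Module):
    """3 shared blocks x 4 repeats = 12 effective layers."""
    def __init__(self, dim=880, num_blocks=3, num_repeats=4,
                 num_value_embeds=2, xsa_last_n=4):
        super().__init__()
        depth = num_blocks * num_repeats
        self.blocks = nn.ModuleList([Block(dim) for _ in range(num_blocks)])
        self.loop_embed = nn.Parameter(torch.zeros(depth, dim))
        self.value_scales = nn.Parameter(torch.zeros(depth, num_value_embeds))
        # Weighted residual from each block's own output in the previous repeat.
        self.cross_repeat_scales = nn.Parameter(torch.zeros(num_blocks, num_repeats - 1, dim))
        self.num_repeats, self.xsa_start = num_repeats, depth - xsa_last_n

    def forward(self, x, value_embeds):  # value_embeds: list of (B, T, dim) token lookups
        x0, prev_out = x, [None] * len(self.blocks)
        for r in range(self.num_repeats):
            for b, block in enumerate(self.blocks):
                layer = r * len(self.blocks) + b          # effective layer index 0..11
                h = x + self.loop_embed[layer]            # depth-positional signal
                for e, ve in enumerate(value_embeds):     # per-layer value-embed mix
                    h = h + self.value_scales[layer, e] * ve
                if prev_out[b] is not None:               # stateful cross-repeat skip
                    h = h + self.cross_repeat_scales[b, r - 1] * prev_out[b]
                x = block(h, x0, use_xsa=(layer >= self.xsa_start))
                prev_out[b] = x
        return x
```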

## Progression

| Date | PR | Track | Key idea | val_bpb |
|:----:|:--:|:-----:|:---------|--------:|
| 20.03 | [#148](https://github.com/openai/parameter-golf/pull/148) | 10min | Depth Recurrence + Cross-Repeat Skip | 1.2196 |
| 25.03 | [#784](https://github.com/openai/parameter-golf/pull/784) | 10min | + XSA(4) + LeakyReLU²(0.5) | 1.2065 |
| 26.03 | [#835](https://github.com/openai/parameter-golf/pull/835) | 10min | + Progressive Depth (2→3→4 repeats) | 1.1980 |
| 26.03 | [#856](https://github.com/openai/parameter-golf/pull/856) | 10min | + Hedge Mixer | 1.1454 |
| 26.03 | **[#895](https://github.com/openai/parameter-golf/pull/895)** | 4h | 4-hour Progressive Depth | **1.0889** |
| 05.04 | [#1384](https://github.com/openai/parameter-golf/pull/1384) | 10min | + tuned schedule + WD + SWA (3-seed) | 1.1441 |
| 07.04 | **[#1453](https://github.com/openai/parameter-golf/pull/1453)** | 10min | + **Int7 attn + Int5 MLP mixed quant** (3-seed) | **1.1324** |

## Experiments catalog

### What worked (baseline 1.1324)

| Technique | Δ bpb | Notes |
|-----------|-------:|:------|
| Depth Recurrence 3×4 | — | Core architecture, enables 23.7M params in 16MB |
| Cross-Repeat Skip | −0.03 | Prev-repeat residual makes recurrence stateful |
| Value embeds (2 tables) | −0.07 | Critical. Adds per-layer token lookup |
| XSA last 4 | −0.01 | Self-value bias removal at top layers |
| Progressive Depth (0.30:2, 0.50:3, 1.0:4) | −0.005 | Ramp repeats during training |
| SWA (start 0.6, every 30) | −0.01 | ~44 checkpoints averaged |
| Hedge Mixer (5 experts) | −0.05 | Eval-time mixture, but stochastic (std 0.013) |
| **Int7 attn + Int5 MLP mixed quant** | −0.012 | Frees 2MB for d=880 mlp×3 vs d=832 mlp×2 |
| Muon optimizer + WD=0.04 | — | Standard for challenge |

### What did NOT improve on the 1.1324 mean

Each tested on 1–3 seeds; in every case neither the sliding nor the hedge-mean metric improved:

| Technique | Result | Why |
|-----------|:------:|:----|
| BigramHash 2048×112 | −0.005 ❌ | Too few buckets, hash collisions dominate |
| BigramHash 3072×112 | +0.005 ❌ | Single-seed −0.003 but 3-seed mean worse: stabilizes hedge but cuts peaks (seed 7 went 1.1193→1.1444) |
| BigramHash 4096×112 | +0.004 ❌ | Past sweet spot, sparse buckets degrade |
| Noisy QAT (default) | +0.011 ❌ | Noise on int5 MLP too large (~amax/15), SWA collects pre-QAT checkpoints |
| LoRA rank-2 per-repeat (attn.proj, mlp.proj) | +0.013 ❌ | Per-repeat signal already saturated by loop_embed + cross_repeat_scales |
| XSA-all (12 layers) | worse | Optimum is last 4, early XSA hurts |
| Inter-repeat RMSNorm | worse | Breaks scaling balance |
| EMA (τ=0.997) | +22ms/step | CPU overhead > benefit at our scale |
| Partial RoPE + VRL + LN Scale (combined) | worse | Too many interacting changes |
| MuonEq-R optimizer | diverged | Incompatible with our Muon setup |
| Auxiliary losses (edge-of-chaos regularization) | neutral | χ stabilized but bpb unchanged at 5 repeats |
| 3×6 d=960 | worse | Fewer steps dominates |
| 6×2 d=640/736 | worse | Too narrow |
| 4L × 3rep | worse | Fewer unique blocks in limited compute |
| TTT (LoRA-based) | −0.002 | Positive but 410s eval; dropped for budget |
| SD-clip k=3.5, k=10 | worse | Percentile-search already near optimum for int8 |

### GPTQ with Hessian error compensation (3-seed validated)

Implemented column-wise GPTQ with training-data calibration (no access to val). It collects `X^T X` per `nn.Linear` over 5 training batches, then quantizes column by column with Cholesky(H_inv) error compensation. ~100 lines added to the 1,496-line submission.
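
A condensed sketch of the column-wise update, following the standard GPTQ algorithm; the per-row symmetric quantizer and all names here are illustrative assumptions, not the submission's exact code:

```python
import torch

@torch.no_grad()
def gptq_quantize(W: torch.Tensor, H: torch.Tensor, levels: int = 63, damp: float = 0.01):
    """W: (rows, cols) weight of an nn.Linear. H: (cols, cols) Hessian proxy,
    accumulated as X^T X over ~5 calibration batches (e.g. via forward hooks).
    Quantizes column j, then spreads the rounding error over the remaining
    columns using the upper Cholesky factor of H^{-1}."""
    W, H = W.clone().float(), H.clone().float()
    cols = W.shape[1]
    H += damp * H.diagonal().mean() * torch.eye(cols, device=H.device)   # dampening
    Hinv = torch.cholesky_inverse(torch.linalg.cholesky(H))
    U = torch.linalg.cholesky(Hinv, upper=True)
    half = levels // 2
    scale = (W.abs().amax(dim=1) / half).clamp_min(1e-12)                # per-row scale
    Q = torch.zeros_like(W)
    for j in range(cols):
        w = W[:, j]
        q = (w / scale).round().clamp(-half, half)
        Q[:, j] = q
        err = (w - q * scale) / U[j, j]                                  # normalized error
        W[:, j:] -= err.unsqueeze(1) * U[j, j:].unsqueeze(0)             # compensate rest
    return Q * scale.unsqueeze(1)                                        # dequantized weight
```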

| Seed | roundtrip Δ | sliding Δ | hedge Δ |
|------|------------:|----------:|--------:|
| 1337 | −0.0034 | −0.0033 | +0.008 |
| 42 | −0.0007 | −0.0008 | −0.0006 |
| 7 | −0.0013 | −0.0013 | +0.023 |
| **3-seed mean** | **−0.0018** | **−0.0018** | **+0.010** |

**Deterministic improvement** on sliding/roundtrip (both −0.002). **Hedge mean worse by +0.010** — submission #1453's seed 7 hedge was unusually low (1.1193) and we couldn't reproduce that luck in our session.

Implication: GPTQ makes the model genuinely better (sliding/roundtrip = deterministic metric of model quality), but `val_bpb` is scored on hedge which has ±0.013 seed variance + ±0.008 session variance. The model-level gain gets dominated by hedge stochasticity.

We are not submitting GPTQ as a replacement — #1453 remains the best hedge-mean result. The GPTQ-enhanced code is kept as a reference.

## Key insights

### 1. Depth recurrence is viable but not SOTA for this challenge

Our 1.1324 (3-seed) vs SOTA 1.1147 (abaybektursun's flat 11×512 + AR Self-Gen GPTQ + BigramHash 3072×112) — a gap of ~0.018. Evangelinehelsinki's separate exploration found that a flat 11-layer model beats 3×3 recurrence by ~0.025 with the same trick stack. **Recurrence trades unique parameters for effective depth**, which helps fit 23.7M params in 16MB but underperforms a flat architecture per layer.

### 2. Hedge Mixer dominates and destabilizes

Hedge gives ~−0.05 bpb lift over sliding but has huge variance:
- **±0.013 bpb between seeds** (same config)
- **±0.008 bpb between sessions** at identical model weights (a sanity run confirmed roundtrip/sliding match to within 0.0002 while hedge diverged by 0.008)

Most architectural gains get absorbed by hedge noise. Deterministic metrics (sliding, roundtrip) are the reliable signal.

### 3. Weight-sharing saturates quickly

On 3×4 recurrence:
- loop_embed + cross_repeat_scales + value_scales already provide per-repeat variance
- LoRA per-repeat on top **hurt** (+0.006 sliding) — the model was already using available capacity
- Inter-repeat RMSNorm also hurt

Additional per-repeat degrees of freedom have diminishing/negative returns.

### 4. Progressive Depth schedule matters

Shifting the schedule from (0.40:2, 0.65:3, 1.0:4) to **(0.30:2, 0.50:3, 1.0:4)** gave −0.004 bpb — 55% more full-depth training steps. This was combined with a longer warmdown (3000 vs 2000 iterations) and denser SWA (every 30 steps vs 50) at a higher start fraction (0.6 vs 0.4), averaging ~44 checkpoints.
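
A minimal sketch of how a `PROG_DEPTH` string like the one above can be interpreted; the helper names are hypothetical:

```python
def parse_prog_depth(spec: str) -> list[tuple[float, int]]:
    """Parse "0.30:2,0.50:3,1.0:4" into (wallclock_frac_boundary, repeats) pairs."""
    return [(float(f), int(r)) for f, r in
            (item.split(":") for item in spec.split(","))]

def repeats_at(frac: float, schedule: list[tuple[float, int]]) -> int:
    """Number of repeats at training fraction `frac` (phases recompile on change)."""
    for boundary, repeats in schedule:
        if frac <= boundary:
            return repeats
    return schedule[-1][1]

schedule = parse_prog_depth("0.30:2,0.50:3,1.0:4")
assert repeats_at(0.10, schedule) == 2   # 0-30%:   6 effective layers
assert repeats_at(0.40, schedule) == 3   # 30-50%:  9 effective layers
assert repeats_at(0.75, schedule) == 4   # 50-100%: full 12 layers
```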

### 5. Mixed quantization > uniform

Separating attn (int7, 63 levels) from MLP (int5, 16 levels):
- Attention quality drop dominates total loss at low precision → keep attn higher
- MLP tolerates aggressive quantization → allows 2MB saving
- The 2MB saved lets the model grow from d=832 with MLP 2× to d=880 with MLP 3×

Gain: −0.012 bpb.
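
A sketch of the mixed scheme under these assumptions: per-row symmetric uniform quantization, and routing by parameter name (the real submission's routing and quantizer details may differ):

```python
import torch

def quantize_uniform(w: torch.Tensor, levels: int) -> torch.Tensor:
    """Per-row uniform quantization to `levels` levels (63 -> [-31, 31], 16 -> [-8, 7])."""
    lo, hi = -(levels // 2), (levels - 1) // 2
    scale = (w.abs().amax(dim=-1, keepdim=True) / max(-lo, hi)).clamp_min(1e-12)
    return (w / scale).round().clamp(lo, hi) * scale

@torch.no_grad()
def quantize_mixed(model: torch.nn.Module, attn_levels: int = 63, mlp_levels: int = 16):
    """Int7-equivalent for attention matrices, int5-equivalent for MLP matrices."""
    for name, p in model.named_parameters():
        if p.ndim < 2:
            continue                         # keep scales/norms/embedding vectors intact
        levels = mlp_levels if "mlp" in name else attn_levels
        p.copy_(quantize_uniform(p, levels))
```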

### 6. Calibration data makes GPTQ work

The original percentile-search GPTQ ("GPTQ-lite" in our code) only optimizes the per-row clip point via MSE. Full GPTQ with column-wise Hessian error compensation gave a deterministic −0.002 to −0.003 on sliding. Training-data calibration worked; AR self-gen calibration would likely stabilize it further.

## Files

- Main submission: `records/track_non_record_16mb/2026-04-08_DepthRecurrence_Int7MixedQuant_HedgeMixer/` (PR #1453 backing)
- 4-hour submission: PR #895
- Experimental code variants in repo root: `train_gpt_refactored.py`, `train_gpt_exp1.py`, etc.

## Reproduction

Config used for PR #1453 (submitted):
```
MODEL_DIM=880 NUM_HEADS=8 NUM_KV_HEADS=4 MLP_MULT=3
NUM_LAYERS=3 NUM_REPEATS=4
QUANT_LEVELS=63 MLP_QUANT_LEVELS=16
PROG_DEPTH="0.30:2,0.50:3,1.0:4"
WARMDOWN_ITERS=3000
SWA_START_FRAC=0.6 SWA_EVERY=30
MATRIX_LR=0.018 MUON_WD=0.04
XSA_LAST_N=4 QK_GAIN_INIT=1.5
USE_HEDGE=1 HEDGE_ETA=0.1
MAX_WALLCLOCK_SECONDS=600
```

3 seeds tested (1337, 42, 7) on 8× H100 SXM 80GB, PyTorch 2.5.1.

## Resource footprint

- RunPod compute grant: ~$950 of $1000 used
- ~25 full training runs + calibration experiments
- 1 person, 32 days

## Acknowledgments

Thanks to OpenAI for running this challenge and sponsoring the compute grant. Thanks to **abaybektursun**, **thwu1**, **Raahil Shah**, **Evangelinehelsinki** for publishing detailed submissions that informed several of my experiments (particularly GPTQ calibration, BigramHash sizing, and the noisy-QAT analysis for recurrent architectures).
# Non-record: Depth Recurrence + Int7 Mixed Quant + Parallel Hedge Mixer

**val_bpb: 1.1324** (3-seed mean, std 0.0131) | **~15.40 MB** | 8×H100 SXM, 600s

Improves on [PR #1384](https://github.com/openai/parameter-golf/pull/1384) (1.1441 bpb) by **−0.012 bpb** through mixed int7/int5 quantization, which frees room for a wider MLP-3× model, plus a parallelized hedge-mixer eval.

## Results (8×H100 80GB SXM, PyTorch 2.5.1)

| Seed | Steps | ms/step | Roundtrip | Sliding | **Hedge** | Artifact | Eval time |
|------|-------|---------|-----------|---------|-----------|----------|-----------|
| 1337 | 4,247 | 141.3ms | 1.2168 | 1.1832 | **1.1324** | 15.40 MB | 167s |
| 42 | 4,389 | 136.7ms | 1.2172 | 1.1840 | **1.1454** | 15.28 MB | 164s |
| 7 | 4,391 | 136.7ms | 1.2163 | 1.1828 | **1.1193** | 15.29 MB | 163s |
| **Mean** | **4,342** | **138.2ms** | **1.2168** | **1.1834** | **1.1324** | | **~164s** |

Additional seeds for variance analysis: seed 2024 → 1.1431, seed 99 → 1.1405. 5-seed mean: **1.1361** (std 0.0095).

## Changes vs PR #1384 (1.1441 bpb)

| Change | Effect | Impact |
|--------|--------|--------|
| MLP 2× → 3× (d=832→880) | +38% parameters, wider model | −0.013 sliding bpb |
| Int8 → **Int7 attn** + Int5 MLP | Fits larger model in 16MB budget | enables above |
| Earlier progressive depth (30/50 vs 40/65) | +55% full-depth training steps | −0.004 bpb |
| More SWA (every 30, start 0.6) | 43 checkpoints vs 13 | smoother average |
| Parallel hedge eval (8 GPU) | 580s → 164s eval time | fits 10 min budget |

## Key Finding: Int7 Attention is the Sweet Spot

Standard approaches use uniform quantization (all int8 or all int6). Experiments show that **attention and MLP weights have very different sensitivity to quantization**:

- **Attention weights** directly affect the neural expert in hedge mixer. Int6 (31 levels) causes hedge boost to drop from −0.052 to −0.039 — a significant quality loss.
- **MLP weights** tolerate aggressive quantization. Int5 (16 levels) compresses well with minimal quality impact.
- **Int7 (63 levels)** for attention recovers hedge boost to −0.051, nearly matching int8's −0.052.

The 2MB saved by using int5 MLP instead of int8 is reinvested into a wider model (d=880 with MLP 3× vs d=832 with MLP 2×).

| Quant config | Model | Sliding | Hedge | Hedge boost | Size | Fits? |
|-------------|-------|---------|-------|-------------|------|-------|
| Int8 attn + Int5 MLP | d=896 | 1.1760 | 1.1349 | −0.041 | 17.4 MB | ✗ |
| **Int7 attn + Int5 MLP** | **d=880** | **1.1832** | **1.1324** | **−0.051** | **15.4 MB** | **✓** |
| Int6 attn + Int5 MLP | d=896 | 1.1870 | 1.1480 | −0.039 | 15.4 MB | ✓ |

## Architecture: Depth Recurrence

Instead of 9–11 unique transformer blocks, **3 shared blocks are repeated 4 times** (12 effective layers). This trades unique parameters for effective depth, fitting 23.7M parameters into ~15.4 MB.

| Parameter | Value |
|-----------|-------|
| Layers × Repeats | 3 × 4 (12 effective) |
| Model dim | 880 |
| Heads / KV heads | 8 / 4 (head_dim=110) |
| MLP multiplier | 3× (hidden=2640) |
| Vocab size | 1024 (SP BPE) |
| Parameters | 23.7M |
| Logit softcap | 30.0 |

### Recurrence components

- **Cross-Repeat Skip**: Each block receives a weighted residual from its own output in the previous repeat — turns stateless recurrence into stateful
- **Loop Embedding**: Learned per-layer vector added before each block — depth-wise positional encoding for shared weights
- **Value Embeddings**: 2 extra embedding tables mixed into the residual stream at each effective layer with learned scales
- **XSA (Exclusive Self-Attention)**: On last 4 effective layers — prevents attention collapse in deep recurrent models
- **LeakyReLU(0.5)²**: Better gradient flow than ReLU² for deep/recurrent models

## Progressive Depth Training

Training uses increasing recurrence depth, recompiling at phase boundaries:

| Phase | Wallclock | Repeats | Effective layers | Step speed |
|-------|-----------|---------|-----------------|------------|
| 0–30% | 0–180s | 2 | 6 | ~90ms |
| 30–50% | 180–300s | 3 | 9 | ~105ms |
| 50–100% | 300–600s | 4 | 12 | ~130ms |

Schedule tuned for the MLP 3× config: earlier transitions (30/50% vs 40/65% in PR #1384) give +55% more steps at full depth. Warmdown 3000 iterations, SWA every 30 steps from LR scale < 0.6 (~43 checkpoints).
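
A minimal sketch of the checkpoint averaging; the trigger condition mirrors the config (`SWA_START_FRAC=0.6`, `SWA_EVERY=30`) but the class itself is hypothetical:

```python
import torch

class SWA:
    """Equal-weight running average of model checkpoints."""
    def __init__(self):
        self.avg, self.n = None, 0

    @torch.no_grad()
    def update(self, model: torch.nn.Module):
        sd = {k: v.detach().float().clone() for k, v in model.state_dict().items()}
        if self.avg is None:
            self.avg, self.n = sd, 1
            return
        self.n += 1
        for k in self.avg:                       # running mean: a += (x - a) / n
            self.avg[k] += (sd[k] - self.avg[k]) / self.n

# inside the training loop (hypothetical wiring):
#   if lr_scale < 0.6 and step % 30 == 0:
#       swa.update(model)                        # ~43 checkpoints by end of run
#   ...and before final eval: model.load_state_dict(swa.avg)
```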

## Eval: Parallel Hedge Mixer

5-expert online ensemble with **8-GPU parallelized forward pass**:

| Expert | Description |
|--------|-------------|
| Neural | Model's own logits (log-softmax) |
| Unigram | Global token frequency with Laplace smoothing |
| Bigram | Conditional P(token \| prev_token) |
| Trigram | Hashed trigram context (65K buckets) |
| Entropy | Model's entropy as calibration signal |
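
For concreteness, a sketch of the hashed trigram expert; the hash function, Laplace constant, and dense count table are assumptions (a real implementation would use a compact dtype):

```python
import torch

class HashedTrigramExpert:
    """Counts P(next | prev2, prev1) over 65,536 hashed two-token contexts."""
    def __init__(self, vocab_size: int = 1024, buckets: int = 65536, alpha: float = 0.1):
        self.counts = torch.zeros(buckets, vocab_size)   # dense for clarity only
        self.buckets, self.alpha = buckets, alpha

    def _bucket(self, t1: int, t2: int) -> int:
        return (t1 * 1_000_003 + t2) % self.buckets      # simple multiplicative hash

    def update(self, t1: int, t2: int, nxt: int):        # online counting during eval
        self.counts[self._bucket(t1, t2), nxt] += 1

    def log_prob(self, t1: int, t2: int) -> torch.Tensor:
        row = self.counts[self._bucket(t1, t2)] + self.alpha   # Laplace smoothing
        return (row / row.sum()).log()                   # (V,) log P(next | context)
```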

**Parallelization**: Each batch of eval windows is split across 8 GPUs for the forward pass, with logits gathered via `all_gather` to rank 0 for sequential mixer scoring. This reduces hedge eval from 580s (single GPU) to **164s**, fitting within the 10-minute eval budget.
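
A sketch of that split, assuming the window count divides evenly across ranks (padding elided); the names are illustrative:

```python
import torch
import torch.distributed as dist

@torch.no_grad()
def parallel_forward_logits(model, windows: torch.Tensor, rank: int, world_size: int):
    """Each rank runs the neural forward pass on its shard of eval windows;
    full logits are gathered so rank 0 can run the sequential mixer scoring."""
    shard = windows.chunk(world_size)[rank].cuda()       # contiguous per-rank shard
    logits = model(shard)                                # (n/world_size, T, V)
    gathered = [torch.empty_like(logits) for _ in range(world_size)]
    dist.all_gather(gathered, logits)                    # same shape on every rank
    return torch.cat(gathered) if rank == 0 else None
```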

Hedge provides **−0.051 bpb improvement** over sliding window (1.1834 → 1.1324 mean).
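
The mixer itself is an online Hedge (multiplicative-weights) update over the experts' per-token predictions. A minimal sketch, assuming `HEDGE_ETA=0.1` is the multiplicative-weights learning rate and that experts expose log-probabilities (the metric below is bits per token; the challenge metric normalizes per byte):

```python
import math
import torch

def hedge_mix(expert_log_probs: torch.Tensor, targets: torch.Tensor, eta: float = 0.1):
    """expert_log_probs: (T, E, V) per-token log-probs from E experts.
    targets: (T,) next-token ids. Returns mean bits per token of the mixture."""
    T, E, V = expert_log_probs.shape
    log_w = torch.zeros(E)                                # uniform prior over experts
    nats = 0.0
    for t in range(T):
        lp = expert_log_probs[t]                          # (E, V)
        log_mix = torch.logsumexp(log_w.log_softmax(0).unsqueeze(1) + lp, dim=0)
        nats += -log_mix[targets[t]].item()               # mixture's log-loss
        log_w = log_w + eta * lp[:, targets[t]]           # upweight accurate experts
    return nats / (T * math.log(2))
```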

## Training Details

| Parameter | Value |
|-----------|-------|
| Optimizer | Muon (matrices) + Adam (scalars, embeddings) |
| Matrix / Scalar LR | 0.018 / 0.018 |
| Tied embed LR | 0.021 |
| Muon WD | 0.04 |
| Muon momentum | 0.95 (warmup 0.85→0.95 over 500 steps) |
| Grad clip | 0.3 |
| Batch tokens | 524,288 |
| Quantization | Int7 attn (63 levels) + Int5 MLP (16 levels) + zstd-22 |

## Evolution

| PR | Date | Score | What changed |
|----|------|-------|-------------|
| [#148](https://github.com/openai/parameter-golf/pull/148) | Mar 20 | 1.2196 (sliding) | Depth recurrence (3×4), cross-repeat skip, value embeddings |
| [#784](https://github.com/openai/parameter-golf/pull/784) | Mar 25 | 1.2065 (sliding) | + XSA(4), LeakyReLU², GPTQ-lite, zstd-22 |
| [#835](https://github.com/openai/parameter-golf/pull/835) | Mar 26 | 1.1980 (sliding) | + Progressive depth training (+30% steps) |
| [#1384](https://github.com/openai/parameter-golf/pull/1384) | Apr 5 | 1.1441 (hedge) | + Hedge Mixer (5-expert eval-time ensemble) |
| **This PR** | Apr 8 | **1.1324** (hedge) | + Int7 mixed quant, MLP 3×, d=880, parallel hedge |

## Lineage

- Depth recurrence architecture — original to this submission line
- XSA from [PR #198](https://github.com/openai/parameter-golf/pull/198) (unnir)
- LeakyReLU² from [PR #493](https://github.com/openai/parameter-golf/pull/493) (parinzee)
- Mixed int5/int6 quantization concept from [PR #549](https://github.com/openai/parameter-golf/pull/549) (thwu1), extended here to int7
- SWA, Muon WD from modded-nanogpt community

## Reproducing

```bash
SEED=1337 QUANT_LEVELS=63 MLP_QUANT_LEVELS=16 \
MODEL_DIM=880 NUM_HEADS=8 NUM_KV_HEADS=4 MLP_MULT=3 \
NUM_LAYERS=3 NUM_REPEATS=4 XSA_LAST_N=4 NUM_VALUE_EMBEDS=2 \
PROG_DEPTH="0.30:2,0.50:3,1.0:4" \
WARMDOWN_ITERS=3000 SWA_START_FRAC=0.6 SWA_EVERY=30 \
VOCAB_SIZE=1024 TRAIN_SEQ_LEN=1024 TRAIN_BATCH_TOKENS=524288 \
torchrun --nproc_per_node=8 train_gpt.py
```
Submission metadata:

```json
{
  "author": "Ivan Verbovoy",
  "github_id": "iverbovoy",
  "name": "Depth Recurrence + Int7 Mixed Quantization + Parallel Hedge Mixer",
  "blurb": "3 shared blocks x 4 repeats (12 effective layers) with MLP 3x (d=880), progressive depth (2->3->4 repeats), int7 attention (63 levels) + int5 MLP (16 levels) mixed quantization, 8-GPU parallel Hedge Mixer eval. Key finding: int7 is the sweet spot for attention quantization — recovers 98% of int8 hedge quality while saving 2MB for a wider model. 5 seeds tested, 3-seed mean reported.",
  "date": "2026-04-08T00:00:00Z",
  "val_loss": 1.91197327,
  "val_bpb": 1.13237601,
  "roundtrip_val_bpb": 1.21676461,
  "sliding_val_bpb": 1.18335612,
  "seeds": [1337, 42, 7],
  "mean_steps": 4342,
  "wallclock_seconds": 600,
  "eval_seconds": 164,
  "bytes_model_int8_zstd22": 15403955
}
```