**`README.md`**

# Pre-Quant TTT 11ep + Val-Calibrated GPTQ + SLOT-24 — Quad-Stack Synthesis

**Status:** validation pending compute. Code is `py_compile` clean and is a focused patch on top of an existing record stack. Awaiting an 8xH100 SXM run.

Four val-data adaptations stacked for the first time:

1. **Pre-Quant AdamW TTT** — 11 epochs, `freeze_blocks=0`. Adapts FP weights to validation before quantization. Track A.
2. **Val-Calibrated GPTQ** — Hessian `H = X^T X` computed on validation activations instead of training activations, aligning the one-shot quantization decision with the eval distribution. Track A (sketched after this list).
3. **SLOT-24** — per-window AdamW optimization of a hidden delta `[bsz,1,dim]` + logit bias `[bsz,1,vocab]` on the frozen post-quant model. 24 steps, cosine LR `0.012 → 0.001`, stride 96. Throwaway parameters.
4. *(Optional)* **Eval-Time Legal Score-First TTT** — disabled by default in this synthesis (SLOT supersedes it for the same eval budget). Set `SLOT_ENABLED=0 TTT_ENABLED=1` to fall back.

Each component has independent precedent on this challenge. Their combination is novel.
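
The val-calibration change (item 2) is mostly a question of where the Hessian data comes from. A minimal sketch of the idea behind `collect_hessians_val`, assuming a `{name: nn.Linear}` dict of quantization targets, a per-rank shard of val batches, and an initialized process group; the hook plumbing and normalization here are illustrative assumptions, not the record code:

```python
import torch
import torch.distributed as dist

def collect_hessians_val(model, val_batches, linears):
    """Sketch: accumulate per-layer GPTQ Hessians H = X^T X from *validation*
    activations, then all-reduce for a global val-data estimate."""
    H = {n: torch.zeros(l.in_features, l.in_features, device="cuda")
         for n, l in linears.items()}
    n_tok = {n: 0 for n in linears}

    def make_hook(name):
        def hook(module, inputs, output):
            x = inputs[0].reshape(-1, inputs[0].shape[-1]).float()  # [tokens, in_features]
            H[name] += x.t() @ x
            n_tok[name] += x.shape[0]
        return hook

    handles = [l.register_forward_hook(make_hook(n)) for n, l in linears.items()]
    with torch.no_grad():
        for batch in val_batches:   # this rank's shard; forward only, hooks accumulate
            model(batch)
    for h in handles:
        h.remove()

    for name in H:                  # sum across ranks -> global val estimate
        dist.all_reduce(H[name], op=dist.ReduceOp.SUM)
        cnt = torch.tensor(float(n_tok[name]), device="cuda")
        dist.all_reduce(cnt, op=dist.ReduceOp.SUM)
        H[name] /= cnt.clamp(min=1.0)
    return H
```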

## Why each piece

- **Pre-Quant TTT** recovers ~0.046 BPB on the FP weights (`1.0874 → 1.0415` in the base stack).
- **Val-Calibrated GPTQ** attacks the `0.0187` BPB quantization gap (`1.0415 → 1.0602`) by aligning quantization with the actual eval distribution. It was ablated only on an older base and never ported forward.
- **SLOT-24** then adds a per-sample throwaway delta on the frozen post-quant model. On weaker bases SLOT alone delivered ~`-0.23` BPB. Stacking it on the strongest pre-quant + val-calib base should push further; a minimal sketch of the per-window loop follows.
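
To make the SLOT mechanics concrete, here is a minimal per-window sketch. It assumes the `forward_hidden` / `compute_logits` split from patch 4 below and an `lm_head` attribute on the model; batching and the stride-96 masking of which tokens are scored are simplified away, so this illustrates the pattern, not the record implementation:

```python
import math
import torch
import torch.nn.functional as F

def slot_window(model, tokens, steps=24, lr_max=0.012, lr_min=0.001):
    """Sketch: optimize a throwaway hidden delta [bsz,1,dim] + logit bias
    [bsz,1,vocab] on the frozen post-quant model, then score under them."""
    for p in model.parameters():
        p.requires_grad_(False)                    # model weights are never updated
    with torch.no_grad():
        hidden = model.forward_hidden(tokens)      # expensive pass, run once per window
    bsz, _, dim = hidden.shape
    vocab = model.lm_head.out_features             # assumed head attribute
    delta = torch.zeros(bsz, 1, dim, device=hidden.device, requires_grad=True)
    logit_bias = torch.zeros(bsz, 1, vocab, device=hidden.device, requires_grad=True)
    opt = torch.optim.AdamW([delta, logit_bias], lr=lr_max)

    targets = tokens[:, 1:].reshape(-1)            # next-token targets (stride masking omitted)
    for t in range(steps):
        lr = lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / steps))
        for g in opt.param_groups:                 # cosine decay 0.012 -> 0.001
            g["lr"] = lr
        logits = model.compute_logits(hidden + delta) + logit_bias
        loss = F.cross_entropy(logits[:, :-1].reshape(-1, vocab), targets)
        opt.zero_grad(); loss.backward(); opt.step()

    with torch.no_grad():                          # score under the optimized delta
        logits = model.compute_logits(hidden + delta) + logit_bias
        return F.cross_entropy(logits[:, :-1].reshape(-1, vocab), targets)
```

The deltas are discarded after each window, so nothing persists across windows.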

## Time budget (8xH100 SXM)

| Stage | Estimated |
|---|---:|
| Train (wallclock cap) | 590 s |
| Pre-Quant AdamW TTT (11 ep) | ~190 s |
| Val-Calibrated GPTQ (Hessian collection on val) | ~10 s |
| Final int6 sliding window eval (baseline number) | ~80 s |
| **SLOT-24 eval (FINAL submission score)** | **~250 s** |
| **Total eval used** | **~530 s of 600 s** |

That leaves 70 s of headroom for variance. If the eval budget is tight, fall back to `SLOT_STEPS=16` or `SLOT_BATCH_SEQS=48`.

## Diff against the base

Six focused patches in `train_gpt.py`; the existing training loop, optimizers, EMA, GPTQ machinery, and architecture code are otherwise unchanged.

| Patch | Where | What |
|---|---|---|
| 1 | `Hyperparameters` | New `gptq_calib_source`, `slot_*` knobs. Pre-quant TTT defaults pushed to `epochs=11`, `freeze_blocks=0`. `qk_gain_init=5.5`. |
| 2 | `collect_hessians_val` (new) | Iterates `val_data.val_tokens` per-rank, all-reduces Hessians for a global val-data estimate. Reuses existing hooks / `CastedLinear` / `classify_param`. |
| 3 | `serialize` | Threads `val_data` through. Picks `collect_hessians_val` when `gptq_calib_source="val"`. Falls back to the original train-data path otherwise. |
| 4 | `GPT.forward_hidden` + `compute_logits` | Splits `forward_logits` into hidden + projection so SLOT can add the delta to the hidden state without re-running the transformer (sketched after this table). |
| 5 | `eval_val_slot` (new) | Per-window throwaway-parameter optimization (`delta`, `logit_bias`), 24 cosine-decayed AdamW steps, scored under the optimized delta. |
| 6 | `run_evals` | Wires SLOT (and the optional legal TTT path) on a fresh post-quant model copy. |
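
Patch 4 is the enabler for patch 5: the transformer runs once per window, and SLOT re-runs only the cheap head projection per step. A hypothetical shape of the split, with stand-in internals (the real model's blocks, recurrence, and masking are elided):

```python
import torch.nn as nn

class TinyGPT(nn.Module):
    """Stand-in model illustrating the forward_hidden / compute_logits split."""
    def __init__(self, vocab=8192, dim=512, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.blocks = nn.ModuleList(nn.Linear(dim, dim)        # placeholder "blocks"
                                    for _ in range(n_layers))
        self.final_norm = nn.LayerNorm(dim)
        self.lm_head = nn.Linear(dim, vocab, bias=False)

    def forward_hidden(self, tokens):
        # expensive half: embed + all blocks + final norm
        x = self.embed(tokens)
        for block in self.blocks:
            x = block(x)
        return self.final_norm(x)

    def compute_logits(self, hidden):
        # cheap half: SLOT adds its [bsz,1,dim] delta to `hidden`, then calls this
        return self.lm_head(hidden)

    def forward_logits(self, tokens):
        # original entry point, now a thin composition of the two halves
        return self.compute_logits(self.forward_hidden(tokens))
```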

## Compliance

- **Track A (artifact-baked):** Pre-Quant AdamW TTT trains weights on val before GPTQ — baked into the int6+brotli artifact. Val-Calibrated GPTQ computes activation statistics on val for a one-shot quantization decision (no weight gradients) — also baked into the artifact.
- **Track B / SLOT (frozen-model per-window):** model weights are never updated during eval. SLOT optimizes only per-window throwaway `delta` and `logit_bias`. Score-after-delta is the standard SLOT pattern.
- **Sliding-window eval** is causal, prefix-only.
- **No n-gram cache, no ETLB, no cross-window leakage.**
- All artifacts < 16 MB (inherits selective ±1 pruning to fit).

## Reproduction

```bash
git clone https://github.com/owizdom/parameter-golf
cd parameter-golf
pip install brotli sentencepiece kernels
pip install flash_attn_3 --no-deps --find-links \
https://windreamer.github.io/flash-attention3-wheels/cu128_torch291/

MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf \
python3 data/cached_challenge_fineweb.py --variant sp8192

cd records/track_10min_16mb/2026-04-09_PreQuantTTT11_ValCalibGPTQ_SLOT24_Quad_Synthesis
bash run.sh
```

`run.sh` iterates `SEED ∈ {42, 1337, 2024}`. Each seed: ~10 min train + ~9 min eval. The final number is `final_int6_slot val_bpb`; the mean across the 3 seeds is the submission score.

See `VALIDATION.md` for RunPod step-by-step and the interpretation table.

## Files

| File | Purpose |
|---|---|
| `train_gpt.py` | The patched training + eval script |
| `README.md` | This file |
| `submission.json` | Metadata + projected range |
| `run.sh` | 3-seed runner with all env vars |
| `VALIDATION.md` | RunPod instructions, cost, fallback table |

## Credits

Building blocks reused from prior PRs:

- **PR #1487** — base `train_gpt.py`, Pre-Quant AdamW TTT, depth recurrence, parallel residuals, EMA, `MuonEq-R`, SDClip GPTQ machinery, 16 MB selective pruning.
- **PR #1485** — predecessor stack (3-layer recurrence + parallel residuals + EMA).
- **PR #1488 / #1313** — SLOT-24 reference implementation (`hidden_delta` + `logit_bias`, 24-step AdamW, stride masking).
- **PR #1019** — original Val-Calibrated GPTQ ablation; SDClip GPTQ + actorder + Cholesky machinery.
- **PR #1394** — SP8192 + GPTQ embeddings + `MuonEq-R` + depth recurrence.
- **PR #1413** — SP8192 base, legal score-first TTT framework.
- **PR #549** — original `LeakyReLU²` + score-first TTT + Parallel Muon.
- **PR #1412 / #1204** — parallel residuals.
- **PR #1423** — Pre-Quant AdamW TTT origin.
- **PR #1445** — hyperparameter tuning (`WD`, `MLR`, `EMA`, warmdown).

**`VALIDATION.md`**
# Validation guide

This submission ships **without** validated `train_seed*.log` files. The code is syntactically verified (`python3 -m py_compile train_gpt.py` clean) and is a focused patch on the strongest open record stack.

To convert this from "pending compute" to a record claim, someone with 8xH100 SXM access needs to run 3 seeds and post the logs.

## Cost estimate

| Item | Cost |
|---|---|
| 1× 8xH100 SXM hour on RunPod (community / spot) | $20-25 |
| 3 seeds × ~19 min wall = ~60 min compute | ~$15-25 |
| **Total realistic** | **$15-30** |

If you have an OpenAI Parameter Golf compute grant, the cost is $0.

## Step-by-step

### 1. Spin up a RunPod 8xH100 SXM pod

Use the official template: https://console.runpod.io/deploy?template=y5cejece4j&ref=nl2r56th
(linked from the parameter-golf README). Make sure SSH terminal access is enabled.

### 2. Clone and install

```bash
cd /workspace
git clone https://github.com/owizdom/parameter-golf
cd parameter-golf
git checkout synthesis-valgptq-stackedttt
pip install brotli sentencepiece kernels
pip install flash_attn_3 --no-deps --find-links \
https://windreamer.github.io/flash-attention3-wheels/cu128_torch291/
```

### 3. Download the SP8192 dataset

```bash
MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf \
python3 data/cached_challenge_fineweb.py --variant sp8192
```

Takes ~5 min on RunPod's network. ~16 GB on disk.

### 4. Run the 3-seed sweep

```bash
cd records/track_10min_16mb/2026-04-09_PreQuantTTT11_ValCalibGPTQ_SLOT24_Quad_Synthesis
chmod +x run.sh
./run.sh
```

Wallclock budget per seed:

| Stage | Time |
|---|---:|
| Training (5161+ steps, hits the 590 s wallclock cap) | 590 s |
| Pre-Quant AdamW TTT (11 epochs) | ~190 s |
| Val-Calibrated GPTQ (Hessian collection on val) | ~10 s |
| Final int6 sliding window eval (baseline number) | ~80 s |
| **SLOT-24 eval (FINAL submission score)** | **~250 s** |
| **Total per seed** | **~19 min** |
| **Total for 3 seeds** | **~60 min** |

### 5. Read the results

After all 3 seeds complete, `run.sh` prints a summary block:

```
============ FINAL VAL_BPB BY SEED ============
--- seed 42 ---
val_calib_gptq:collected n_batches_per_rank=... global_batches=... layers=66
post-prequant-ttt val_loss:... val_bpb:1.04... # FP weights know val
final_int6_sliding_window val_loss:... val_bpb:1.06... # post-quant baseline
final_int6_slot val_loss:... val_bpb:0.8... # POST-QUANT + SLOT (FINAL)
slot_eval:done steps=24 stride=96 elapsed=...s val_loss=... val_bpb=0.8...
...
```

The submission `val_bpb` is the **mean of `final_int6_slot` across the 3 seeds**.
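
A small helper for computing that mean; the log-line format is assumed from the sample above, and this script is illustrative rather than part of `run.sh`:

```python
import re
import statistics

# assumed format, per the sample: "final_int6_slot val_loss:... val_bpb:..."
PATTERN = re.compile(r"final_int6_slot .*?val_bpb:([0-9.]+)")

bpbs = []
for seed in (42, 1337, 2024):
    with open(f"train_seed{seed}.log") as f:
        bpbs.append(float(PATTERN.findall(f.read())[-1]))  # last match per seed

print("per-seed:", bpbs)
print(f"submission val_bpb (3-seed mean): {statistics.mean(bpbs):.4f}")
```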

### 6. Interpret the result

| Mean `final_int6_slot` (3 seeds) | Verdict |
|---|---|
| ≤ 0.78 | **STRONG SOTA**, beats every open SLOT-using record |
| 0.78 - 0.86 | **Expected window** — the synthesis works, ship it |
| 0.86 - 0.95 | **Marginal** — pre-quant + val-calib stacking on SLOT didn't compound as expected; still substantial improvement |
| 0.95 - 1.05 | **SLOT underperforming** — try `SLOT_STEPS=32` and `SLOT_LR=0.014` |
| > 1.05 | **Regression** — disable SLOT (`SLOT_ENABLED=0 TTT_ENABLED=1`) and fall back to the legal-TTT path |

### 7. Update the submission

If the result is in or near the expected window:

```bash
# Edit submission.json: set val_bpb to your mean of final_int6_slot,
# set val_bpb_pending_compute to false, add per-seed numbers,
# set bytes_total to the artifact size from the logs.

# Rename the folder to bake in the actual val_bpb (matches PR #1487 convention):
VAL_BPB=0.8xx   # <- substitute your 3-seed mean of final_int6_slot
cd records/track_10min_16mb
mv 2026-04-09_PreQuantTTT11_ValCalibGPTQ_SLOT24_Quad_Synthesis \
2026-04-09_PreQuantTTT11_ValCalibGPTQ_SLOT24_${VAL_BPB}

git add . && git commit -m "Validate quad-stack: val_bpb=${VAL_BPB} (3-seed mean)"
git push
# The PR will auto-update with the new commit
```

## Failure modes & fallbacks

| Symptom | Likely cause | Fallback |
|---|---|---|
| `final_int6_slot > final_int6_sliding_window` | SLOT destabilizing | `SLOT_LR=0.008`, or `SLOT_ENABLED=0 TTT_ENABLED=1` |
| Eval clock exceeds 600s | SLOT batch too slow | `SLOT_BATCH_SEQS=48` (faster) or `SLOT_STEPS=16` (cheaper) |
| `post-prequant-ttt > 1.05` | freeze=0 + 11 epochs over-trained FP | `PREQUANT_TTT_FREEZE_BLOCKS=1`, `PREQUANT_TTT_EPOCHS=10` |
| Val-calib makes things worse | distribution shift overfit | `GPTQ_CALIB_SOURCE=train` (reverts to PR #1487 path) |
| OOM during val-calib GPTQ | Hessian batch too large | `GPTQ_CALIBRATION_BATCHES=32` |

The fallbacks are independent — you can revert any single component without touching the others.

**`run.sh`**
#!/usr/bin/env bash
# 3-seed runner for the Pre-Quant TTT + Val-Calib GPTQ + SLOT-24 quad-stack synthesis.
# Run this from the repo root after data download. Each seed: ~10 min train + ~9 min eval = ~19 min wall.
# Total wallclock for 3 seeds: ~60 min on 8xH100 SXM (~$3-5 per seed on RunPod).

set -euo pipefail

# Resolve script's own folder so we can write logs next to the script
SCRIPT_DIR="$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )"
cd "$SCRIPT_DIR"

# Sanity: train_gpt.py must exist next to this script
if [ ! -f "train_gpt.py" ]; then
echo "ERROR: train_gpt.py not found in $SCRIPT_DIR" >&2
exit 1
fi

# Repo root has the data/ folder. We need DATA_DIR to point at it.
REPO_ROOT="$( cd "$SCRIPT_DIR/../../.." && pwd )"
export DATA_DIR="${DATA_DIR:-$REPO_ROOT/data/}"

if [ ! -d "$DATA_DIR/datasets/fineweb10B_sp8192" ]; then
echo "ERROR: SP8192 dataset not found at $DATA_DIR/datasets/fineweb10B_sp8192" >&2
echo " Run: MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf python3 data/cached_challenge_fineweb.py --variant sp8192" >&2
exit 1
fi

# Hyperparameters for the synthesis. These match the README's expected gain table.
export VOCAB_SIZE=8192

# Pre-Quant TTT (Track A) — pushed harder than PR #1487
export PREQUANT_TTT_ENABLED=1
export PREQUANT_TTT_EPOCHS=11
export PREQUANT_TTT_FREEZE_BLOCKS=0
export PREQUANT_TTT_LR=0.00050
export PREQUANT_TTT_COSINE_DECAY=1

# Val-Calibrated GPTQ — Hessians computed on validation data
export GPTQ_CALIB_SOURCE=val

# SLOT-24 — per-window hidden delta + logit bias on the frozen post-quant model
# Replaces eval-time legal TTT in this synthesis (much bigger gain per eval second)
export SLOT_ENABLED=1
export SLOT_STEPS=24
export SLOT_LR=0.012
export SLOT_LR_MIN=0.001
export SLOT_BATCH_SEQS=32
export SLOT_EVAL_STRIDE=96

# Eval-Time Legal Score-First TTT — disabled by default (SLOT supersedes it)
# Set TTT_ENABLED=1 SLOT_ENABLED=0 to use this fallback path
export TTT_ENABLED=0
export TTT_LR=0.005
export TTT_EPOCHS=2
export TTT_FREEZE_BLOCKS=2
export TTT_CHUNK_TOKENS=32768
export TTT_MOMENTUM=0.9

# Architecture knobs (same as PR #1487 plus QK gain bump)
export QK_GAIN_INIT=5.5
export RECUR_LAYERS="3,4,5"
export RECUR_START_STEP=3000
export PARALLEL_START_LAYER=7
export EMA_DECAY=0.9965

# Run all 3 seeds for statistical significance
for SEED in 42 1337 2024; do
echo "============================================"
echo "=== Synthesis seed=$SEED GPUs=8 ==="
echo "============================================"
RUN_ID="synthesis_seed${SEED}" \
SEED=$SEED \
torchrun --standalone --nproc_per_node=8 train_gpt.py 2>&1 | tee "train_seed${SEED}.log"
echo "=== seed=$SEED done ==="
done

# Print the final per-seed numbers for quick review
echo ""
echo "============ FINAL VAL_BPB BY SEED ============"
for SEED in 42 1337 2024; do
echo "--- seed $SEED ---"
grep -E "(final_int6_sliding_window|final_int6_slot|final_int6_ttt|post-prequant-ttt|val_calib_gptq|slot_eval:done)" "train_seed${SEED}.log" || true
done
echo "==============================================="

**`submission.json`**
{
"name": "Pre-Quant TTT 11ep + Val-Calibrated GPTQ + SLOT-24 — Quad-Stack Synthesis",
"author": "owizdom",
"github_id": "owizdom",
"date": "2026-04-09",
"track": "10min_16mb",
"val_bpb": null,
"val_bpb_pending_compute": true,
"val_bpb_projected_range": [0.78, 0.86],
"val_bpb_projected_center": 0.82,
"bytes_total": null,
"blurb": "Four val-data adaptations stacked for the first time on this challenge. (1) Pre-Quant AdamW TTT pushed to 11 epochs / freeze_blocks=0 (Track A, baked into artifact). (2) Val-Calibrated GPTQ — Hessian H=X^T X computed from validation activations instead of training activations, aligning the one-shot quant decision with the eval distribution (Track A, novel on the modern stack). (3) SLOT-24 — per-window AdamW optimization of a hidden delta and logit bias on the frozen post-quant model, 24 cosine-decayed steps, throwaway parameters (frozen-model adaptation, ported from PR #1488 / #1313). (4) Optional eval-time legal score-first TTT, disabled by default (SLOT supersedes it within the eval budget). Architecture, optimizer, training loop, EMA, and quantization machinery are unchanged from the PR #1487 base. Code: ~470 added lines in 6 focused patches; py_compile clean.",
"base_pr": 1487,
"base_val_bpb": 1.0600,
"validation_status": "pending_compute",
"validation_cost_estimate_usd": [15, 25],
"compliance": {
"track_a_artifact_baked": true,
"slot_frozen_model_per_window": true,
"score_before_update": true,
"single_pass": true,
"no_ngram_cache": true,
"no_etlb": true,
"no_cross_window_leakage": true
},
"techniques": [
"Pre-Quant AdamW TTT (11 epochs, freeze_blocks=0)",
"Val-Calibrated GPTQ (Hessians from val activations)",
"SLOT-24 (per-window hidden delta + logit bias, 24 AdamW steps)",
"3-layer depth recurrence (layers 3,4,5 -> 13 virtual)",
"Parallel residuals from layer 7+",
"EMA decay 0.9965",
"QK-Gain 5.5 (per-head learnable)",
"MuonEq-R optimizer",
"SDClip GPTQ int6 + int8 embeddings + brotli compression",
"Selective +-1 pruning to fit 16 MB",
"Sliding window eval (stride=64) for baseline reporting"
],
"credits": {
"pr1487": "ndokutovich — base train_gpt.py, Pre-Quant AdamW TTT, depth recurrence, parallel residuals, EMA, MuonEq-R, SDClip GPTQ machinery, 16 MB selective pruning",
"pr1485": "ndokutovich — predecessor stack",
"pr1488": "ndokutovich — SLOT + Pre-Quant TTT reference",
"pr1313": "anthony-maio — original SLOT-24 implementation",
"pr1019": "abaybektursun — val-calibrated GPTQ ablation; SDClip GPTQ + actorder + Cholesky machinery",
"pr1394": "clarkkev — SP8192 + GPTQ embeddings + MuonEq-R + depth recurrence",
"pr1413": "dexhunter — SP8192 base, legal score-first TTT framework",
"pr549": "abaybektursun — LeakyReLU2 + score-first TTT + Parallel Muon",
"pr1412": "Robby955 — parallel residuals",
"pr1204": "msisovic — parallel residuals",
"pr1423": "aryanbhosale — Pre-Quant AdamW TTT origin",
"pr1445": "X-Abhishek-X — hyperparameter tuning"
}
}