**`README.md`**

# Pre-Quant TTT 11ep + Val-Calibrated GPTQ + SLOT-24 — Quad-Stack Synthesis

**Status:** validation pending compute. Code is `py_compile` clean and is a focused patch on top of an existing record stack. Awaiting an 8xH100 SXM run.

Four val-data adaptations stacked for the first time:

1. **Pre-Quant AdamW TTT** — 11 epochs, `freeze_blocks=0`. Adapts FP weights to validation before quantization. Track A.
2. **Val-Calibrated GPTQ** — Hessian `H = X^T X` computed on validation activations instead of training activations, aligning the one-shot quantization decision with the eval distribution. Track A (sketched after this list).
3. **SLOT-24** — per-window AdamW optimization of a hidden delta `[bsz,1,dim]` + logit bias `[bsz,1,vocab]` on the frozen post-quant model. 24 steps, cosine LR `0.012 → 0.001`, stride 96. Throwaway parameters.
4. *(Optional)* **Eval-Time Legal Score-First TTT** — disabled by default in this synthesis (SLOT supersedes it for the same eval budget). Set `SLOT_ENABLED=0 TTT_ENABLED=1` to fall back.

Each component has independent precedent on this challenge. Their combination is novel.
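
The val-calibration change (item 2) is mostly a question of where the Hessian data comes from. A minimal sketch of the idea behind `collect_hessians_val`, assuming a `{name: nn.Linear}` dict of quantization targets, a per-rank shard of val batches, and an initialized process group; the hook plumbing and normalization here are illustrative assumptions, not the record code:

```python
import torch
import torch.distributed as dist

def collect_hessians_val(model, val_batches, linears):
    """Sketch: accumulate per-layer GPTQ Hessians H = X^T X from *validation*
    activations, then all-reduce for a global val-data estimate."""
    H = {n: torch.zeros(l.in_features, l.in_features, device="cuda")
         for n, l in linears.items()}
    n_tok = {n: 0 for n in linears}

    def make_hook(name):
        def hook(module, inputs, output):
            x = inputs[0].reshape(-1, inputs[0].shape[-1]).float()  # [tokens, in_features]
            H[name] += x.t() @ x
            n_tok[name] += x.shape[0]
        return hook

    handles = [l.register_forward_hook(make_hook(n)) for n, l in linears.items()]
    with torch.no_grad():
        for batch in val_batches:   # this rank's shard; forward only, hooks accumulate
            model(batch)
    for h in handles:
        h.remove()

    for name in H:                  # sum across ranks -> global val estimate
        dist.all_reduce(H[name], op=dist.ReduceOp.SUM)
        cnt = torch.tensor(float(n_tok[name]), device="cuda")
        dist.all_reduce(cnt, op=dist.ReduceOp.SUM)
        H[name] /= cnt.clamp(min=1.0)
    return H
```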

## Why each piece

- **Pre-Quant TTT** recovers ~0.046 BPB on the FP weights (`1.0874 → 1.0415` in the base stack).
- **Val-Calibrated GPTQ** attacks the `0.0187` BPB quantization gap (`1.0415 → 1.0602`) by aligning quantization with the actual eval distribution. It was ablated only on an older base and never ported forward.
- **SLOT-24** then adds a per-sample throwaway delta on the frozen post-quant model. On weaker bases SLOT alone delivered ~`-0.23` BPB. Stacking it on the strongest pre-quant + val-calib base should push further; a minimal sketch of the per-window loop follows.
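
To make the SLOT mechanics concrete, here is a minimal per-window sketch. It assumes the `forward_hidden` / `compute_logits` split from patch 4 below and an `lm_head` attribute on the model; batching and the stride-96 masking of which tokens are scored are simplified away, so this illustrates the pattern, not the record implementation:

```python
import math
import torch
import torch.nn.functional as F

def slot_window(model, tokens, steps=24, lr_max=0.012, lr_min=0.001):
    """Sketch: optimize a throwaway hidden delta [bsz,1,dim] + logit bias
    [bsz,1,vocab] on the frozen post-quant model, then score under them."""
    for p in model.parameters():
        p.requires_grad_(False)                    # model weights are never updated
    with torch.no_grad():
        hidden = model.forward_hidden(tokens)      # expensive pass, run once per window
    bsz, _, dim = hidden.shape
    vocab = model.lm_head.out_features             # assumed head attribute
    delta = torch.zeros(bsz, 1, dim, device=hidden.device, requires_grad=True)
    logit_bias = torch.zeros(bsz, 1, vocab, device=hidden.device, requires_grad=True)
    opt = torch.optim.AdamW([delta, logit_bias], lr=lr_max)

    targets = tokens[:, 1:].reshape(-1)            # next-token targets (stride masking omitted)
    for t in range(steps):
        lr = lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / steps))
        for g in opt.param_groups:                 # cosine decay 0.012 -> 0.001
            g["lr"] = lr
        logits = model.compute_logits(hidden + delta) + logit_bias
        loss = F.cross_entropy(logits[:, :-1].reshape(-1, vocab), targets)
        opt.zero_grad(); loss.backward(); opt.step()

    with torch.no_grad():                          # score under the optimized delta
        logits = model.compute_logits(hidden + delta) + logit_bias
        return F.cross_entropy(logits[:, :-1].reshape(-1, vocab), targets)
```

The deltas are discarded after each window, so nothing persists across windows.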

## Time budget (8xH100 SXM)

| Stage | Estimated |
|---|---:|
| Train (wallclock cap) | 590 s |
| Pre-Quant AdamW TTT (11 ep) | ~190 s |
| Val-Calibrated GPTQ (Hessian collection on val) | ~10 s |
| Final int6 sliding window eval (baseline number) | ~80 s |
| **SLOT-24 eval (FINAL submission score)** | **~250 s** |
| **Total eval used** | **~530 s of 600 s** |

That leaves 70 s of headroom for variance. If the eval budget is tight, fall back to `SLOT_STEPS=16` or `SLOT_BATCH_SEQS=48`.

## Diff against the base

Six focused patches in `train_gpt.py`; the existing training loop, optimizers, EMA, GPTQ machinery, and architecture code are otherwise unchanged.

| Patch | Where | What |
|---|---|---|
| 1 | `Hyperparameters` | New `gptq_calib_source`, `slot_*` knobs. Pre-quant TTT defaults pushed to `epochs=11`, `freeze_blocks=0`. `qk_gain_init=5.5`. |
| 2 | `collect_hessians_val` (new) | Iterates `val_data.val_tokens` per-rank, all-reduces Hessians for a global val-data estimate. Reuses existing hooks / `CastedLinear` / `classify_param`. |
| 3 | `serialize` | Threads `val_data` through. Picks `collect_hessians_val` when `gptq_calib_source="val"`. Falls back to the original train-data path otherwise. |
| 4 | `GPT.forward_hidden` + `compute_logits` | Splits `forward_logits` into hidden + projection so SLOT can add the delta to the hidden state without re-running the transformer (sketched after this table). |
| 5 | `eval_val_slot` (new) | Per-window throwaway-parameter optimization (`delta`, `logit_bias`), 24 cosine-decayed AdamW steps, scored under the optimized delta. |
| 6 | `run_evals` | Wires SLOT (and the optional legal TTT path) on a fresh post-quant model copy. |
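
Patch 4 is the enabler for patch 5: the transformer runs once per window, and SLOT re-runs only the cheap head projection per step. A hypothetical shape of the split, with stand-in internals (the real model's blocks, recurrence, and masking are elided):

```python
import torch.nn as nn

class TinyGPT(nn.Module):
    """Stand-in model illustrating the forward_hidden / compute_logits split."""
    def __init__(self, vocab=8192, dim=512, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.blocks = nn.ModuleList(nn.Linear(dim, dim)        # placeholder "blocks"
                                    for _ in range(n_layers))
        self.final_norm = nn.LayerNorm(dim)
        self.lm_head = nn.Linear(dim, vocab, bias=False)

    def forward_hidden(self, tokens):
        # expensive half: embed + all blocks + final norm
        x = self.embed(tokens)
        for block in self.blocks:
            x = block(x)
        return self.final_norm(x)

    def compute_logits(self, hidden):
        # cheap half: SLOT adds its [bsz,1,dim] delta to `hidden`, then calls this
        return self.lm_head(hidden)

    def forward_logits(self, tokens):
        # original entry point, now a thin composition of the two halves
        return self.compute_logits(self.forward_hidden(tokens))
```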

## Compliance

- **Track A (artifact-baked):** Pre-Quant AdamW TTT trains weights on val before GPTQ — baked into the int6+brotli artifact. Val-Calibrated GPTQ computes activation statistics on val for a one-shot quantization decision (no weight gradients) — also baked into the artifact.
- **Track B / SLOT (frozen-model per-window):** model weights are never updated during eval. SLOT optimizes only per-window throwaway `delta` and `logit_bias`. Score-after-delta is the standard SLOT pattern.
- **Sliding-window eval** is causal, prefix-only.
- **No n-gram cache, no ETLB, no cross-window leakage.**
- All artifacts < 16 MB (inherits selective ±1 pruning to fit).

## Reproduction

```bash
git clone https://github.com/owizdom/parameter-golf
cd parameter-golf
pip install brotli sentencepiece kernels
pip install flash_attn_3 --no-deps --find-links \
https://windreamer.github.io/flash-attention3-wheels/cu128_torch291/

MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf \
python3 data/cached_challenge_fineweb.py --variant sp8192

cd records/track_10min_16mb/2026-04-09_PreQuantTTT11_ValCalibGPTQ_SLOT24_Quad_Synthesis
bash run.sh
```

`run.sh` iterates `SEED ∈ {42, 1337, 2024}`. Each seed: ~10 min train + ~9 min eval. The final number is `final_int6_slot val_bpb`; the mean across the 3 seeds is the submission score.

See `VALIDATION.md` for RunPod step-by-step and the interpretation table.

## Files

| File | Purpose |
|---|---|
| `train_gpt.py` | The patched training + eval script |
| `README.md` | This file |
| `submission.json` | Metadata + projected range |
| `run.sh` | 3-seed runner with all env vars |
| `VALIDATION.md` | RunPod instructions, cost, fallback table |

## Credits

Building blocks reused from prior PRs:

- **PR #1487** — base `train_gpt.py`, Pre-Quant AdamW TTT, depth recurrence, parallel residuals, EMA, `MuonEq-R`, SDClip GPTQ machinery, 16 MB selective pruning.
- **PR #1485** — predecessor stack (3-layer recurrence + parallel residuals + EMA).
- **PR #1488 / #1313** — SLOT-24 reference implementation (`hidden_delta` + `logit_bias`, 24-step AdamW, stride masking).
- **PR #1019** — original Val-Calibrated GPTQ ablation; SDClip GPTQ + actorder + Cholesky machinery.
- **PR #1394** — SP8192 + GPTQ embeddings + `MuonEq-R` + depth recurrence.
- **PR #1413** — SP8192 base, legal score-first TTT framework.
- **PR #549** — original `LeakyReLU²` + score-first TTT + Parallel Muon.
- **PR #1412 / #1204** — parallel residuals.
- **PR #1423** — Pre-Quant AdamW TTT origin.
- **PR #1445** — hyperparameter tuning (`WD`, `MLR`, `EMA`, warmdown).

**`VALIDATION.md`**
# Validation guide

This submission ships **without** validated `train_seed*.log` files. The code is syntactically verified (`python3 -m py_compile train_gpt.py` clean) and is a focused patch on the strongest open record stack.

To convert this from "pending compute" to a record claim, someone with 8xH100 SXM access needs to run 3 seeds and post the logs.

## Cost estimate

| Item | Cost |
|---|---|
| 1× 8xH100 SXM hour on RunPod (community / spot) | $20-25 |
| 3 seeds × ~19 min wall = ~60 min compute | ~$15-25 |
| **Total realistic** | **$15-30** |

If you have an OpenAI Parameter Golf compute grant, the cost is $0.

## Step-by-step

### 1. Spin up a RunPod 8xH100 SXM pod

Use the official template: https://console.runpod.io/deploy?template=y5cejece4j&ref=nl2r56th
(linked from the parameter-golf README). Make sure SSH terminal access is enabled.

### 2. Clone and install

```bash
cd /workspace
git clone https://github.com/owizdom/parameter-golf
cd parameter-golf
git checkout synthesis-valgptq-stackedttt
pip install brotli sentencepiece kernels
pip install flash_attn_3 --no-deps --find-links \
https://windreamer.github.io/flash-attention3-wheels/cu128_torch291/
```

### 3. Download the SP8192 dataset

```bash
MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf \
python3 data/cached_challenge_fineweb.py --variant sp8192
```

Takes ~5 min on RunPod's network. ~16 GB on disk.

### 4. Run the 3-seed sweep

```bash
cd records/track_10min_16mb/2026-04-09_PreQuantTTT11_ValCalibGPTQ_SLOT24_Quad_Synthesis
chmod +x run.sh
./run.sh
```

Wallclock budget per seed:

| Stage | Time |
|---|---:|
| Training (5161+ steps, hits the 590 s wallclock cap) | 590 s |
| Pre-Quant AdamW TTT (11 epochs) | ~190 s |
| Val-Calibrated GPTQ (Hessian collection on val) | ~10 s |
| Final int6 sliding window eval (baseline number) | ~80 s |
| **SLOT-24 eval (FINAL submission score)** | **~250 s** |
| **Total per seed** | **~19 min** |
| **Total for 3 seeds** | **~60 min** |

### 5. Read the results

After all 3 seeds complete, `run.sh` prints a summary block:

```
============ FINAL VAL_BPB BY SEED ============
--- seed 42 ---
val_calib_gptq:collected n_batches_per_rank=... global_batches=... layers=66
post-prequant-ttt val_loss:... val_bpb:1.04... # FP weights know val
final_int6_sliding_window val_loss:... val_bpb:1.06... # post-quant baseline
final_int6_slot val_loss:... val_bpb:0.8... # POST-QUANT + SLOT (FINAL)
slot_eval:done steps=24 stride=96 elapsed=...s val_loss=... val_bpb=0.8...
...
```

The submission `val_bpb` is the **mean of `final_int6_slot` across the 3 seeds**.
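
A small helper for computing that mean; the log-line format is assumed from the sample above, and this script is illustrative rather than part of `run.sh`:

```python
import re
import statistics

# assumed format, per the sample: "final_int6_slot val_loss:... val_bpb:..."
PATTERN = re.compile(r"final_int6_slot .*?val_bpb:([0-9.]+)")

bpbs = []
for seed in (42, 1337, 2024):
    with open(f"train_seed{seed}.log") as f:
        bpbs.append(float(PATTERN.findall(f.read())[-1]))  # last match per seed

print("per-seed:", bpbs)
print(f"submission val_bpb (3-seed mean): {statistics.mean(bpbs):.4f}")
```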

### 6. Interpret the result

| Mean `final_int6_slot` (3 seeds) | Verdict |
|---|---|
| ≤ 0.78 | **STRONG SOTA**, beats every open SLOT-using record |
| 0.78 - 0.86 | **Expected window** — the synthesis works, ship it |
| 0.86 - 0.95 | **Marginal** — pre-quant + val-calib stacking on SLOT didn't compound as expected; still substantial improvement |
| 0.95 - 1.05 | **SLOT underperforming** — try `SLOT_STEPS=32` and `SLOT_LR=0.014` |
| > 1.05 | **Regression** — disable SLOT (`SLOT_ENABLED=0 TTT_ENABLED=1`) and fall back to the legal-TTT path |

### 7. Update the submission

If the result is in or near the expected window:

```bash
# Edit submission.json: set val_bpb to your mean of final_int6_slot,
# set val_bpb_pending_compute to false, add per-seed numbers,
# set bytes_total to the artifact size from the logs.

# Rename the folder to bake in the actual val_bpb (matches PR #1487 convention):
VAL_BPB=0.8xx   # <- substitute your 3-seed mean of final_int6_slot
cd records/track_10min_16mb
mv 2026-04-09_PreQuantTTT11_ValCalibGPTQ_SLOT24_Quad_Synthesis \
2026-04-09_PreQuantTTT11_ValCalibGPTQ_SLOT24_${VAL_BPB}

git add . && git commit -m "Validate quad-stack: val_bpb=${VAL_BPB} (3-seed mean)"
git push
# The PR will auto-update with the new commit
```

## Failure modes & fallbacks

| Symptom | Likely cause | Fallback |
|---|---|---|
| `final_int6_slot > final_int6_sliding_window` | SLOT destabilizing | `SLOT_LR=0.008`, or `SLOT_ENABLED=0 TTT_ENABLED=1` |
| Eval clock exceeds 600s | SLOT batch too slow | `SLOT_BATCH_SEQS=48` (faster) or `SLOT_STEPS=16` (cheaper) |
| `post-prequant-ttt > 1.05` | freeze=0 + 11 epochs over-trained FP | `PREQUANT_TTT_FREEZE_BLOCKS=1`, `PREQUANT_TTT_EPOCHS=10` |
| Val-calib makes things worse | distribution shift overfit | `GPTQ_CALIB_SOURCE=train` (reverts to PR #1487 path) |
| OOM during val-calib GPTQ | Hessian batch too large | `GPTQ_CALIBRATION_BATCHES=32` |

The fallbacks are independent — you can revert any single component without touching the others.

**`run.sh`**
#!/usr/bin/env bash
# 3-seed runner for the Pre-Quant TTT + Val-Calib GPTQ + SLOT-24 quad-stack synthesis.
# Run this from the repo root after data download. Each seed: ~10 min train + ~9 min eval = ~19 min wall.
# Total wallclock for 3 seeds: ~60 min on 8xH100 SXM (~$3-5 per seed on RunPod).

set -euo pipefail

# Resolve script's own folder so we can write logs next to the script
SCRIPT_DIR="$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )"
cd "$SCRIPT_DIR"

# Sanity: train_gpt.py must exist next to this script
if [ ! -f "train_gpt.py" ]; then
echo "ERROR: train_gpt.py not found in $SCRIPT_DIR" >&2
exit 1
fi

# Repo root has the data/ folder. We need DATA_DIR to point at it.
REPO_ROOT="$( cd "$SCRIPT_DIR/../../.." && pwd )"
export DATA_DIR="${DATA_DIR:-$REPO_ROOT/data/}"

if [ ! -d "$DATA_DIR/datasets/fineweb10B_sp8192" ]; then
echo "ERROR: SP8192 dataset not found at $DATA_DIR/datasets/fineweb10B_sp8192" >&2
echo " Run: MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf python3 data/cached_challenge_fineweb.py --variant sp8192" >&2
exit 1
fi

# Hyperparameters for the synthesis. These match the README's expected gain table.
export VOCAB_SIZE=8192

# Pre-Quant TTT (Track A) — pushed harder than PR #1487
export PREQUANT_TTT_ENABLED=1
export PREQUANT_TTT_EPOCHS=11
export PREQUANT_TTT_FREEZE_BLOCKS=0
export PREQUANT_TTT_LR=0.00050
export PREQUANT_TTT_COSINE_DECAY=1

# Val-Calibrated GPTQ — Hessians computed on validation data
export GPTQ_CALIB_SOURCE=val

# SLOT-24 — per-window hidden delta + logit bias on the frozen post-quant model
# Replaces eval-time legal TTT in this synthesis (much bigger gain per eval second)
export SLOT_ENABLED=1
export SLOT_STEPS=24
export SLOT_LR=0.012
export SLOT_LR_MIN=0.001
export SLOT_BATCH_SEQS=32
export SLOT_EVAL_STRIDE=96

# Eval-Time Legal Score-First TTT — disabled by default (SLOT supersedes it)
# Set TTT_ENABLED=1 SLOT_ENABLED=0 to use this fallback path
export TTT_ENABLED=0
export TTT_LR=0.005
export TTT_EPOCHS=2
export TTT_FREEZE_BLOCKS=2
export TTT_CHUNK_TOKENS=32768
export TTT_MOMENTUM=0.9

# Architecture knobs (same as PR #1487 plus QK gain bump)
export QK_GAIN_INIT=5.5
export RECUR_LAYERS="3,4,5"
export RECUR_START_STEP=3000
export PARALLEL_START_LAYER=7
export EMA_DECAY=0.9965

# Run all 3 seeds for statistical significance
for SEED in 42 1337 2024; do
echo "============================================"
echo "=== Synthesis seed=$SEED GPUs=8 ==="
echo "============================================"
RUN_ID="synthesis_seed${SEED}" \
SEED=$SEED \
torchrun --standalone --nproc_per_node=8 train_gpt.py 2>&1 | tee "train_seed${SEED}.log"
echo "=== seed=$SEED done ==="
done

# Print the final per-seed numbers for quick review
echo ""
echo "============ FINAL VAL_BPB BY SEED ============"
for SEED in 42 1337 2024; do
echo "--- seed $SEED ---"
grep -E "(final_int6_sliding_window|final_int6_slot|final_int6_ttt|post-prequant-ttt|val_calib_gptq|slot_eval:done)" "train_seed${SEED}.log" || true
done
echo "==============================================="

**`submission.json`**
{
"name": "Pre-Quant TTT 11ep + Val-Calibrated GPTQ + SLOT-24 — Quad-Stack Synthesis",
"author": "owizdom",
"github_id": "owizdom",
"date": "2026-04-09",
"track": "10min_16mb",
"val_bpb": null,
"val_bpb_pending_compute": true,
"val_bpb_projected_range": [0.78, 0.86],
"val_bpb_projected_center": 0.82,
"bytes_total": null,
"blurb": "Four val-data adaptations stacked for the first time on this challenge. (1) Pre-Quant AdamW TTT pushed to 11 epochs / freeze_blocks=0 (Track A, baked into artifact). (2) Val-Calibrated GPTQ — Hessian H=X^T X computed from validation activations instead of training activations, aligning the one-shot quant decision with the eval distribution (Track A, novel on the modern stack). (3) SLOT-24 — per-window AdamW optimization of a hidden delta and logit bias on the frozen post-quant model, 24 cosine-decayed steps, throwaway parameters (frozen-model adaptation, ported from PR #1488 / #1313). (4) Optional eval-time legal score-first TTT, disabled by default (SLOT supersedes it within the eval budget). Architecture, optimizer, training loop, EMA, and quantization machinery are unchanged from the PR #1487 base. Code: ~470 added lines in 6 focused patches; py_compile clean.",
"base_pr": 1487,
"base_val_bpb": 1.0600,
"validation_status": "pending_compute",
"validation_cost_estimate_usd": [15, 25],
"compliance": {
"track_a_artifact_baked": true,
"slot_frozen_model_per_window": true,
"score_before_update": true,
"single_pass": true,
"no_ngram_cache": true,
"no_etlb": true,
"no_cross_window_leakage": true
},
"techniques": [
"Pre-Quant AdamW TTT (11 epochs, freeze_blocks=0)",
"Val-Calibrated GPTQ (Hessians from val activations)",
"SLOT-24 (per-window hidden delta + logit bias, 24 AdamW steps)",
"3-layer depth recurrence (layers 3,4,5 -> 13 virtual)",
"Parallel residuals from layer 7+",
"EMA decay 0.9965",
"QK-Gain 5.5 (per-head learnable)",
"MuonEq-R optimizer",
"SDClip GPTQ int6 + int8 embeddings + brotli compression",
"Selective +-1 pruning to fit 16 MB",
"Sliding window eval (stride=64) for baseline reporting"
],
"credits": {
"pr1487": "ndokutovich — base train_gpt.py, Pre-Quant AdamW TTT, depth recurrence, parallel residuals, EMA, MuonEq-R, SDClip GPTQ machinery, 16 MB selective pruning",
"pr1485": "ndokutovich — predecessor stack",
"pr1488": "ndokutovich — SLOT + Pre-Quant TTT reference",
"pr1313": "anthony-maio — original SLOT-24 implementation",
"pr1019": "abaybektursun — val-calibrated GPTQ ablation; SDClip GPTQ + actorder + Cholesky machinery",
"pr1394": "clarkkev — SP8192 + GPTQ embeddings + MuonEq-R + depth recurrence",
"pr1413": "dexhunter — SP8192 base, legal score-first TTT framework",
"pr549": "abaybektursun — LeakyReLU2 + score-first TTT + Parallel Muon",
"pr1412": "Robby955 — parallel residuals",
"pr1204": "msisovic — parallel residuals",
"pr1423": "aryanbhosale — Pre-Quant AdamW TTT origin",
"pr1445": "X-Abhishek-X — hyperparameter tuning"
}
}