# V14: PR #1735 + TTT Weights EMA

**Base:** PR #1735 (AjAnubolu, 1.0429 BPB) — SP8192 + 3-Layer Recurrence + Parallel Residuals + QK-Gain 5.25 + 8-GPU Parallel Pre-Quant AdamW TTT

**Innovation:** Add EMA averaging to the 21-epoch pre-quant TTT phase. Instead of keeping the final epoch's weights, keep an exponentially weighted moving average of the weights across all epochs.

## Why This Should Help

AjAnubolu's TTT runs 21 epochs of AdamW with cosine LR (5e-4 -> 5e-5). Near convergence the weights oscillate around a local optimum, so keeping only the LAST epoch's weights bakes that oscillation noise into the checkpoint. EMA averaging (update rule spelled out below):

1. Smooths out late-epoch oscillation
2. Effectively averages several "good" iterates near the optimum
3. Costs <1 second of compute and 0 bytes in the artifact
4. Is a standard technique (weight EMA / Polyak averaging, used in DeepMind, OpenAI, and Meta training recipes)
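Concretely, with decay $d$ the recurrence and its unrolled form are the standard EMA identities (nothing project-specific here):

$$\theta^{\mathrm{EMA}}_t = d\,\theta^{\mathrm{EMA}}_{t-1} + (1-d)\,\theta_t = (1-d)\sum_{k=0}^{t-1} d^{k}\,\theta_{t-k} + d^{t}\,\theta_0$$

so the weights from $k$ epochs ago enter with geometrically decaying mass $(1-d)d^k$. At the default $d=0.7$, the last 5 epochs carry $1-0.7^5 \approx 83\%$ of the total mass, which is where the "last-5-epochs window" figure comes from.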

## Compliance

- **Inherits PR #1735's compliance status** (pre-quant TTT framework)
- **No additional risk**: EMA is a fixed averaging procedure, not val-loss-based selection
- **No new training**: just averages weights from existing 21 epochs

## Implementation

Two new env vars:

```bash
TTT_EMA_ENABLED=1 # default: 1 (on)
TTT_EMA_DECAY=0.7 # default: 0.7 (effective last-5-epochs window)
```
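A minimal sketch of how the patch might read these flags (variable names are my assumption; `patch_v14_ttt_ema.py` is the source of truth):

```python
import os

# Hedged sketch, not the actual parsing code in train_gpt.py
ttt_ema_enabled = os.environ.get("TTT_EMA_ENABLED", "1") == "1"
ttt_ema_decay = float(os.environ.get("TTT_EMA_DECAY", "0.7"))
```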

EMA logic (added to `pre_quant_adamw_ttt`):

```python
# Init (once, before epoch 1): clone the trainable params as the EMA state
ttt_ema_state = {n: p.data.clone() for n, p in model.named_parameters() if p.requires_grad}

# Each epoch, after the all_reduce sync (so every rank folds in identical weights):
# ema <- decay * ema + (1 - decay) * current, with decay = ttt_ema_decay (default 0.7)
for n, p in model.named_parameters():
    if n in ttt_ema_state:
        ttt_ema_state[n].mul_(ttt_ema_decay).add_(p.data, alpha=1.0 - ttt_ema_decay)

# After all epochs: load the EMA weights back into the model
for n, p in model.named_parameters():
    if n in ttt_ema_state:
        p.data.copy_(ttt_ema_state[n])
```
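As a self-contained sanity check of the intuition (purely illustrative, not from the patch): SGD on a noisy 1-D quadratic keeps oscillating around the optimum, and the EMA of the iterates typically lands closer to it than the final iterate does.

```python
import torch

torch.manual_seed(0)
x = torch.tensor([5.0])                   # start far from the optimum of f(x) = x^2
ema, decay = x.clone(), 0.7
for epoch in range(21):                   # same epoch count as the TTT phase
    grad = 2 * x + 0.5 * torch.randn(1)   # noisy gradient
    x = x - 0.1 * grad
    ema = decay * ema + (1 - decay) * x   # same update rule as above
print(f"last iterate: {x.item():+.4f}   ema: {ema.item():+.4f}   (optimum: 0)")
```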

## Usage on RunPod

```bash
# Clone this branch
cd /workspace
git clone -b v14-pr1735-ttt-ema https://github.com/alertcat/parameter-golf.git
cd parameter-golf

# Install deps (same as PR #1735)
pip install sentencepiece brotli zstandard
pip install flash_attn_3 --no-deps --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291/

# Download SP8192 data
MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf \
python3 data/cached_challenge_fineweb.py --variant sp8192

# Train + eval (TTT EMA enabled by default)
cd records/track_10min_16mb/2026-04-18_SP8192_ParallelPreQuantTTT/
SEED=1337 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

## Decision Points During Run

Watch for these log lines (TTT phase, last ~6 minutes of run):

```
prequant_ttt:start epochs=21 lr=0.0005 ...
ttt_ema:initialized decay=0.7 params=NN
prequant_ttt:epoch 1/21 val_bpb=1.06X ...
prequant_ttt:epoch 21/21 val_bpb=1.034X ... <- last epoch (baseline)
ttt_ema:loaded final EMA weights into model
ttt_ema:final val_bpb=1.0XX <- our metric (should be lower)
```

If `ttt_ema:final val_bpb` is **lower** than `prequant_ttt:epoch 21/21 val_bpb`, EMA helped.
GPTQ then quantizes the EMA weights and the sliding eval produces the final number.

## Expected Results

| Metric | PR #1735 (base) | V14 (this PR) | Delta |
|--------|----------------:|--------------:|------:|
| Pre-quant val_bpb | 1.034 | ~1.032 | -0.002 |
| Final sliding val_bpb | 1.0429 | ~1.040-1.042 | -0.001 to -0.003 |
| Artifact size | 15,991,294 | ~15,992,000 | ~+1KB (negligible) |

3-seed mean target: **1.040 BPB**

## Hyperparameter Tuning (if scout shows promise)

Try in this order (effective-window estimates per the rule of thumb below):
1. `TTT_EMA_DECAY=0.5` (faster decay, ~3-epoch window)
2. `TTT_EMA_DECAY=0.85` (slower, ~12-epoch window)
3. `TTT_EMA_DECAY=0.95` (very slow; over only 21 epochs this averages nearly the whole run, init included)
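One common rule of thumb (the effective sample size of the geometric weights; my convention, not anything stated in the patch) puts the window at roughly (1 + d) / (1 - d) epochs, which also recovers the ~5-epoch figure quoted for the default 0.7:

```python
# Effective-window rule of thumb for EMA decay d: (1 + d) / (1 - d)
for d in (0.5, 0.7, 0.85, 0.95):
    print(f"decay={d:.2f} -> ~{(1 + d) / (1 - d):.1f}-epoch window")
# decay=0.50 -> ~3.0-epoch window
# decay=0.70 -> ~5.7-epoch window
# decay=0.85 -> ~12.3-epoch window
# decay=0.95 -> ~39.0-epoch window
```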

## File Changes

- `records/track_10min_16mb/2026-04-18_SP8192_ParallelPreQuantTTT/train_gpt.py`: +60 lines (4 patch sites in `pre_quant_adamw_ttt`)
- `patch_v14_ttt_ema.py`: standalone patch script (regenerable)
- `V14_README.md`: this file

Net diff: ~+1500 bytes
# V15: PR #1735 + CaseOps Tokenizer (TTT EMA disabled)

**Base:** PR #1735 (AjAnubolu, 1.0429 BPB)
**Innovation:** Add CaseOps lossless-case tokenizer (PR #1729) on top of pre-quant TTT stack

## What V15 Does

1. **Switches the tokenizer** to `fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model` (losslessly reversible Title/AllCaps/CapNext case encoding; a toy round-trip sketch follows this list)
2. **Adds byte-sidecar support** to compute honest BPB (the CaseOps control chars would inflate naive byte counts of the transformed text and understate BPB)
3. **Disables TTT EMA** (the V14 lesson: when the TTT loss decreases monotonically there is no oscillation to average out, so EMA only lags behind the best weights)
4. **Falls back gracefully** to LUT-based per-token byte counting when no sidecar exists
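To make item 1 concrete, here is a toy lossless case transform (the marker characters and word-level rules are invented for illustration; the real CaseOps spec is defined in PR #1729):

```python
# Toy illustration only; NOT the actual CaseOps encoding.
TITLE, CAPS = "\x0e", "\x0f"  # hypothetical control characters

def encode(text: str) -> str:
    words = []
    for w in text.split(" "):
        if w.isupper() and len(w) > 1:
            words.append(CAPS + w.lower())    # "WORLD" -> CAPS + "world"
        elif w.istitle():
            words.append(TITLE + w.lower())   # "Hello" -> TITLE + "hello"
        else:
            words.append(w)                   # mixed/lower case passes through
    return " ".join(words)

def decode(text: str) -> str:
    words = []
    for w in text.split(" "):
        if w.startswith(CAPS):
            words.append(w[1:].upper())
        elif w.startswith(TITLE):
            words.append(w[1:].capitalize())
        else:
            words.append(w)
    return " ".join(words)

assert decode(encode("Hello WORLD and mixedCase")) == "Hello WORLD and mixedCase"
```

The point is reversibility: the model sees mostly lowercase text (friendlier to an 8192-token vocab), while the original casing is recoverable exactly.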

## Expected Result

| Metric | PR #1735 base | V15 (this) | Delta |
|--------|--------------:|-----------:|------:|
| Pre-quant TTT BPB | ~1.033 | ~1.025 | -0.008 |
| Final sliding BPB | 1.0429 | ~1.030-1.038 | -0.005 to -0.012 |
| Record threshold (1.0357) | NO | **YES (~50% prob)** | |

## Compliance Notes

- **CaseOps is lossless reversible** — original text can be recovered exactly
- **Byte sidecar uses RAW UTF-8 byte counts** (not transformed text) — honest BPB
- **No SLOT, no n-gram cache, no eval-time TTT** — inherits PR #1735 cleanliness
- **Pre-quant TTT remains unchanged** — same legal status as PR #1735
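For concreteness, a minimal sketch of what sidecar-based BPB means (the function name, tensor shapes, and per-token sidecar layout are my assumptions, not the actual `train_gpt.py` interface):

```python
import math
import torch

def bpb_from_sidecar(token_nll_nats: torch.Tensor, token_raw_bytes: torch.Tensor) -> float:
    """Honest bits-per-byte over the validation stream.

    token_nll_nats:  per-token validation NLL in nats from the model.
    token_raw_bytes: raw UTF-8 byte count of the ORIGINAL (pre-CaseOps) text
                     attributed to each token (this is the sidecar). Counting
                     bytes of the transformed text instead would inflate the
                     denominator and understate BPB.
    """
    total_nats = token_nll_nats.double().sum().item()
    total_bytes = token_raw_bytes.double().sum().item()
    return total_nats / (total_bytes * math.log(2))
```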

## Files Changed

- `records/track_10min_16mb/2026-04-18_SP8192_ParallelPreQuantTTT/train_gpt.py`
- Added `load_validation_token_bytes()` function
- Modified `ValidationData.__init__` to load sidecar
- Modified `eval_val()` to use sidecar
- Modified `eval_val_sliding()` to use sidecar
- Modified `eval_val_ttt()` to use sidecar
- Disabled TTT EMA by default (V14 lesson)
- `patch_v15_caseops.py`: standalone patch script
- `V15_README.md`: this file

## Usage on RunPod

### Step 1: Clone V15 branch

```bash
cd /workspace
rm -rf parameter-golf
git clone -b v15-pr1735-caseops https://github.com/alertcat/parameter-golf.git
cd parameter-golf

# Verify patches
grep -c "V15: Prefer byte sidecar" records/track_10min_16mb/2026-04-18_SP8192_ParallelPreQuantTTT/train_gpt.py
# Expected: 3
grep -c "load_validation_token_bytes" records/track_10min_16mb/2026-04-18_SP8192_ParallelPreQuantTTT/train_gpt.py
# Expected: >= 2
```

### Step 2: Install deps

```bash
pip install sentencepiece brotli zstandard huggingface-hub hf_transfer -q
pip install flash_attn_3 --no-deps --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291/ -q
```

### Step 3: Download CaseOps dataset (~5 min, 16GB)

```bash
HF_HUB_ENABLE_HF_TRANSFER=1 python3 -c "
from huggingface_hub import snapshot_download
snapshot_download(
repo_id='romeerp/parameter-golf-caseops-v1',
repo_type='dataset',
local_dir='/workspace/caseops_data',
)
"

# Verify key files
ls /workspace/caseops_data/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved/ | grep -E "val_bytes|val_000000" | head -5
ls /workspace/caseops_data/datasets/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model
```

### Step 4: Run V15 scout seed

```bash
cd /workspace/parameter-golf/records/track_10min_16mb/2026-04-18_SP8192_ParallelPreQuantTTT/

SEED=1337 \
DATASETS_DIR=/workspace/caseops_data/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved \
TOKENIZER_PATH=/workspace/caseops_data/datasets/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model \
TTT_EMA_ENABLED=0 \
PREQUANT_TTT_ENABLED=1 \
PREQUANT_TTT_EPOCHS=21 \
torchrun --standalone --nproc_per_node=8 train_gpt.py 2>&1 | tee /workspace/scout_v15.log
```

**Watch for this log line confirming sidecar is active:**
```
val_bpb:byte_sidecar:enabled
```

If you see `val_bpb:byte_sidecar:disabled`, the dataset path is wrong — bytes won't be honest.

## Decision Points

After scout (~25 min), check `final_int6_sliding val_bpb`:

| BPB | Verdict |
|-----|---------|
| ≤ 1.0357 | 🔥 **BREAK RECORD** — run seeds 42 + 999, submit |
| 1.0358-1.0399 | 👍 Strong, run 3 seeds |
| 1.0400-1.0450 | 😐 Worse than PR #1735 — investigate sidecar |
| > 1.0450 | ❌ Failure — check for the `val_bpb:byte_sidecar:enabled` line |