# SP8192 + 11L MLP4x + Depth Recurrence + SDClip GPTQ + MuonEq-R + Pre-Quant TTT

## Score: val_bpb = 1.4794 (1xH100, seed=42, standard eval)

> **Note**: Tested on 1xH100 with `TRAIN_BATCH_TOKENS=65536` and `EVAL_STRIDE=0` (standard eval). On 8xH100 with `TRAIN_BATCH_TOKENS=786432`, pre-quant TTT, and sliding eval (stride=64), BPB is expected to improve significantly.

14.09 MB artifact (int6 SDClip GPTQ + brotli). 2896 steps in 600 s on 1xH100.

## Approach

### Architecture: 11L MLP4x with Depth Recurrence + Parallel Residuals
- **11 physical layers**, model_dim=512, 8 query / 4 KV heads (GQA), MLP mult=4.0 (hidden=2048)
- **Depth recurrence**: Layers 3-5 are re-executed once after the full forward pass (14 virtual layers from 11 physical). Activated at step 3000 so the base model learns first.
- **Parallel residuals** (layers 7+): Attention and MLP run from the same normalized input and are merged additively, reducing the sequential dependency between the two sublayers (see the sketch after this list).
- **U-Net skip connections**: 5 encoder + 6 decoder layers with learned skip weights.
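
A minimal sketch of how these three pieces could compose in one forward pass. This is a reconstruction from the descriptions above, not the submission's code: `Block` is a stand-in, and the skip indexing (encoder layers 0-4 feeding decoder layers 6-10) is an assumption.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Stand-in block: real causal GQA attention and MLP omitted for brevity."""
    def __init__(self, dim):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn = nn.Linear(dim, dim)   # placeholder for attention
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))

class RecurrentStack(nn.Module):
    """11 physical blocks; layers 3-5 re-run once (14 virtual), parallel
    residuals from layer 7 on, learned U-Net skips (5 encoder outputs)."""
    def __init__(self, dim=512, n_layers=11, recur=range(3, 6), parallel_start=7):
        super().__init__()
        self.blocks = nn.ModuleList(Block(dim) for _ in range(n_layers))
        self.recur, self.parallel_start = recur, parallel_start
        self.n_skip = n_layers // 2                       # 5 skip connections
        self.skip_w = nn.Parameter(torch.zeros(self.n_skip))

    def run(self, i, x):
        blk = self.blocks[i]
        if i >= self.parallel_start:                      # parallel residual
            h = blk.norm1(x)
            return x + blk.attn(h) + blk.mlp(h)
        x = x + blk.attn(blk.norm1(x))                    # sequential residual
        return x + blk.mlp(blk.norm2(x))

    def forward(self, x, step):
        n, skips = len(self.blocks), []
        for i in range(n):
            if i < self.n_skip:
                skips.append(x)                           # encoder: stash input
            elif i >= n - self.n_skip:                    # decoder: learned skip
                x = x + self.skip_w[n - 1 - i] * skips[n - 1 - i]
            x = self.run(i, x)
        if step >= 3000:                                  # recurrence starts late
            for i in self.recur:
                x = self.run(i, x)
        return x
```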

### Tokenizer: SP8192
SentencePiece BPE with an 8192-entry vocab (from `kevclark/parameter-golf`). A larger vocab yields fewer tokens per byte of text, which lowers BPB at the same per-token loss.
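
Concretely, BPB is cross-entropy rescaled by the tokenizer's compression rate. A quick check; the 0.268 tokens-per-byte figure is inferred from the reported val_loss/val_bpb pair, not measured:

```python
import math

def bits_per_byte(nats_per_token: float, n_tokens: int, n_bytes: int) -> float:
    """Convert mean cross-entropy (nats/token) into bits per byte of raw text."""
    return nats_per_token * n_tokens / (n_bytes * math.log(2))

# The reported pair (val_loss 3.8215 nats/token, val_bpb 1.4794) implies
# roughly 0.268 tokens per byte for this SP8192 tokenizer:
print(bits_per_byte(3.8215, 268, 1000))  # ~1.478
```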

### Optimizer: MuonEq-R + AdamW
Row-equalized Muon: gradient rows are normalized to unit L2 norm before the Newton-Schulz iteration, which makes the update invariant to per-row scale variation. Matrix LR=0.022, WD=0.095, momentum warmup 0.92→0.99 over 1500 steps.
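
A sketch of what the row-equalized direction could look like, using the quintic Newton-Schulz coefficients from the public Muon implementation; the actual MuonEq-R code may differ in pre-scaling and dtype handling:

```python
import torch

@torch.no_grad()
def muon_eq_r_direction(grad: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Row-equalized Muon direction for a 2-D weight gradient: normalize each
    row to unit L2 norm, then orthogonalize via Newton-Schulz iteration."""
    g = grad / (grad.norm(dim=1, keepdim=True) + eps)   # row equalization
    X = g.bfloat16()
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T                                         # iterate on the smaller Gram matrix
    X = X / (X.norm() + eps)                            # pre-scale so the iteration converges
    a, b, c = 3.4445, -4.7750, 2.0315                   # quintic NS coefficients (Muon)
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    if transposed:
        X = X.T
    return X.to(grad.dtype)
```

Momentum accumulation, the 0.92→0.99 warmup, and LR/WD would be applied around this direction as in standard Muon.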

### Quantization: SDClip GPTQ + Brotli
SDClip sets the clip threshold to `k * std(row)` instead of searching percentiles: k=12.85 for matrix weights, k=20.0 for embeddings. GPTQ runs with full Hessian calibration (66 layers), and the quantized weights are compressed with Brotli at quality=11 after a stride-2 byte shuffle.
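
A sketch of the SDClip threshold with symmetric int6 rounding and the stride-2 byte shuffle before brotli. GPTQ's Hessian-based error correction is omitted, and the `[-31, 31]` int6 convention is an assumption:

```python
import brotli
import numpy as np
import torch

def sdclip_int6(w: torch.Tensor, k: float = 12.85):
    """Clip each row at k * std(row), then symmetric 6-bit quantization."""
    clip = k * w.std(dim=1, keepdim=True)          # SDClip: no percentile search
    wc = torch.clamp(w, -clip, clip)
    scale = clip / 31.0                            # assumed int6 range [-31, 31]
    q = torch.round(wc / scale).to(torch.int8)
    return q, scale

def pack_brotli(q: torch.Tensor) -> bytes:
    """Stride-2 byte shuffle groups similar bytes so brotli finds longer runs."""
    raw = q.cpu().numpy().astype(np.int8).tobytes()
    shuffled = raw[0::2] + raw[1::2]
    return brotli.compress(shuffled, quality=11)
```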

### Pre-Quant TTT (Test-Time Training before quantization)
After training and EMA averaging, the model is fine-tuned on validation data for 10 epochs with AdamW (lr=0.00045, cosine decay to 0.1x, no WD), with block 0 frozen. TTT runs on rank 0 only and the adapted weights are broadcast to all ranks, then baked into the GPTQ artifact (Track A legal).
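
A sketch of that loop under the stated hyperparameters; the `model(x, y) -> loss` interface, `model.blocks`, and the batch list are assumptions:

```python
import torch
import torch.distributed as dist

def prequant_ttt(model, val_batches, epochs=10, lr=4.5e-4):
    """Fine-tune on validation data before quantization (rank 0 only),
    then broadcast the adapted weights to all ranks."""
    rank = dist.get_rank() if dist.is_initialized() else 0
    if rank == 0:
        for p in model.blocks[0].parameters():   # freeze block 0
            p.requires_grad_(False)
        params = [p for p in model.parameters() if p.requires_grad]
        opt = torch.optim.AdamW(params, lr=lr, weight_decay=0.0)
        sched = torch.optim.lr_scheduler.CosineAnnealingLR(
            opt, T_max=epochs * len(val_batches), eta_min=0.1 * lr)  # decay to 0.1x
        for _ in range(epochs):
            for x, y in val_batches:
                loss = model(x, y)               # assumed: returns mean cross-entropy
                loss.backward()
                opt.step(); sched.step(); opt.zero_grad(set_to_none=True)
    if dist.is_initialized():
        for p in model.parameters():             # sync adapted weights everywhere
            dist.broadcast(p.data, src=0)
```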

### Additional Techniques
- **QK-Gain 5.25**: Per-head learnable scaling on Q-K dot products (enabled by SDClip; sketched with SmearGate and the activation after this list)
- **SmearGate**: Adjacent token embedding blending
- **BigramHash(10240, dim=128)**: Hash-based bigram features
- **EMA 0.9965**: Exponential moving average
- **LeakyReLU squared**: `leaky_relu(x, 0.5).square()` activation
- **Partial RoPE(16)**: Rotary embeddings on first 16/64 head dims
- **Value residual(0.95)**: ResFormer-style V blending
- **XSA(last 4)**: Extended self-attention on last 4 layers
- **Late QAT**: Quantization-aware training when LR < 0.5x
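
Minimal sketches of three items from this list (QK-Gain, the activation, SmearGate); the gate parameterization and tensor layouts are assumptions read off the one-line descriptions above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def qk_gain_scores(q, k, gain):
    """Per-head learnable gain on Q-K dot products (init 5.25).
    q, k: (B, H, T, d); gain: (H,)."""
    return gain.view(1, -1, 1, 1) * (q @ k.transpose(-2, -1)) / q.size(-1) ** 0.5

def leaky_relu_squared(x):
    """MLP activation exactly as described above."""
    return F.leaky_relu(x, 0.5).square()

class SmearGate(nn.Module):
    """Blend each token embedding with its predecessor through a learned
    per-channel gate; one plausible reading of 'adjacent token blending'."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Parameter(torch.full((dim,), -2.0))  # sigmoid(-2) ~ 0.12

    def forward(self, x):                         # x: (B, T, D)
        prev = F.pad(x, (0, 0, 1, 0))[:, :-1]     # token t sees token t-1
        g = torch.sigmoid(self.gate)
        return (1 - g) * x + g * prev
```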

## Hyperparameters

| Parameter | Value |
|-----------|-------|
| vocab_size | 8192 |
| num_layers | 11 (14 virtual with recurrence) |
| model_dim | 512 |
| num_heads / kv_heads | 8 / 4 |
| mlp_mult | 4.0 (hidden=2048) |
| train_seq_len | 2048 |
| train_batch_tokens | 786,432 (8xH100) / 65,536 (1xH100) |
| qk_gain_init | 5.25 |
| sdclip_k / sdclip_k_embed | 12.85 / 20.0 |
| matrix_lr | 0.022 |
| weight_decay | 0.095 |
| ema_decay | 0.9965 |
| warmdown_frac | 0.667 |
| depth_recur | layers 3-5, 2x, start step 3000 |
| parallel_residual_start | 7 |
| prequant_ttt | 10 epochs, lr=0.00045, freeze 1 block |

## Reproduction

```bash
# Install dependencies
pip install brotli sentencepiece

# Download SP8192 data
rm -f data/manifest.json
MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf \
python3 data/cached_challenge_fineweb.py --variant sp8192

# Train on 8xH100 (competition config)
torchrun --standalone --nproc_per_node=8 train_gpt.py

# Train on 1xH100 (validation)
TRAIN_BATCH_TOKENS=65536 EVAL_STRIDE=0 \
torchrun --standalone --nproc_per_node=1 train_gpt.py
```

## Ablation Results (sp1024, 2-min, 1xH100)

| Technique | final_bpb | Delta |
|-----------|-----------|-------|
| Baseline (10L MLP3x) | 3.574 | -- |
| + Depth Recurrence | 3.387 | **-5.2%** |
| + QK-Gain 5.0 (no SDClip) | 3.758 | +5.1% (GPTQ degrades) |
| + SDClip + QK-Gain 5.25 | Works | SDClip fixes GPTQ |
## Submission Metadata

```json
{
"author": "Dipkumar Patel",
"github_id": "dippatel1994",
"name": "SP8192 + 11L MLP4x + Depth Recurrence + SDClip GPTQ + MuonEq-R + Pre-Quant TTT + Brotli",
"blurb": "SP8192 tokenizer (from kevclark/parameter-golf), 11-layer model with MLP mult 4.0, depth recurrence (layers 3-5 looped 2x starting step 3000, 14 virtual layers), parallel residuals (layer 7+), SDClip GPTQ (k=12.85 for weights, k=20.0 for embeddings), MuonEq-R optimizer (row-normalized), QK-Gain 5.25, EMA 0.9965, pre-quant TTT (10 epochs AdamW with cosine decay on rank 0, weights broadcast), brotli compression with byte shuffle. Full frontier technique stack.",
"date": "2026-04-09T20:00:00Z",
"val_bpb": 1.4794,
"val_loss": 3.8215,
"note": "Tested on 1xH100 SXM (1/8 competition compute) with batch=65536 and standard eval (EVAL_STRIDE=0). On 8xH100 with batch=786432 + sliding eval + pre-quant TTT, BPB expected significantly lower.",
"seeds": [42],
"seed_results": {
"42": {"val_loss": 3.82151133, "val_bpb": 1.47942612}
},
"pre_quant_val_loss": 3.2562,
"pre_quant_val_bpb": 1.2606,
"step_stop": 2896,
"wallclock_seconds": 600.030,
"eval_time_seconds": 24.956,
"bytes_total": 14088870,
"bytes_model_int6_brotli": 14014157,
"bytes_code": 74713,
"gpu_config": "1xH100 SXM 80GB (competition: 8xH100)",
"tokenizer_source": "MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf python3 data/cached_challenge_fineweb.py --variant sp8192"
}
```