Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
@@ -0,0 +1,138 @@
# SP8192 + Headwise Gated Attention + LeakyReLU2 + QK-Gain 5.0 + Legal TTT

**val_bpb: 1.2073** (3-seed mean, std 0.0006) | **~15.34 MB** | 8xH100 SXM

## Results (8xH100 80GB SXM, PyTorch 2.7.1)

| Seed | step_avg | steps | val_bpb (TTT) | val_bpb (int8) | Artifact Size |
|------|----------|-------|---------------|----------------|---------------|
| 1337 | 54.40ms | 11,030 | 1.20665 | 1.20807 | 15,340,947 |
| 42 | 54.41ms | 11,028 | 1.20783 | 1.21029 | 15,340,685 |
| 2025 | 54.37ms | 11,036 | 1.20746 | 1.21016 | 15,337,072 |
| **Mean** | 54.39ms | 11,031 | **1.20731 (std 0.0006)** | 1.20951 | 15,339,568 |

## Key Techniques

1. **SP8192** -- 8192-token SentencePiece BPE vocabulary (@kevclark, PR #1394). All top 5 submissions use SP8192. Dataset from `kevclark/parameter-golf` HuggingFace repo.

2. **Headwise Gated Attention** -- **Original technique by James Vo.** Sigmoid gate applied per-head after scaled dot-product attention. Each head learns a scalar gate that suppresses uninformative attention patterns. Adds only ~37K parameters (~0.2% overhead). Inspired by NeurIPS 2025 Best Paper (arxiv.org/abs/2505.06708).

```python
# In CausalSelfAttention.forward():
gate_logits = self.gate_proj(x) # [bsz, seqlen, num_heads]
gate = torch.sigmoid(gate_logits)
gate = gate.unsqueeze(-1) # [bsz, seqlen, num_heads, 1]
y = y * gate # applied after SDPA, before output projection
```

3. **LeakyReLU(0.5)^2** -- Replaces ReLU^2 in MLP. Preserves small negative gradients instead of zeroing them. Zero extra parameters, zero speed penalty (@abaybektursun, PR #549).

4. **QK-Gain 5.0** -- Learnable per-head scalar that scales query vectors before attention. Initialized to 5.0 instead of the default 1.5, giving sharper attention patterns (@dexhunter, PR #1413).

5. **Score-First TTT** -- Legal test-time training. For each 32K-token chunk of the validation set: (1) SCORE under `torch.inference_mode()`, accumulating loss/bytes for BPB, (2) TRAIN on the already-scored chunk using SGD (lr=0.005, momentum=0.9, 3 epochs, gradient clip 1.0). Last chunk is score-only. Model state is saved before TTT and restored after (@dexhunter, PR #1413).

## Architecture

| Component | Setting |
|-----------|---------|
| Layers | 9 |
| Model dim | 448 (reduced from 512 to fit 16 MB budget) |
| Attention heads | 8 |
| KV heads | 4 (GQA) |
| MLP mult | 2x |
| Activation | LeakyReLU(0.5)^2 |
| Gated attention | Headwise (1 sigmoid gate per head) |
| QK-Gain init | 5.0 |
| Vocab size | 8192 (SentencePiece BPE) |
| Embeddings | Tied |
| Logit softcap | 30.0 |
| RoPE base | 10000.0 |
| Sequence length | 1024 |
| Batch tokens | 524,288 |
| Optimizer | Muon (matrix params) + Adam (scalars/embeddings) |
| Parameters | 16,364,616 |
| Quantization | int8 + zlib |

## Original Contribution

**Headwise Gated Attention** is an original technique developed for this submission. It applies a learned sigmoid gate to each attention head's output, allowing the model to dynamically suppress uninformative heads on a per-token basis. The gate is a simple linear projection from the input features to `num_heads` scalars, passed through sigmoid, then multiplied with the SDPA output.

This is distinct from the elementwise variant (which gates every dimension independently but adds too many parameters for the 16 MB budget) and from attention head pruning (which permanently removes heads rather than dynamically gating them).

The technique was inspired by the NeurIPS 2025 Best Paper "Gated Attention" (arxiv.org/abs/2505.06708), which demonstrated gating mechanisms for attention in vision transformers. We adapted the concept to the language modeling setting with a lightweight per-head variant.

## Compliance

- [x] Training under 600s (all seeds: ~600s)
- [x] Artifact under 16 MB (all seeds: ~15.34 MB, +0.54 MB headroom)
- [x] Eval under 600s (TTT eval: ~137s per seed)
- [x] No SLOT (no supervised learning on test data)
- [x] No pre-quantization TTT (TTT runs on int8+zlib quantized model)
- [x] No ETLB (no eval-time learned biases)
- [x] No n-gram cache
- [x] Score-first TTT (score chunk before training on it, last chunk score-only)
- [x] Three seeds (1337, 42, 2025)

## Run Command

```bash
# Download SP8192 dataset (NOT in official repo)
rm -f data/manifest.json
MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf \
python3 data/cached_challenge_fineweb.py --variant sp8192 --train-shards 80

# Train (change SEED for different seeds)
GATED_ATTN=headwise \
ACTIVATION=leaky_relu2 \
QK_GAIN_INIT=5.0 \
VOCAB_SIZE=8192 \
DATA_PATH=./data/datasets/fineweb10B_sp8192/ \
TOKENIZER_PATH=./data/tokenizers/fineweb_8192_bpe.model \
MODEL_DIM=448 \
NUM_HEADS=8 \
NUM_KV_HEADS=4 \
NUM_LAYERS=9 \
TTT_MODE=score_first \
TTT_LR=0.005 \
TTT_MOMENTUM=0.9 \
TTT_EPOCHS=3 \
TTT_CHUNK_TOKENS=32768 \
TTT_FREEZE_BLOCKS=0 \
TTT_GRAD_CLIP=1.0 \
TTT_BATCH_SEQS=32 \
MAX_WALLCLOCK_SECONDS=600 \
SEED=1337 \
RUN_ID=sp8192_combo_slim \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

## Ablation

Experiments run on 2xH100 (SP1024, PyTorch 2.11) to isolate technique contributions:

| Technique | val_bpb | Delta vs baseline |
|-----------|---------|-------------------|
| Baseline (GQA, SP1024) | 1.2649 | -- |
| + LeakyReLU2 | 1.2641 | -0.0008 |
| + Headwise gated attn | 1.2653 | -0.0 (trades speed for quality) |
| + LeakyReLU2 + headwise | 1.2642 | -0.0007 (don't stack on SP1024) |
| SP8192 combo slim + TTT (8xH100) | **1.2073** | **-0.0171** |

The dominant factor is SP8192 + TTT. LeakyReLU2 provides a small free improvement. Headwise gated attention contributes more with larger vocabularies where attention pattern diversity matters.

## Credits

- SP8192 tokenizer and dataset: @kevclark (PR #1394)
- LeakyReLU(0.5)^2 activation: @abaybektursun (PR #549)
- Score-First TTT + QK-Gain 5.0: @dexhunter (PR #1413)
- Headwise Gated Attention: Original -- James Vo
- Base training infrastructure: modded-nanogpt / OpenAI Parameter Golf

## Included Files

- `README.md` (this file)
- `submission.json`
- `train_gpt.py` (1529 lines, 73,221 bytes)
- `train_seed1337.log`
- `train_seed42.log`
- `train_seed2025.log`
Original file line number Diff line number Diff line change
@@ -0,0 +1,40 @@
{
"author": "James Vo",
"github_id": "jamesEmerson112",
"name": "SP8192 + Headwise Gated Attention + LeakyReLU2 + QK-Gain 5.0 + Legal TTT",
"blurb": "SP8192 + headwise gated attention (original technique: sigmoid gate per head post-SDPA, inspired by NeurIPS 2025 Best Paper arxiv.org/abs/2505.06708) + LeakyReLU(0.5)^2 + QK-Gain 5.0 + score-first TTT. 3-seed mean 1.2073 BPB (std 0.0006). MODEL_DIM=448 to fit 16 MB budget.",
"date": "2026-04-24",
"track": "10min_16mb",
"val_bpb": 1.20731,
"val_bpb_std": 0.00060,
"seeds": [1337, 42, 2025],
"seed_results": {
"1337": {"val_bpb": 1.20665395, "val_loss": 3.11691247, "artifact_bytes": 15340947},
"42": {"val_bpb": 1.20783103, "val_loss": 3.11995300, "artifact_bytes": 15340685},
"2025": {"val_bpb": 1.20745783, "val_loss": 3.11898899, "artifact_bytes": 15337072}
},
"bytes_total": 15414168,
"bytes_code": 73221,
"hardware": "8xH100 80GB SXM",
"pytorch_version": "2.7.1",
"tokenizer": "SentencePiece BPE 8192",
"architecture": "9L 448d 8h 4kv GQA headwise-gate LeakyReLU2 QK5.0 tied-embed",
"compliance": {
"train_under_600s": true,
"artifact_under_16mb": true,
"eval_under_600s": true,
"no_slot": true,
"no_pre_quant_ttt": true,
"no_etlb": true,
"no_ngram_cache": true,
"score_first_ttt": true,
"three_seeds": true
},
"attribution": {
"SP8192": "@kevclark (PR #1394)",
"LeakyReLU2": "@abaybektursun (PR #549)",
"Score-First TTT": "@dexhunter (PR #1413)",
"QK-Gain 5.0": "@dexhunter (PR #1413)",
"Headwise Gated Attention": "Original - James Vo (inspired by arxiv.org/abs/2505.06708)"
}
}
Loading