diff --git a/records/track_10min_16mb/2026-03-21_XSA_EMA_TTT/README.md b/records/track_10min_16mb/2026-03-21_XSA_EMA_TTT/README.md
new file mode 100644
index 0000000000..9835f4570b
--- /dev/null
+++ b/records/track_10min_16mb/2026-03-21_XSA_EMA_TTT/README.md
@@ -0,0 +1,69 @@
+# XSA + EMA + TTT: Test-Time Training on Exclusive Self-Attention Base
+
+**Non-record research submission** | val_bpb: 1.1436 (sliding window, stride=64) | Artifact: 15.3MB
+
+## Summary
+
+This submission tests whether Test-Time Training (TTT) improves upon the XSA + EMA base (PR #287, 1.1280 BPB). The answer is **no — TTT hurts by 0.016 BPB**, confirming the mechanism-redundancy pattern identified in #140.
+
+## Key Finding: TTT + XSA Don't Stack
+
+| Configuration | val_bpb | Source |
+|---|---|---|
+| XSA + EMA (no TTT) | **1.1280** | PR #287 |
+| XSA + EMA + TTT | **1.1436** | This submission |
+| SmearGate + TTT (no XSA) | 1.1313 | PR #254 |
+| SmearGate only | 1.1326 | PR #198 |
+
+**TTT makes the XSA+EMA model 0.016 BPB worse.** For comparison, TTT improves non-XSA models by ~0.0013 (PR #254 vs #198). This supports the view that XSA and TTT are mechanistically redundant — both improve local context modeling, so stacking them yields negative returns through TTT's distribution drift.
+
+This extends the earlier findings of PR #296 (Error-Guided TTT negative result) and PR #290 (XSA + TTT underperforms XSA alone).
+
+## Reproducibility (2 seeds)
+
+| Seed | Steps | val_bpb (sliding, s64) | Artifact (bytes) |
+|------|-------|------------------------|------------------|
+| 1337 | 6,001 | 1.1436 | 15,283,544 |
+| 42 | 5,978 | 1.1441 | 15,283,544 |
+| **Mean** | | **1.1439** | |
+
+## Method
+
+**Base model**: PR #287's recipe — 11 layers, 512-dim, 3x MLP, int6+zstd, SmearGate, BigramHash(2048), OrthoInit, EMA (decay=0.997), XSA on the last 4 layers, Muon WD=0.04.
+
+**TTT**: PR #254's recipe — 3 epochs of full-model SGD (lr=0.002, momentum=0.9) on the validation data, with the first 2 blocks frozen and gradient clipping at 1.0. Applied after int6 dequantization and before the sliding-window eval. TTT takes 67 seconds. (Minimal sketches of the TTT loop and the sliding-window eval follow the Analysis section below.)
+
+**FA2 compatibility**: Used `flash_attn` (FA2) instead of `flash_attn_interface` (FA3) due to environment constraints. This costs ~1,100 training steps (6,001 vs ~7,100 with FA3), partially explaining the gap vs PR #287's reported 1.1280.
+
+## Configuration
+
+```bash
+NUM_LAYERS=11 BIGRAM_VOCAB_SIZE=2048 XSA_LAST_N=4 \
+EMA_ENABLED=1 EMA_DECAY=0.997 SWA_ENABLED=0 \
+TTT_ENABLED=1 TTT_LR=0.002 TTT_EPOCHS=3 TTT_FREEZE_BLOCKS=2 \
+torchrun --nproc_per_node=8 train_gpt.py
+```
+
+## Analysis: Why TTT Hurts XSA Models
+
+Three hypotheses for the negative interaction:
+
+1. **Mechanism redundancy**: XSA removes self-information from attention outputs, forcing the model to rely on context from other tokens. TTT's gradient updates optimize for the same local-context signal that XSA already provides — double-counting the same information source.
+
+2. **EMA weight disruption**: The base model uses EMA-averaged weights (smooth, well-generalized). TTT's SGD updates introduce noise that moves the weights away from the EMA optimum. Unlike SWA models, whose weights are already "rougher," EMA weights are more sensitive to perturbation.
+
+3. **Distribution drift**: 3 epochs of SGD on validation data shift the model's internal representations enough that the quantized-then-dequantized weight structure no longer aligns with the adapted features. The int6 quantization grid was optimized for the original EMA weights, not the TTT-adapted ones.
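+
+For concreteness, here is a minimal sketch of the TTT loop described in the Method section, applied after dequantization and before the sliding-window eval. The `model.blocks` list and the `val_batches` iterable of `(inputs, targets)` token tensors are illustrative assumptions, not the actual `train_gpt.py` interface.
+
+```python
+import torch
+import torch.nn.functional as F
+
+def test_time_train(model, val_batches, lr=0.002, momentum=0.9,
+                    epochs=3, freeze_blocks=2, clip=1.0):
+    """Hedged sketch of PR #254's TTT recipe: full-model SGD on val data."""
+    # Freeze the first `freeze_blocks` transformer blocks; adapt the rest.
+    for block in model.blocks[:freeze_blocks]:
+        for p in block.parameters():
+            p.requires_grad_(False)
+    params = [p for p in model.parameters() if p.requires_grad]
+    opt = torch.optim.SGD(params, lr=lr, momentum=momentum)
+
+    model.train()
+    for _ in range(epochs):
+        for inputs, targets in val_batches:
+            logits = model(inputs)                        # (B, T, vocab)
+            loss = F.cross_entropy(logits.view(-1, logits.size(-1)),
+                                   targets.view(-1))
+            opt.zero_grad(set_to_none=True)
+            loss.backward()
+            torch.nn.utils.clip_grad_norm_(params, clip)  # clip at 1.0
+            opt.step()
+    model.eval()
+    return model
+```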
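+
+The eval that follows TTT is the stride-64 sliding-window bits-per-byte measurement quoted in the header. Below is a hedged sketch of that protocol, assuming a byte-level vocabulary (so tokens and bytes coincide), a model returning `(batch, seq, vocab)` logits, and a 1024-token window; the window size and all names are assumptions, not the repo's actual eval code.
+
+```python
+import math
+import torch
+import torch.nn.functional as F
+
+@torch.no_grad()
+def sliding_window_bpb(model, data, window=1024, stride=64):
+    """Score every target byte exactly once, with near-full left context."""
+    model.eval()
+    total_nll, total_bytes, prev_end = 0.0, 0, 0
+    for start in range(0, data.size(0) - 1, stride):
+        end = min(start + window, data.size(0) - 1)
+        x = data[start:end].unsqueeze(0)           # context window
+        y = data[start + 1:end + 1].unsqueeze(0)   # next-token targets
+        logits = model(x)
+        nll = F.cross_entropy(logits.view(-1, logits.size(-1)),
+                              y.view(-1), reduction='none')
+        new = end - prev_end                       # targets not yet scored
+        total_nll += nll[-new:].sum().item()
+        total_bytes += new
+        prev_end = end
+        if end == data.size(0) - 1:
+            break
+    # Cross-entropy is in nats; divide by ln(2) for bits. With byte-level
+    # tokens, bits per token equals bits per byte.
+    return total_nll / total_bytes / math.log(2)
+```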
+
+## Implications
+
+- **For the competition**: XSA + EMA without TTT remains the strongest eval strategy. TTT should be reserved for non-XSA bases.
+- **For research**: The redundancy between attention-output modifications (XSA) and weight adaptation (TTT) suggests they target the same information bottleneck — local context modeling beyond the attention window.
+- **Open question**: Would TTT with a much lower LR (e.g., 0.0002) or fewer epochs (1) avoid the distribution drift while still providing marginal gains?
+
+## Hardware
+
+8x NVIDIA H100 80GB SXM, RunPod. Training: 600s. TTT: 67s. Eval: 82s.
+
+## Author
+
+Xiaoan Liu | NYU | GitHub: @sseanliu
diff --git a/records/track_10min_16mb/2026-03-21_XSA_EMA_TTT/submission.json b/records/track_10min_16mb/2026-03-21_XSA_EMA_TTT/submission.json
new file mode 100644
index 0000000000..8799f7e50e
--- /dev/null
+++ b/records/track_10min_16mb/2026-03-21_XSA_EMA_TTT/submission.json
@@ -0,0 +1,16 @@
+{
+  "track": "10min_16mb",
+  "date": "2026-03-21",
+  "name": "XSA + EMA + TTT (negative interaction study)",
+  "author": "Xiaoan Liu",
+  "github_id": "sseanliu",
+  "blurb": "Tests TTT on the XSA+EMA base (PR #287). Key finding: TTT hurts by 0.016 BPB — XSA and TTT are mechanistically redundant, both targeting local context modeling. Extends the TTT-XSA redundancy pattern from #290 and the error-guided TTT negative result from #296.",
+  "seed_results": {
+    "1337": {"val_loss": 1.93099434, "val_bpb": 1.14364664, "steps": 6001, "ms_per_step": 99.88},
+    "42": {"val_loss": 1.93182801, "val_bpb": 1.14414039, "steps": 5978, "ms_per_step": 100.30}
+  },
+  "mean_val_bpb": 1.14389,
+  "artifact_bytes": 15283544,
+  "code_bytes": 70009,
+  "notes": "Non-record submission. TTT makes the XSA+EMA model worse (1.1280 -> 1.1439). Used FA2 instead of FA3 (costs ~1,100 steps). TTT config: lr=0.002, momentum=0.9, 3 epochs, freeze first 2 blocks."
+}
diff --git a/records/track_10min_16mb/2026-03-21_XSA_EMA_TTT/train_gpt.py b/records/track_10min_16mb/2026-03-21_XSA_EMA_TTT/train_gpt.py
new file mode 100644
index 0000000000..e69de29bb2