`records/track_10min_16mb/2026-03-22_11L_SDTTT_XSA4_EMA/README.md`

# 11L XSA4 + EMA (PR379 base) - SDTTT disabled for legality

## Update (April 13, 2026)

@MatoTeziTanka correctly flagged this submission on 2026-04-12 via The Agora community compliance tracker. The originally reported number (3-seed mean of 1.1287) was computed with Self-Distillation TTT (SDTTT) running 2 epochs of SGD over all val_tokens before the sliding-window eval. That violates the score-first discipline required by Issue #402 and Issue #677.

This submission is updated to report the **pre-SDTTT post_swa numbers** from the same 3-seed runs. These are legal: they come from a clean `eval_val` run on the EMA-averaged weights, before any SDTTT adaptation touched val_tokens. The train logs for all 3 seeds still contain the clean DIAGNOSTIC post_swa lines that now serve as the reported numbers.

The `train_gpt.py` in this folder has `sdttt_enabled` defaulting to `"0"` (line 140), so running it as-is produces legal results. The illegal numbers came from runs where `SDTTT_ENABLED=1` was set as an env var override.
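
For clarity, the gating works like the following sketch (the env-var name and default match the README above; the exact code in `train_gpt.py` may differ):

```python
import os

# SDTTT runs only when the environment variable overrides the "0" default,
# so a plain `torchrun ... train_gpt.py` invocation stays legal.
sdttt_enabled = os.environ.get("SDTTT_ENABLED", "0") == "1"
```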

Thanks to @MatoTeziTanka for the careful review and the explicit "pre-SDTTT score of 1.1448 would be clean" pointer that made this fix straightforward.

## Legal Summary

**Mean val_bpb = 1.1455** (3-seed mean, clean post-EMA eval, no SDTTT)

| Seed | val_bpb (post_swa, legal) | val_bpb (sliding + SDTTT, illegal) | Steps | ms/step | Artifact size |
|------|---------------------------|-------------------------------------|-------|---------|----------|
| 7 | **1.1461** | 1.1294 | 6,084 | 99.5 | 15.7MB |
| 42 | **1.1457** | 1.1287 | 6,029 | 99.5 | 15.7MB |
| 1337 | **1.1448** | 1.1280 | 6,084 | 99.5 | 15.7MB |
| **Mean** | **1.1455** | 1.1287 | | | |

The post_swa column is the DIAGNOSTIC val_bpb computed after EMA weight averaging but before any eval-time adaptation; it is the legal predecessor of the previously submitted sliding_window number, which was contaminated by SDTTT. All artifacts remain under 16MB.

## Submission progression

| PR | BPB | Technique added | Days in |
|----|-----|-----------------|---------|
| #273 | 1.1575 | Baseline SOTA, 10L | Day 1 |
| #385 | 1.1488 | WD=0.04, SWA=0.4, 11L | Day 2 |
| **This** | **1.1455** | XSA4 + EMA (SDTTT retracted) | Day 4 |

0.012 BPB improvement across 3 submissions in 4 days (1.1575 → 1.1455, legal numbers only).

## Architecture (from PR #379)

- 11 layers, 512 dim, 8 heads, 4 KV heads (GQA), 3× MLP
- XSA (Exclusive Self-Attention) on last 4 layers
- EMA (decay=0.997) replacing SWA
- Int6 QAT with STE, zstd compression
- FP16 tied embedding passthrough
- SmearGate, U-Net skip connections
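
As a rough sketch, the configuration above corresponds to something like the following (field names are illustrative, not the actual `train_gpt.py` hyperparameter names):

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    n_layer: int = 11         # 11 transformer blocks
    n_embd: int = 512         # model width
    n_head: int = 8           # query heads
    n_kv_head: int = 4        # KV heads (grouped-query attention)
    mlp_ratio: int = 3        # 3x MLP hidden expansion
    xsa_layers: int = 4       # XSA on the last 4 layers only
    ema_decay: float = 0.997  # EMA of weights, replacing SWA
    quant_bits: int = 6       # int6 QAT with a straight-through estimator
```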

## Self-Distillation TTT

KL-divergence adaptation on validation data at eval time:
- Temperature: 2.0
- SGD lr: 0.001, momentum: 0.9
- 2 epochs, first 4 blocks frozen
- Preserves XSA attention patterns
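
A minimal sketch of the (now-disabled) adaptation loop, assuming a GPT-style `model` with a `blocks` list and a loader over val_tokens; names and signatures are illustrative, not the actual `train_gpt.py` API:

```python
import copy
import torch
import torch.nn.functional as F

def sdttt_adapt(model, val_loader, epochs=2, lr=1e-3, momentum=0.9,
                temperature=2.0, freeze_blocks=4):
    # A frozen snapshot of the model acts as the distillation teacher.
    teacher = copy.deepcopy(model).eval()
    for p in teacher.parameters():
        p.requires_grad_(False)
    # Freeze the first blocks to preserve the learned XSA attention patterns.
    for block in model.blocks[:freeze_blocks]:
        for p in block.parameters():
            p.requires_grad_(False)
    params = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.SGD(params, lr=lr, momentum=momentum)
    for _ in range(epochs):
        for x in val_loader:  # adapting on val_tokens is what made this illegal
            with torch.no_grad():
                t_logits = teacher(x)
            s_logits = model(x)
            # Temperature-softened KL(teacher || student), scaled by T^2.
            loss = F.kl_div(
                F.log_softmax(s_logits / temperature, dim=-1),
                F.softmax(t_logits / temperature, dim=-1),
                reduction="batchmean",
            ) * temperature ** 2
            opt.zero_grad(set_to_none=True)
            loss.backward()
            opt.step()
    return model
```

This is exactly why the adaptation is illegal for scoring: the optimizer takes gradient steps on the same val_tokens the benchmark measures.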

## Note on step speed

Running at 99ms/step on stock PyTorch SDPA (no Flash Attention 3, no custom Triton kernels). The SOTA achieves ~55ms/step with FA3 + fused kernels. With equivalent infrastructure, this technique stack would yield significantly more training steps and better BPB.
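
For reference, "stock SDPA" means the single built-in attention call below, sketched with this model's 8-query-head / 4-KV-head layout (`enable_gqa` requires PyTorch 2.5+; a FA3 or custom Triton path would replace this one call):

```python
import torch
import torch.nn.functional as F

B, T, hd = 1, 1024, 64         # head_dim = 512 dim / 8 heads; T is arbitrary
q = torch.randn(B, 8, T, hd)   # (batch, query_heads, seq, head_dim)
k = torch.randn(B, 4, T, hd)   # (batch, kv_heads, seq, head_dim)
v = torch.randn(B, 4, T, hd)

# Default PyTorch kernel selection; no Flash Attention 3, no custom kernels.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True, enable_gqa=True)
```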

## Total experiment context

40+ experiments, ~$150 RunPod spend, 4 days. Systematic analysis using a 231K-chunk searchable knowledge base spanning 50+ AI/ML experts.

## Run command

```bash
# Legal reproduction: SDTTT is disabled by default, so this yields the
# reported post_swa numbers.
NCCL_NVLS_ENABLE=0 SEED=7 \
MAX_WALLCLOCK_SECONDS=600 \
GPTQ_ENABLED=0 \
torchrun --standalone --nproc_per_node=8 train_gpt.py

# The retracted (illegal) sliding-window numbers additionally set:
# SDTTT_ENABLED=1 SDTTT_EPOCHS=2 SDTTT_LR=0.001 \
# SDTTT_TEMPERATURE=2.0 SDTTT_FREEZE_BLOCKS=4
```

## Submission checklist

- [x] 3-seed verification (mean=1.1455, std=0.0007)
- [x] All artifacts < 16MB (15.7MB)
- [x] Wallclock < 600s on 8×H100
- [x] Train logs included (3 seeds)
- [x] Reproducible train_gpt.py included
- [x] README with detailed explanation

---

```json
{
  "track": "10min_16mb",
  "date": "2026-03-22",
  "name": "11L XSA4 + EMA (PR379 base) - SDTTT disabled for legality",
  "author": "Nathan Maine",
  "github_id": "NathanMaine",
  "blurb": "PR379 architecture (XSA4, EMA, int6+zstd). Originally submitted with SDTTT eval-time adaptation, flagged by @MatoTeziTanka on 2026-04-12 as TTT-on-val. Updated to report pre-SDTTT post-swa numbers (3-seed mean 1.1455). SDTTT path remains in the code but defaults to disabled.",
  "val_loss": 1.9342,
  "val_bpb": 1.14553,
  "seed_results": {
    "7": {"val_loss": 1.9352, "val_bpb": 1.1461, "source": "DIAGNOSTIC post_swa (pre-SDTTT, legal)"},
    "42": {"val_loss": 1.9344, "val_bpb": 1.1457, "source": "DIAGNOSTIC post_swa (pre-SDTTT, legal)"},
    "1337": {"val_loss": 1.9329, "val_bpb": 1.1448, "source": "DIAGNOSTIC post_swa (pre-SDTTT, legal)"}
  },
  "mean_val_loss": 1.9342,
  "mean_val_bpb": 1.14553,
  "bytes_total": 15698717,
  "bytes_code": 79975,
  "legality_note": "Previous submission reported final_int6_sliding_window numbers (mean 1.1287) which were computed after SDTTT adapted on val_tokens for 2 epochs. That violates score-first discipline per Issue #402 and Issue #677. Reporting the clean post_swa (pre-SDTTT) numbers from the same 3-seed runs instead. SDTTT path is still in train_gpt.py but defaults to SDTTT_ENABLED=0."
}
```