openai · newjordan · Mar 28, 2026
diff --git a/records/track_10min_16mb/2026-03-27_Medusa_FLA_DeltaNet_NaiveInt6_8xH100/README.md b/records/track_10min_16mb/2026-03-27_Medusa_FLA_DeltaNet_NaiveInt6_8xH100/README.md
@@ -0,0 +1,71 @@
+# Medusa: Unstable — DeltaNet Crawler, Frugendorff Continuation
+
+**val_bpb: 0.9984** (3-seed mean, sliding window) | **~9.96MB** | 8xH100 SXM | Successor to PR #990 (ClownCar, 1.1813)
+
+> **Catalyst:** PR #875 (@shalyhinpavel, Pure Neural GDN, 1.0226 BPB) showed that Gated DeltaNet
+> is a leading architecture for this competition. Medusa builds on the same mechanism: the
+> `chunk_delta_rule` kernel behind GDN's state updates also runs inside the Frugendorff
+> crawler topology here. Different architectures, same foundational mechanism.
+
+> **Stability note:** This submission shows significant cross-seed variance (see results table).
+> The DeltaNet heads introduce seed sensitivity not present in ClownCar (std dev 0.00015).
+> The best seed (42, 0.8104 BPB) improves on ClownCar's 1.1813. Stabilization research is ongoing; Medusa_VII is next.
+
+## Results
+
+| Seed | BPB (sliding window) | Size (int6+zstd) | Post-EMA BPB | Steps |
+|------|---------------------:|-----------------:|-------------:|------:|
+| 42   | **0.8104** ← best | 9.96MB | 0.2519 | 4872 |
+| 300  | 0.9578 | 9.97MB | 0.3882 | 4880 |
+| 1337 | 1.2269 | 9.96MB | 0.7126 | 4876 |
+| **Mean** | **0.9984** | | | |
+| **Std dev (population)** | **0.1724** | | | |
+
+## What Changed vs PR #990 (ClownCar)
+
+| Change | Reason |
+|--------|--------|
+| `DELTA_NET_HEADS=4` | Canonical FLA DeltaNet enabled (vs 0 in ClownCar) |
+| `LOOP_AWARE_GPTQ=1` | Two-phase GPTQ calibration: phase 1 collects flat-layer Hessians; phase 2 collects crawler Hessians from activations produced by the quantized flat layers, which better approximates inference conditions |
+| `EMA_START_STEP=4400` + `EMA_DECAY=0.99` | Late-start EMA re-initialized at warmdown onset; the fast decay (0.99) tracks warmdown weights closely |
+
+## Architecture
+
+- **Topology**: 4 flat layers + 1 crawler layer × 4 loops (Frugendorff compression)
+- **INST_DIM**: 32 (flow instructions)
+- **DeltaNet**: 4 heads, canonical `chunk_delta_rule` from `fla.ops.delta_rule`
+- **Quantization**: int6+zstd + CRAWLER_QUANT_INT8=1, loop-aware GPTQ (41 layers)
+- **Dims**: XSA_LAST_N=11, BIGRAM_VOCAB_SIZE=2048, ROPE_DIMS=16
+- **Schedule**: WARMDOWN_ITERS=2000, SWA_EVERY=50, EMA_START_STEP=4400
+- **N-gram eval**: DISABLED (sliding window only)
+
+## Known Issues
+
+The DeltaNet heads introduce cross-seed instability. Investigation identified two causes:
+1. **State dtype bug**: `chunk_delta_rule` returns a Float32 `new_state` during BF16 training; fixed in follow-on work (Medusa_V casts it with `new_state.to(dtype)`)
+2. **Quantization unravel**: DeltaNet weight quantization errors compound across the 4 crawler loops; this is an active research area
+
+## Legality
+
+1. No n-gram eval — sliding window only
+2. No val data used during training
+3. int6 quantization runs inside training wallclock
+4. Score-first protocol not applicable (no n-gram cache)
+
+## Reproduce
+
+```bash
+SEED=300 bash experiments/Medusa_IV/run.sh
+SEED=1337 bash experiments/Medusa_IV/run.sh
+SEED=42 bash experiments/Medusa_IV/run.sh
+```
+
+8xH100 SXM, 600s training per seed.
+
+## Credits
+
+- **Gated DeltaNet (GDN), primary catalyst**: @shalyhinpavel (PR #875) showed GDN is a leading architecture for this competition (1.0226 BPB, pure neural). Medusa applies the same `chunk_delta_rule` mechanism inside the crawler topology.
+- **Canonical DeltaNet kernel**: `fla.ops.delta_rule` (flash-linear-attention)
+- **Loop-aware GPTQ**: @newjordan (Medusa series)
+- **Frugendorff crawler architecture + flow instructions**: @newjordan (PR #990)
+- **FX_Wing_Delta base**: @newjordan
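
The late-start EMA listed in the README's change table (`EMA_START_STEP=4400`, `EMA_DECAY=0.99`) can be illustrated with a minimal sketch. It is not taken from the submission's training code: the function name `late_start_ema`, the scalar "weight", and the linear toy trajectory are all hypothetical, and the update rule is the usual exponential moving average, assumed here rather than confirmed by the source.

```python
def late_start_ema(weights_by_step, start_step, decay):
    """EMA over a scalar weight trajectory that only starts at start_step.

    Hypothetical helper for illustration. Before start_step no average is
    kept; at start_step the average is initialized to the current weight,
    and afterwards it is updated once per step.
    """
    ema = None
    for step, w in enumerate(weights_by_step):
        if step < start_step:
            continue  # EMA not active yet
        if step == start_step:
            ema = w  # (re)initialize from the live weight
        else:
            ema = decay * ema + (1.0 - decay) * w
    return ema


# Toy trajectory (assumption): a weight shrinking linearly to 0 over 4872 steps,
# the step count reported for seed 42.
total_steps = 4872
trajectory = [1.0 - s / total_steps for s in range(total_steps + 1)]

ema = late_start_ema(trajectory, start_step=4400, decay=0.99)

# Rule of thumb: an EMA with decay d averages over roughly 1 / (1 - d) steps.
effective_window = 1.0 / (1.0 - 0.99)

print(f"final weight={trajectory[-1]:.4f} ema={ema:.4f} window~{effective_window:.0f} steps")
```

With a decay of 0.99 the average spans roughly the last 100 steps, so an EMA started at step 4400 of about 4870 has had several windows to forget its initial value. That is one way to read the README's "tracks warmdown weights closely".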
diff --git a/records/track_10min_16mb/2026-03-27_Medusa_FLA_DeltaNet_NaiveInt6_8xH100/submission.json b/records/track_10min_16mb/2026-03-27_Medusa_FLA_DeltaNet_NaiveInt6_8xH100/submission.json
@@ -0,0 +1,36 @@
+{
+  "author": "Frosty40",
+  "github_id": "newjordan",
+  "name": "Medusa: DeltaNet (DELTA_NET_HEADS=4) + Loop-Aware GPTQ + Late-Start EMA",
+  "blurb": "Successor to PR #990 (ClownCar, 1.1813 BPB). Catalyzed by PR #875 (@shalyhinpavel, GDN 1.0226). Adds DELTA_NET_HEADS=4 (canonical chunk_delta_rule), loop-aware 2-phase GPTQ, late-start EMA (step 4400, decay=0.99). 4 flat + 1 crawler x 4 loops, INST_DIM=32. NOTE: this variant (Medusa_IV) has a state dtype bug in the eval path; see Medusa_V for the fix.",
+  "date": "2026-03-28",
+  "seed_300": {
+    "val_bpb": 0.3736,
+    "sliding_window_bpb": 0.95777934,
+    "post_ema_bpb": 0.3882,
+    "steps": 4880,
+    "train_time_s": 600,
+    "eval_time_s": "~110s"
+  },
+  "seed_1337": {
+    "val_bpb": 0.6989,
+    "sliding_window_bpb": 1.22693269,
+    "post_ema_bpb": 0.7126,
+    "steps": 4876,
+    "train_time_s": 600,
+    "eval_time_s": "~108s"
+  },
+  "seed_42": {
+    "val_bpb": 0.2441,
+    "sliding_window_bpb": 0.81041025,
+    "post_ema_bpb": 0.2519,
+    "steps": 4872,
+    "train_time_s": 600,
+    "eval_time_s": "~124s"
+  },
+  "val_bpb": 0.9984,
+  "bytes_total": 10031847,
+  "bytes_code": 180226,
+  "hardware": "8xH100 SXM",
+  "notes": "High cross-seed variance (std dev 0.1724 vs ClownCar 0.00015). Best seed: 42 at 0.8104. DeltaNet heads introduce seed sensitivity. Stabilization ongoing in Medusa_VII."
+}
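
The aggregate figures can be checked against the per-seed entries in `submission.json`. The sketch below uses only the Python standard library and the three `sliding_window_bpb` values copied from the file; the variable names are illustrative. It suggests that the top-level `val_bpb` is the mean of the sliding-window scores, and that the reported std dev is consistent with the population formula (divide by N) rather than the sample formula (divide by N - 1).

```python
import statistics

# Per-seed sliding_window_bpb values copied from submission.json.
sliding_window_bpb = {
    300: 0.95777934,
    1337: 1.22693269,
    42: 0.81041025,
}

values = list(sliding_window_bpb.values())

mean_bpb = statistics.fmean(values)
pop_std = statistics.pstdev(values)    # population std dev: divides by N
sample_std = statistics.stdev(values)  # sample std dev: divides by N - 1

# Lower BPB is better, so the best seed is the one with the minimum score.
best_seed = min(sliding_window_bpb, key=sliding_window_bpb.get)

print(f"mean={mean_bpb:.4f} pop_std={pop_std:.5f} "
      f"sample_std={sample_std:.5f} best_seed={best_seed}")
```

The mean rounds to the reported 0.9984. The population std dev lands within rounding distance of the reported 0.1724, while the sample std dev is noticeably larger, at about 0.211. With only three seeds, either figure is a rough indication of spread.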