@@ -0,0 +1,71 @@
# Medusa: Unstable — DeltaNet Crawler, Frugendorff Continuation

**val_bpb: PENDING** (3-seed mean) | **~9.96MB** | 8xH100 SXM | Successor to PR #990 (ClownCar, 1.1813)

> **Catalyst:** PR #875 (@shalyhinpavel, Pure Neural GDN, 1.0226 BPB) showed that Gated DeltaNet
> is a leading architecture for this competition. Medusa builds on the same foundation:
> the `chunk_delta_rule` kernel that powers GDN's state updates also runs inside the
> Frugendorff crawler topology here. Different architectures, same core mechanism.

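> **Stability note:** This submission shows large cross-seed variance: sliding-window BPB ranges
> from 0.8104 to 1.2269 across three seeds (std dev 0.1724; see the results table). The DeltaNet
> heads introduce a seed sensitivity not present in ClownCar (std dev 0.00015 per the submission notes).
> The best seed (42) improves on ClownCar's 1.1813, while the worst (1337) falls behind it.
> Stabilization research is ongoing, with Medusa_VII next.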

## Results

| Seed | BPB (sliding window) | Size (int6+zstd) | Post-EMA BPB | Steps |
|------|---------------------:|-----------------:|-------------:|------:|
| 42 | **0.8104** ← best | 9.96MB | 0.2519 | 4872 |
| 300 | 0.9578 | 9.97MB | 0.3882 | 4880 |
| 1337 | 1.2269 | 9.96MB | 0.7126 | 4876 |
| **Mean** | **0.9984** | | | |
| **Std dev (population)** | **0.1724** | | | |
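
The summary rows can be checked directly from the per-seed values. The sketch below uses only Python's standard `statistics` module and the rounded figures from the table. It suggests that the reported 0.1724 matches the population convention (dividing by n); the sample convention (dividing by n - 1) gives a larger value for three seeds.

```python
from statistics import mean, pstdev, stdev

# Sliding-window BPB per seed, copied from the results table.
bpb_by_seed = {42: 0.8104, 300: 0.9578, 1337: 1.2269}

values = list(bpb_by_seed.values())
seed_mean = mean(values)
pop_std = pstdev(values)     # population std dev: divides by n
sample_std = stdev(values)   # sample std dev: divides by n - 1

print(f"mean={seed_mean:.4f}  population std={pop_std:.4f}  sample std={sample_std:.4f}")
```

With only three seeds the two conventions differ noticeably, so it helps to state which one a reported spread uses.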

## What Changed vs PR #990 (ClownCar)

| Change | Reason |
|--------|--------|
| `DELTA_NET_HEADS=4` | Canonical FLA DeltaNet enabled (vs 0 in ClownCar) |
| `LOOP_AWARE_GPTQ=1` | Two-phase GPTQ calibration: phase 1 collects flat-layer Hessians; phase 2 collects crawler Hessians from activations produced by the already-quantized flat layers, which better approximates inference conditions |
| `EMA_START_STEP=4400` + `EMA_DECAY=0.99` | Late-start EMA, re-initialized at warmdown onset; the fast decay keeps the average close to the warmdown weights |
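
To make the late-start EMA row concrete, here is a minimal sketch under stated assumptions. The function `update_ema` and the dict-of-floats weight format are hypothetical stand-ins, not the training script's actual code; only the values `4400` and `0.99` come from the table. With a decay of 0.99, the average effectively covers roughly the last 1 / (1 - 0.99) = 100 steps.

```python
def update_ema(ema, weights, step, start_step=4400, decay=0.99):
    """Return the updated EMA dict, or None while the EMA is not yet active.

    Hypothetical helper: `weights` maps parameter names to floats.
    """
    if step < start_step:
        return None                       # EMA is switched off before start_step
    if ema is None or step == start_step:
        return dict(weights)              # (re-)initialize from the current weights
    # Standard exponential moving average update.
    return {name: decay * ema[name] + (1.0 - decay) * weights[name]
            for name in weights}


# Toy usage: a single scalar "weight" whose value equals the step number.
ema = None
history = []
for step in range(4398, 4403):
    weights = {"w": float(step)}
    ema = update_ema(ema, weights, step)
    history.append(ema)
```

In this toy run the first two entries of `history` are `None`, the third is a copy of the step-4400 weights, and later entries move toward the current weights by 1% per step.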

## Architecture

- **Topology**: 4 flat layers + 1 crawler layer × 4 loops (Frugendorff compression)
- **INST_DIM**: 32 (flow instructions)
- **DeltaNet**: 4 heads, canonical `chunk_delta_rule` from `fla.ops.delta_rule`
- **Quantization**: int6+zstd + CRAWLER_QUANT_INT8=1, loop-aware GPTQ (41 layers)
- **Dims**: XSA_LAST_N=11, BIGRAM_VOCAB_SIZE=2048, ROPE_DIMS=16
- **Schedule**: WARMDOWN_ITERS=2000, SWA_EVERY=50, EMA_START_STEP=4400
- **N-gram eval**: DISABLED (sliding window only)
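
The topology and the loop-aware calibration listed above can be illustrated with a toy, pure-Python sketch. Everything here is an assumption made for illustration: layers are plain linear maps with no attention, DeltaNet heads, or flow instructions; `fake_quantize` is round-to-nearest rather than a real GPTQ solve; and accumulating one Hessian for the shared crawler weight across all loops is a guess at what "loop-aware" means, not a confirmed detail of the submission.

```python
import random

DIM, NUM_FLAT, NUM_LOOPS = 4, 4, 4   # 4 flat layers, 1 crawler applied 4 times


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def add_outer(H, x):
    """Accumulate x x^T into H in place (Hessian proxy for a linear layer's input)."""
    for i in range(len(x)):
        for j in range(len(x)):
            H[i][j] += x[i] * x[j]


def fake_quantize(W, step=0.05):
    """Round-to-nearest stand-in for the real GPTQ weight solve."""
    return [[round(w / step) * step for w in row] for row in W]


def zeros():
    return [[0.0] * DIM for _ in range(DIM)]


def calibrate(flat, crawler, batches):
    # Phase 1: full-precision flat layers; record the inputs to each flat layer.
    flat_H = [zeros() for _ in flat]
    for x in batches:
        for W, H in zip(flat, flat_H):
            add_outer(H, x)
            x = matvec(W, x)

    # A real pipeline would quantize the flat layers using flat_H here.
    flat_q = [fake_quantize(W) for W in flat]

    # Phase 2: the crawler sees activations produced by the quantized flat
    # layers, and its shared weight accumulates statistics over every loop.
    crawler_H = zeros()
    for x in batches:
        for W in flat_q:
            x = matvec(W, x)
        for _ in range(NUM_LOOPS):
            add_outer(crawler_H, x)
            x = matvec(crawler, x)
    return flat_H, crawler_H


rng = random.Random(0)


def rand_matrix():
    return [[rng.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in range(DIM)]


flat = [rand_matrix() for _ in range(NUM_FLAT)]
crawler = rand_matrix()
batches = [[rng.uniform(-1.0, 1.0) for _ in range(DIM)] for _ in range(8)]
flat_H, crawler_H = calibrate(flat, crawler, batches)
```

The point of the second phase is that the crawler's calibration statistics reflect the slightly perturbed inputs it will actually receive once the flat layers are quantized.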

## Known Issues

The DeltaNet heads introduce cross-seed instability. Investigation identified two causes:
1. **State dtype bug**: `chunk_delta_rule` returns a Float32 `new_state` during BF16 training. This is fixed in follow-on work (Medusa_V: `new_state.to(dtype)`).
2. **Quantization unravel**: DeltaNet weight quantization errors compound through the 4 crawler loops. This is an active research area.
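
The compounding effect in the second cause can be seen in a deliberately simple toy. The sketch below assumes a purely linear 2x2 "crawler" weight and a coarse round-to-nearest grid (step 0.25, far coarser than int6) so the effect is visible by hand. It is not the model's actual layer or quantizer, and the names are hypothetical.

```python
def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def fake_quantize(W, step):
    """Round each weight to the nearest multiple of `step`."""
    return [[round(w / step) * step for w in row] for row in W]


def loop_errors(W, x, num_loops=4, step=0.25):
    """Distance between exact and quantized outputs after each shared-weight loop."""
    W_q = fake_quantize(W, step)
    exact, approx, errors = list(x), list(x), []
    for _ in range(num_loops):
        exact = matvec(W, exact)        # full-precision path
        approx = matvec(W_q, approx)    # same weight, quantized, reused every loop
        errors.append(sum((a - e) ** 2 for a, e in zip(approx, exact)) ** 0.5)
    return errors


errors = loop_errors([[0.9, 0.3], [-0.3, 0.9]], [1.0, 0.0])
for loop, err in enumerate(errors, start=1):
    print(f"loop {loop}: error {err:.4f}")
```

Because the same perturbed weight is applied on every pass, the error in this example grows with each loop instead of staying at its single-pass level.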

## Legality

1. No n-gram eval — sliding window only
2. No val data used during training
3. int6 quantization runs inside training wallclock
4. Score-first protocol not applicable (no n-gram cache)

## Reproduce

```bash
SEED=300 bash experiments/Medusa_IV/run.sh
SEED=1337 bash experiments/Medusa_IV/run.sh
SEED=42 bash experiments/Medusa_IV/run.sh
```

8xH100 SXM, 600s training per seed.

## Credits

- **Gated DeltaNet (GDN), primary catalyst**: @shalyhinpavel (PR #875), whose pure-neural GDN reached 1.0226 BPB. Medusa applies the same `chunk_delta_rule` mechanism inside the crawler topology.
- **Canonical DeltaNet kernel**: `fla.ops.delta_rule` (flash-linear-attention)
- **Loop-aware GPTQ**: @newjordan (Medusa series)
- **Frugendorff crawler architecture + flow instructions**: @newjordan (PR #990)
- **FX_Wing_Delta base**: @newjordan
@@ -0,0 +1,36 @@
{
  "author": "Frosty40",
  "github_id": "newjordan",
  "name": "Medusa: DeltaNet (DELTA_NET_HEADS=4) + Loop-Aware GPTQ + Late-Start EMA",
  "blurb": "Successor to PR #990 (ClownCar, 1.1813 BPB). Catalyzed by PR #875 (@shalyhinpavel, GDN 1.0226). Adds DELTA_NET_HEADS=4 (canonical chunk_delta_rule), loop-aware 2-phase GPTQ, late-start EMA (step 4400, decay=0.99). 4 flat + 1 crawler x 4 loops, INST_DIM=32. NOTE: this variant (Medusa_IV) has state dtype bug in eval path; see Medusa_V for fix.",
  "date": "2026-03-28",
  "seed_300": {
    "val_bpb": 0.3736,
    "sliding_window_bpb": 0.95777934,
    "post_ema_bpb": 0.3882,
    "steps": 4880,
    "train_time_s": 600,
    "eval_time_s": "~110s"
  },
  "seed_1337": {
    "val_bpb": 0.6989,
    "sliding_window_bpb": 1.22693269,
    "post_ema_bpb": 0.7126,
    "steps": 4876,
    "train_time_s": 600,
    "eval_time_s": "~108s"
  },
  "seed_42": {
    "val_bpb": 0.2441,
    "sliding_window_bpb": 0.81041025,
    "post_ema_bpb": 0.2519,
    "steps": 4872,
    "train_time_s": 600,
    "eval_time_s": "~124s"
  },
  "val_bpb": 0.9984,
  "bytes_total": 10031847,
  "bytes_code": 180226,
  "hardware": "8xH100 SXM",
  "notes": "High cross-seed variance (std dev 0.1724 vs ClownCar 0.00015). Best seed: 42 at 0.8104. DeltaNet heads introduce seed sensitivity. Stabilization ongoing in Medusa_VII."
}