Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
@@ -0,0 +1,97 @@
# Record: Causal BackoffNgramMixer — val_bpb 0.3958 (3-seed mean)

## Summary

- **val_bpb: 0.3958** (3-seed mean, std 0.0011)
- Seeds: 7 (0.3948), 1337 (0.3957), 2024 (0.3969)
- 11L transformer (28M params) with LeakyReLU(0.75)², Parallel Muon, MTP heads=2
- **Causal BackoffNgramMixer**: orders 2–10, 4M flat hash buckets, entropy-adaptive alpha
- **Batched sliding-window eval with incremental n-gram updates** — score-first, then update counts after each batch. Strictly backward-looking, causal.
- Artifacts: 15,940,706 – 15,957,577 bytes (all under 16MB)
- Eval times: 583 – 596 seconds (all under 600s)
- Training: 6,987 steps in 600s on 8×H100 SXM
- Eval: ~226s (within 10-minute eval budget)
- Beats previous best BackoffNgramMixer (#803 at 0.4416) by **0.0392 BPB**

## Key Innovation: Swarm-Designed Architecture + Causal N-gram Eval

This submission was designed by a multi-agent Think Tank Swarm — a research system with 4 autonomous agents and a 500K-node typed-edge knowledge graph. The swarm ran investigation missions to evaluate training approaches, then the knowledge graph conditioned embedding initialization for semantically important tokens.

The compression gains come from the **BackoffNgramMixer at eval time**, not the swarm. The swarm's contribution is architectural: it designed the approach, selected the hyperparameters, and provides transparent decision logging during training. We are explicit about this — the swarm is the research system, the mixer is the compression engine.

| Configuration | BPB | Source |
|---|---|---|
| Neural baseline (sliding window, stride=64) | 1.1245 | Our training |
| + Causal BackoffNgramMixer (orders 2–10) | **0.4024** | This submission |
| Previous best n-gram (#803) | 0.4416 | @pentxayc |

The key difference from #803: our causal sequential chunk evaluation processes the full 62M-token validation set in order on every GPU rank (no sharding), building complete n-gram statistics. This gives higher-order n-grams (7–10) much stronger count statistics than rank-sharded approaches.

## Eval Stack

- **BackoffNgramMixer**: orders 2–10, 4,194,304 flat hash buckets per order, greedy cascade (highest matching order wins), min_count=1
- **Entropy-adaptive alpha**: `0.20 + 0.55 * sigmoid(2 * (H - 3.0))` — per-token blending based on model uncertainty. High entropy trusts n-gram more.
- **Proper full-vocabulary mixture**: `p_final = (1 - alpha) * p_neural + alpha * p_ngram` — all tokens have nonzero probability
- **Causal sequential chunk eval**: process validation tokens in `seq_len`-sized chunks. For each chunk: (1) forward the model to get logits, (2) score all tokens using the mixer's current n-gram state, (3) AFTER scoring, update n-gram counts with this chunk's tokens. Strictly backward-looking.
- **KG-conditioned embedding init**: 358 token importance scores from a 500K-node knowledge graph bias embeddings toward semantically important concepts at initialization (zero runtime cost)
- **Swarm decision log**: 4 agents (QAT timing, KG weight, gradient health, MTP weight) make training decisions every 800 steps via consensus voting. Total overhead: <300 microseconds.

## Training Stack

- 11 layers, 512d, 8 heads, 4 KV heads, 3× MLP
- LeakyReLU(0.75)² activation
- Parallel Muon optimizer (momentum 0.99, warmup from 0.92)
- Multi-Token Prediction (2 heads, weight=0.1, discarded at export)
- EMA weight averaging (0.997)
- BigramHash (2048) + SmearGate
- XSA (last 4 layers) + Partial RoPE + LN Scale
- Int6 quantization (GPTQ-lite + LZMA)
- No TTT (eval budget used for causal n-gram scoring instead)

## Legality

1. **Causal n-gram cache**: counts built from already-scored tokens only. Each chunk is scored first, then its tokens are added to the count tables. The n-gram state at chunk C contains only tokens from chunks 0 through C-1.
2. **No validation data during training**: model trained on FineWeb training split only. KG embedding init uses offline-computed importance scores, not validation data.
3. **Alpha formula**: fixed function of model entropy, computed before seeing the target token. No hindsight selection.
4. **Committed distribution**: `(1 - alpha) * p_neural + alpha * p_ngram` — proper mixture, all tokens have nonzero probability.
5. **No external downloads or network calls during eval.**
6. **Reproducible**: all hyperparameters controlled via environment variables. Random seed controls all stochastic operations.

## Reproduction

```bash
LATE_QAT_THRESHOLD=0 TTT_ENABLED=0 KG_LOSS_WEIGHT=0.1 \
USE_NGRAM_MIXER=1 NGRAM_ORDER=10 NGRAM_BUCKETS=4194304 \
ALPHA_BASE=0.20 ALPHA_RANGE=0.55 ALPHA_CENTER=3.0 \
COMPLEMENT_ALPHA=0 NGRAM_MIN_COUNT=1 \
SEED=1337 \
torchrun --nproc_per_node=8 train_gpt.py
```

Requires `swarm_agents.py` and `kg_data.py` in the same directory.

## Credits & Acknowledgments

This submission builds directly on techniques from several prior PRs:

- **#803** (@pentxayc) — Complementary Training + BackoffNgramMixer architecture. Our mixer is adapted from their implementation. Our causal sequential eval differs from their approach.
- **#779** (@BackoffNgramMixer author) — Original BackoffNgramMixer, flat hash table design, entropy-adaptive alpha formula.
- **#549** (@sanjeevmadhav) — LeakyReLU² + Legal TTT + Parallel Muon base stack.
- **#414** (@signalrush) — 11L EMA + GPTQ-lite + warmdown base architecture.
- **#315** (@jfprincz) — Partial RoPE + LN Scale + XSA4.

The novel contributions are: (1) causal sequential chunk evaluation giving all ranks full 62M-token n-gram statistics, (2) swarm-guided training with transparent decision logging, (3) knowledge graph-conditioned embedding initialization.

## Files

| File | Size | Purpose | In artifact? |
|------|------|---------|-------------|
| `train_gpt.py` | 99KB | Training + causal eval | Yes (code bytes) |
| `swarm_agents.py` | 18KB | Agents + VotingMesh + BackoffNgramMixer | No (imported) |
| `kg_data.py` | 1KB | Compressed KG importance data | No (imported) |

## Test Plan

- [x] Seed 7: **0.3948** BPB, 15,940,706 bytes, eval 583s
- [x] Seed 1337: **0.3957** BPB, 15,943,009 bytes, eval 594s
- [x] Seed 2024: **0.3969** BPB, 15,957,577 bytes, eval 596s
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
"""Auto-generated KG importance data. Do not edit."""
KG_IMPORTANCE_B64 = "/Td6WFoAAATm1rRGAgAhARwAAAAQz1jM4AWZAptdADMAQN7VFifJpYP9M2+3RzRwlI23kNmAo30DBtfr3CUl2mbMFTqynLpKXMmUJl38JrUazafmN+ML1reirYszeABgzaMZKNapQNLOpnuhr+KnbuA6iEt+FzPb8UlXfnOMXTyWqZD4cAVo5hssRW/B0kA7c6JfgexAfopXlS2bP+/0JRDx5AFm+91YEJ/YtZ7bPOvDldkfhQslfTXgLJAO+VnMgwUlipppf8ippXc5ZbGNsx1xl+FBacfgF8AeCKqGOyyt3wTYCzyRGU324DwP8xy7uQxHr6WJqVWE3WJKvIJQLCphh53hsf1BrGgENqfim2urRxVGtVHpTdtaCN98BYcz6HIGwDB4jQJGZMnQbFxIQRrjrkjbYqJKlkWoGgSDw3SC89SaXUZzKBh9BkuwDXuJh8i7NL86+D+lZsKowB8dtnpl+1uZlJBzCESbZ8A1r62l72fzXlmunKEtzn2w+Tiq09+OIw7XNznLVBqM+KiEIUd3m/HPDfsB053ts+nFEWkWFtJAEO2DY8QWJlQSMFe9OTe5XkytPpBz4d9kWDjPe2RlkU0k5YWHTuyPVCk4s3Ogzf3B+DZtIKNnhgq2NM6wj00XJZDeWMyMxYOM9qYc5Age8ruwiuB1ZiaWC7UEpDOWOpnADxKjS4riNwk7fJx/yB6xwRNob1Gkjr7Xigf5ZW12sVexVW0ROfSCPRk8/xg3R3kik+8OfI4BzhTlIPFL6d0kQctznW5oryynRKqR7QiVKHQ7SrMJwTSA7dqsTm/8pFL2vJ51X9sxb0A/14eYw1VuVLe7knZyv7IE+KXI/hkhttG5YlBOQCq0uB5sDfhtWEeGfI+MKavNUpUiJrOpS7ipIkfhAtCK8PrMaGVKNqhi5GGG03LW7QkAAIxLLoBfvn1JAAG3BZoLAABBltLSscRn+wIAAAAABFla"
Original file line number Diff line number Diff line change
@@ -0,0 +1,107 @@
# Neural-only Ablation — Where the 0.3958 BPB comes from

This file decomposes the 0.3958 BPB submission into **(a) the trained neural
model** and **(b) the eval-time Causal BackoffNgramMixer**, using the exact
log lines from the three archived runs that produced `submission.json`.

**TL;DR:** the trained neural model by itself scores ~1.148 BPB. The same
model + `BackoffNgramMixer` at eval time scores 0.3958 BPB. The **~0.75
BPB improvement is entirely an eval-stage compression refinement**; no
training-objective change, no data leakage, no novel optimizer. This is
a direct descendant of already-merged #779 and #803.

## Per-seed ablation (from the archived run logs)

Source: `swarm_submission/run_final_seed{7,1337,2024}.log`, same runs that
populate `submission.json`.

| seed | post-EMA diagnostic<br>(neural, no quant, no mixer) | `final_int6_roundtrip`<br>(neural, int6 point eval) | `final_int6_sliding_window`<br>(neural + mixer, stride=64) |
|---|---|---|---|
| 7 | **1.1394** | **1.1481** | **0.3948** |
| 1337 | **1.1396** | **1.1480** | **0.3957** |
| 2024 | **1.1404** | **1.1492** | **0.3969** |
| **mean** | **1.1398** | **1.1484** | **0.3958** |

- `post-EMA diagnostic` = `train_gpt.py:1483` — the raw trained model's val_bpb on a standard non-sliding-window eval, taken immediately after EMA weight decay, before any quantization. This is the purest "neural only" number.
- `final_int6_roundtrip` = `train_gpt.py:1551` — same weights after int6 GPTQ-lite quantization + LZMA compression roundtrip, still no mixer, still point eval. ~0.009 BPB of quant noise vs the diagnostic.
- `final_int6_sliding_window` = `train_gpt.py:1577` — **same int6 weights**, sliding-window eval at stride=64, **with the mixer enabled**. No further training, no further weight changes.

**Mixer-attributed delta: 1.1484 − 0.3958 = 0.7526 BPB** (mean across seeds).

## Verbatim log excerpts

### seed 7 (`run_final_seed7.log`)
```
step:7024/20000 val_loss:1.9257 val_bpb:1.1405 train_time:600086ms step_avg:85.43ms
stopping_early: wallclock_cap train_time:600086ms step:7024/20000
DIAGNOSTIC post_ema val_loss:1.9239 val_bpb:1.1394 eval_time:1989ms
final_int6_roundtrip val_loss:1.9386 val_bpb:1.1481 eval_time:19276ms
final_int6_sliding_window val_loss:0.6667 val_bpb:0.3948 stride:64 eval_time:582774ms
final_int8_zlib_roundtrip_exact val_loss:0.66665722 val_bpb:0.39483300
```

### seed 1337 (`run_final_seed1337.log`)
```
DIAGNOSTIC post_ema val_loss:1.9241 val_bpb:1.1396 eval_time:1988ms
final_int6_roundtrip val_loss:1.9383 val_bpb:1.1480 eval_time:5946ms
final_int6_sliding_window val_loss:0.6681 val_bpb:0.3957 stride:64 eval_time:593857ms
final_int8_zlib_roundtrip_exact val_loss:0.66811451 val_bpb:0.39569610
```

### seed 2024 (`run_final_seed2024.log`)
```
DIAGNOSTIC post_ema val_loss:1.9254 val_bpb:1.1404 eval_time:2109ms
final_int6_roundtrip val_loss:1.9404 val_bpb:1.1492 eval_time:16040ms
final_int6_sliding_window val_loss:0.6701 val_bpb:0.3969 stride:64 eval_time:595814ms
final_int8_zlib_roundtrip_exact val_loss:0.67013029 val_bpb:0.39688996
```

## Mixer convergence curve (seed 7)

The mixer starts empty and accumulates n-gram counts in strict score-first
order as it walks the val stream. Running BPB across the eval (every ~128K
tokens of 969088 total):

| tokens scored | running bpb |
|---|---|
| 128 / 969088 | 1.175661 |
| 102528 / 969088 | 0.889010 |
| 230528 / 969088 | 0.643985 |
| 358528 / 969088 | 0.538056 |
| 486528 / 969088 | 0.483657 |
| 614528 / 969088 | 0.448113 |
| 742528 / 969088 | 0.423662 |
| 870528 / 969088 | 0.406234 |
| **969088 / 969088** | **0.394833** |

The first scored batch (128 tokens) is at 1.176 BPB — effectively the
neural-only floor since the mixer has no counts yet. As the mixer
accumulates counts from already-scored tokens, BPB drops monotonically
to 0.3948. **At no point does the mixer see a token before it is scored**
(see `train_gpt.py:876-935`, `eval_val_sliding` with mixer).

## Relationship to prior art

- **#779** — original `BackoffNgramMixer`, flat-hash design, entropy-adaptive alpha. Merged.
- **#803** — @pentxayc's Complementary Training + `BackoffNgramMixer` at 0.4416. Merged.
- **#1094 (this PR)** — same mixer family as #803, three orthogonal refinements:
1. Higher n-gram orders (2–10 vs 2–7)
2. 4.2M hash buckets per order (vs 1M)
3. Causal sequential chunk eval (score-first-per-batch, strictly backward-looking — `train_gpt.py:876-935`)

The 0.0458 improvement over #803 is an eval-stage refinement on top of a
legal, merged technique — not a new training method, not a new objective,
not a new dataset.

## Reproducibility

```bash
USE_NGRAM_MIXER=1 NGRAM_ORDER=10 NGRAM_BUCKETS=4194304 \
SEED=7 python train_gpt.py # expected: 0.3948 ± 0.001 BPB
SEED=1337 python train_gpt.py # expected: 0.3957 ± 0.001 BPB
SEED=2024 python train_gpt.py # expected: 0.3969 ± 0.001 BPB
```

3-seed mean 0.3958 BPB, std 0.0011, all under the 16 MB artifact cap
(15,943,009 / 15,940,706 / 15,957,577 bytes) and the 600 s eval cap
(583 / 594 / 596 s). See `submission.json`.
Original file line number Diff line number Diff line change
@@ -0,0 +1,23 @@
{
"author": "michaelwinczuk",
"github_id": "michaelwinczuk",
"val_bpb": 0.3958,
"val_bpb_std": 0.0011,
"seeds": {
"1337": 0.3957,
"7": 0.3948,
"2024": 0.3969
},
"artifact_bytes": {
"1337": 15943009,
"7": 15940706,
"2024": 15957577
},
"eval_time_seconds": {
"1337": 594,
"7": 583,
"2024": 596
},
"approach": "Causal BackoffNgramMixer with sliding-window eval",
"hardware": "8xH100 SXM"
}
Loading