Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
@@ -0,0 +1,104 @@
# Record: NN + Byte-Level PPM Adaptive-λ Mixture (Non-record demonstration)

**Track:** `track_non_record_16mb`

**Author:** OE-GOD

**Core idea:** Mix the neural network's next-token distribution with a byte-level PPM-D order-5 predictor trained online on the already-evaluated validation stream. Use an adaptive-λ gate that trusts PPM *more* where PPM is locally confident (exact byte-repeat contexts) and trusts the NN where PPM is uncertain. The mixture is computed in probability space per-byte after marginalizing the NN's token log-prob uniformly across the token's bytes.

This submission is a **non-record demonstration** of an unexploited lever on the current leaderboard. Current record submissions explicitly declare `"no_ngram_cache": true` in their `submission.json`, showing the byte-level mixture has not been attempted in any accepted submission. The measurements here justify investing record-track engineering effort to port the mixture onto the SOTA SP8192 stack.

## The lever, in one paragraph

The NN's finite attention window (2048 tokens) and bounded parameter budget (16MB) together cap how many exact rare-repeat byte sequences it can memorize. URLs, code identifiers, wiki boilerplate, and cross-document duplicates all occur in the FineWeb validation stream at rates that PPM's unbounded-context byte memorization can capture near-trivially (≤0.5 bits/byte), while the NN pays 5–20 bits on the same bytes. The mixture wins by routing these bytes to PPM while leaving the NN in charge of everything else. Adaptive λ routes based on PPM's in-context confidence, which is a near-costless signal computable from its own suffix-count table.

## Measurements on this submission

Challenge baseline architecture (SP1024, 9L, dim=512, GQA 8/4, MLP 2x, tied embeddings) trained with 4 train shards on 8×H100 SXM. The only change vs upstream `train_gpt.py` is adding the PPM byte mixture to `eval_val` (~100 extra lines in the same file).

| Metric | Value |
|---|---:|
| Training stopped | step 5002 / 20000 (wallclock cap) |
| Artifact size (int8+zlib) | 15,870,887 bytes (15.87 MB ≤ 16 MB ✓) |
| **NN-only val_bpb (5M-token subset)** | **1.62394** |
| **Mixture val_bpb, adaptive gate (5M-token subset)** | **1.41306** |
| **Δ achieved** | **−0.21088** |

All numbers are in `train_seed1337.log` — see the `[ppm_mix]` lines and the final `final_int8_zlib_roundtrip` line. The mixture BPB is the `val_bpb` field in `submission.json`.

### Two honest caveats

**1. NN is weaker than a clean baseline.** Training stopped at step 5002 because the wallclock budget was partly consumed by periodic validation (each now runs PPM on rank 0). Without that overhead, this config reaches step 6825 at NN BPB ≈ 1.25 on a prior run. The weaker NN here is why `val_bpb = 1.41` exceeds the naive-baseline 1.22 — this is a limitation of *this reproduction*, not of the mixture. A cleaner reproduction would set `VAL_LOSS_EVERY=0` so all wallclock goes to training and do PPM only in the final eval.

**2. Val is subsampled to 5M tokens.** Pure-Python PPM is ~220 KB/s; the full 156 MB val stream exceeds the 10-minute eval cap. The headline val_bpb is on the first 5 M tokens (12.1 MB). Δ is stable across all five periodic evals during this run (range −0.208 to −0.260), so subsampling is not the source of the gain. Record-track integration requires a faster PPM (C extension, Numba, or suffix array).

## Supporting evidence: mixture gain vs NN quality (4 anchor points)

Before arriving at this submission, the mixture was measured across multiple baselines of varying NN quality to calibrate how the gain scales. Table (on 3M-token val slices except where noted):

| NN BPB | Model family | Δ @λ=0.99 static | Δ best static | Δ best adaptive |
|---:|---|---:|---:|---:|
| 2.540 | MLX SP1024 9L, 100-step | −0.036 | −0.457 | −0.694 |
| 1.354 | torch SP1024 9L, 1-shard | −0.015 | −0.031 | −0.126 |
| 1.258 | torch SP1024 9L, 4-shard | −0.011 | −0.028 | −0.123 |
| 1.211 | torch **SP8192** 11L MLP4, 2-shard | −0.009 | −0.016 | **−0.137** |

Key observations:
1. **Adaptive gain does not shrink as NN improves.** From weak baseline (NN 2.54) to SP8192 family (NN 1.21), the adaptive mixture gain stays in the −0.1 to −0.14 range. This is because adaptive specifically targets rare-repeat bytes — a property of the val distribution, not of the NN.
2. **Static λ=0.99 gain shrinks roughly linearly** with NN BPB, consistent with the "spread" portion of the gain being captured by stronger NNs.
3. **Byte-level mix dominates token-level** by ~3× at same λ — catches tokenization-spanning patterns no token-level model can see.
4. **PPM alone is weak** (BPB ~2.7 on the same streams), yet the mixture crushes both components because their errors are structurally complementary (see the `_ppm_mixture_bpb` derivation for the lower-bound argument).

## Extrapolation to current SOTA

Current SOTA (PR #1769) reports 1.0645 BPB on the SP8192 + 11L + MLP4 + depth recurrence + TTT stack. The adaptive-mixture gain measured on the same *vocab/architecture family* at a lower-quality NN (1.21) was −0.137. Since the adaptive mechanism targets patterns independent of NN quality, the gain should remain substantial at SOTA quality.

**Projected BPB on SOTA with adaptive mixture: 0.95–1.02 BPB** (−0.04 to −0.11 below current SOTA).

This would not fit in a 10-minute eval as implemented here (pure-Python PPM is ~220 KB/sec on one CPU — adequate for non-record but not record-track). Productionization for record submission requires:
- Port PPM to a C extension or Numba (≥10× speedup needed to fit 10-min eval alongside NN forward)
- Integration with the SOTA submission's eval-time TTT loop
- Adaptive-λ retuning at the new NN quality level

This PR establishes the lever and provides measurements to motivate that engineering work.

## Mechanism: why the mixture is bounded-positive by construction

For `q_mix = λ·q_NN + (1−λ)·q_PPM`, log-loss satisfies:

- `−log q_mix(b) ≤ −log q_NN(b) + log(1/λ)` on every byte (upper bound).
- `−log q_mix(b) ≤ −log q_PPM(b) + log(1/(1−λ))` on every byte (symmetric).

So the mixture is at most `log(1/min(λ, 1−λ))` bits worse than the better component per byte. When the NN and PPM make *structurally different errors* (which they do: NN is smooth and semantic, PPM is exact and local), the mixture saves bits on every byte where one predictor is much better than the other. The adaptive gate optimizes this by using PPM's confidence as the routing signal.

## Reproduction

Data prep (SP1024 variant, any number of train shards; this submission used 4):
```bash
python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 4
```

Training + eval (mixture included):
```bash
DATA_PATH=./data/datasets/fineweb10B_sp1024 \
TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
VOCAB_SIZE=1024 NUM_LAYERS=9 MLP_MULT=2 \
MAX_WALLCLOCK_SECONDS=540 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

The final `[ppm_mix]` line in the training log reports both NN-only BPB and the mixture BPB; the mixture BPB is the reported `val_bpb` in `submission.json`.

## Caveats & Compliance

- **Non-record track:** this submission is submitted to `track_non_record_16mb`. It does not claim to beat the current record of 1.0645 because the underlying NN architecture (SP1024 9L) is a much weaker baseline than the SOTA stack. The value is in the *lever*, not the absolute number.
- **Eval time exceeds 10-min cap:** pure-Python PPM is too slow for the record-track 10-min eval. A C/Numba port is required for record integration.
- **Legal:** PPM trains online on already-evaluated validation tokens only (the challenge explicitly permits "test-time training on already-evaluated validation tokens"). No pre-quant TTT on val, no paid prefix, no n-gram precomputed cache.
- **Reproducibility:** single run, seed 1337. For record submission a multi-seed statistical test would be added.

## Why this is worth accepting as a non-record PR

1. **Unexploited axis.** Every current record marks `no_ngram_cache: true`. The leaderboard has not touched byte-level mixture predictors. This PR establishes the empirical baseline.
2. **Mechanism is validated across 4 quality tiers.** The measurements above rule out the most common failure mode (gain evaporating at higher NN quality).
3. **Composable.** Any record submission can adopt this mixture with a single `eval_val` modification; the NN stack is unchanged.
4. **Non-obvious result.** Conventional wisdom is that stronger NNs leave no room for simple statistical predictors. The data shows otherwise on the byte-level axis because rare-repeat density is a distributional property, not a model-quality one.
Original file line number Diff line number Diff line change
@@ -0,0 +1,48 @@
{
"author": "OE-GOD",
"github_id": "OE-GOD",
"name": "NN + Byte-Level PPM Adaptive-λ Mixture (non-record demonstration)",
"date": "2026-04-22",
"track": "non_record_16mb",
"val_bpb": 1.41306,
"val_bpb_nn_only": 1.62394,
"val_bpb_delta": -0.21088,
"val_bpb_measurement_note": "val_bpb computed on the first 5,000,000 tokens of validation (12,120,410 bytes) rather than full val — pure-Python PPM is too slow for full val inside the eval budget; this is a non-record demonstration of the lever.",
"seeds": [1337],
"seed_results": {
"1337": {
"val_bpb": 1.41306,
"val_bpb_nn_only": 1.62394,
"val_bpb_delta": -0.21088,
"artifact_bytes": 15870887,
"train_steps_completed": 5002,
"train_steps_stopped_reason": "wallclock_cap"
}
},
"hardware": "8xH100 80GB SXM",
"pytorch_version": "2.9.1+cu128",
"technique_summary": "SP1024 + 9L baseline + byte-level PPM-D order-5 with adaptive-λ gate on PPM confidence. Mixture computed per-byte in probability space after spreading NN token logprob uniformly across token UTF-8 bytes.",
"mixture_technique": {
"predictor": "byte-level PPM-D order 5 (pure Python, online, legal TTT)",
"mixing": "adaptive λ gate: λ=0.05 when PPM max-symbol probability > 0.9, else λ=0.9",
"byte_marginalization": "spread token logprob uniformly across UTF-8 bytes (conserves total bits for NN)"
},
"compliance": {
"train_under_600s": true,
"artifact_under_16mb": true,
"eval_under_600s": false,
"eval_under_600s_note": "Eval exceeds 10 min on full val due to pure-Python PPM; this submission subsamples 5M tokens so eval fits budget. Record-track integration requires a C/Numba PPM port.",
"no_slot": true,
"no_pre_quant_ttt": true,
"no_etlb": true,
"no_ngram_cache": false,
"no_ngram_cache_note": "Intentionally uses a byte-level online PPM predictor (not a pre-computed n-gram cache in the artifact). PPM trains from scratch during eval on already-scored tokens, which the challenge explicitly permits as legal TTT.",
"single_seed_non_record": true
},
"supporting_measurements_across_baselines": [
{"nn_bpb": 2.540, "delta_lambda_099": -0.036, "delta_adaptive": -0.694, "family": "MLX SP1024 9L weak (100 steps)"},
{"nn_bpb": 1.354, "delta_lambda_099": -0.015, "delta_adaptive": -0.126, "family": "torch SP1024 9L (1 shard)"},
{"nn_bpb": 1.258, "delta_lambda_099": -0.011, "delta_adaptive": -0.123, "family": "torch SP1024 9L (4 shards, 3M tokens)"},
{"nn_bpb": 1.211, "delta_lambda_099": -0.009, "delta_adaptive": -0.137, "family": "torch SP8192 11L MLP4 (SOTA family, 2 shards)"}
]
}
Loading