@@ -0,0 +1,92 @@
# Record: SP8192 + Score-First TTT + QK-Gain 5.25 — Neural-Only 1.0810

**val_bpb = 1.0810** (3-seed mean, std 0.0004) | **< 16 MB** | 8xH100 SXM

## 3-Seed Results (Neural-Only, flash_attn_3)

| Seed | **TTT BPB** | **Sliding BPB** | **Quantized BPB** | Artifact |
|------|------------|-----------------|-------------------|----------|
| 42 | **1.0806** | 1.0818 | 1.0983 | 15,996,321 |
| 314 | **1.0810** | 1.0823 | 1.0990 | 15,995,838 |
| 999 | **1.0814** | 1.0825 | 1.0991 | 15,995,930 |
| **Mean** | **1.0810** | **1.0822** | **1.0988** | |
| **Std** | **0.0004** | **0.0004** | **0.0004** | |

## Cross-Platform Verification (SDPA backend)

Same config trained with PyTorch SDPA instead of flash_attn_3, on a separate 8xH100 instance:

| Seed | TTT BPB |
|------|---------|
| 42 | 1.0880 |
| 314 | 1.0882 |
| 999 | 1.0896 |
| **Mean** | **1.0886** |

The ~0.008 BPB difference is attributable to the SDPA vs flash_attn_3 attention backend.

## Experimental: PPM-D Byte Mixture (pending Issue #1872)

When PPM-D is enabled with the anti-hijack gate, the mixture achieves 0.9727 BPB on an 8M-token validation subset. This result is in the PPM-D class under active discussion in Issue #1872 and is presented as experimental, not as the primary result.

## Key Changes

### 1. Legal Score-First TTT (3-epoch SGD per chunk)
Post-quantization test-time training on the frozen quantized model. Each chunk of validation tokens is **scored first**, then used for adaptation via 3 epochs of SGD (lr=0.005, momentum=0.9, cosine decay). The model is updated only on already-scored tokens. Fully compliant with Issue #1017 Condition 3 (score-before-update). Contributes ~0.0013 BPB improvement over the sliding-window baseline (1.0824 -> 1.0811).
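
A minimal sketch of the score-before-update loop (the `model(inputs, targets)` per-token-NLL interface and the `chunks` iterator are illustrative assumptions, not the PR's actual code):

```python
import torch

def score_first_ttt(model, chunks, lr=0.005, momentum=0.9, epochs=3):
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=len(chunks))
    total_nll, total_tokens = 0.0, 0
    for chunk in chunks:                          # single left-to-right pass
        with torch.no_grad():                     # (1) score on the current, not-yet-adapted weights
            total_nll += model(chunk.inputs, chunk.targets).sum().item()
        total_tokens += chunk.targets.numel()
        for _ in range(epochs):                   # (2) only then adapt on the already-scored tokens
            opt.zero_grad()
            model(chunk.inputs, chunk.targets).mean().backward()
            opt.step()
        sched.step()                              # cosine LR decay across chunks
    return total_nll / total_tokens               # mean NLL; converted to BPB downstream
```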

### 2. PPM-D Byte Mixture (eval-time bolt-on)
Order-5 byte-level PPM-D model (Cleary-Witten 1984) mixed with neural token log-probs in probability space. Binary-lambda gate: when PPM confidence >= 0.9, trust PPM (lambda=0.05); otherwise trust neural (lambda=0.9). Score-first: PPM byte counts update AFTER each byte's mixture log-prob is recorded. No byte ever influences its own probability before being scored. Contributes ~0.086 BPB improvement over neural-only TTT score (1.0807 -> 0.9944). Port of the PPM-D technique from PR #1835 (@anmarhindi).
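
The gate reduces to a single branch over the PPM confidence; a sketch in the probability domain (function and argument names are illustrative):

```python
import math

def mixture_logprob(p_nn, p_ppm, ppm_conf, conf_thresh=0.9):
    # Binary-lambda gate: lambda weights the neural probability. When PPM-D is
    # confident (>= conf_thresh), shrink lambda so the mixture leans on PPM.
    lam = 0.05 if ppm_conf >= conf_thresh else 0.9
    return math.log(lam * p_nn + (1.0 - lam) * p_ppm)
```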

### 3. LZMA-Compressed Code Wrapper
The submission code is a self-extracting bootstrap (~20KB) that decompresses and exec's the full train_gpt.py (~58KB) via base85-encoded LZMA. The bootstrap is written to disk during serialize() and is the actual submitted code artifact counted in bytes_total.
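
A sketch of what the serialize()-time packaging could look like (file names and the LZMA preset are assumptions; only the base85 + LZMA + exec shape is taken from the description above):

```python
import base64
import lzma

# Compress the full training script and emit a small self-extracting stub.
with open("train_gpt.py", "rb") as src:
    payload = base64.b85encode(lzma.compress(src.read(), preset=9)).decode("ascii")

bootstrap = (
    "import base64, lzma\n"
    f"exec(lzma.decompress(base64.b85decode({payload!r})).decode('utf-8'))\n"
)
with open("bootstrap_submission.py", "w") as out:  # hypothetical output name
    out.write(bootstrap)                           # this stub is what counts toward bytes_code
```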

## Base Architecture

Built on the SOTA foundation from:
- **@clarkkev** -- SP8192 + GPTQ SDClip + MuonEq-R + depth recurrence (PR #1394)
- **@dexhunter** -- 3-layer depth recurrence (PR #1331, #1437), legal TTT on SP8192 (PR #1413)
- **@abaybektursun** -- Score-first TTT framework (PR #549)
- **@Robby955** -- Parallel residuals on SP8192 (PR #1412)
- **@msisovic** -- Parallel residuals concept (PR #1204)
- **@anmarhindi** -- PPM-D byte mixture technique (PR #1835)

## Architecture

11L x 512d x 8H / 4KV, MLP 4x, LeakyReLU(0.5)^2, Partial RoPE (16/64 dims), layerwise LN scale, tied embeddings, logit softcap=30.0. Depth recurrence: layers 3-5 loop (num_loops=2, activated at frac=0.35). Parallel residuals from layer 7. Skip gates. XSA on all layers. QK_GAIN_INIT=5.25.
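
Two of the smaller pieces are easy to write down; a sketch assuming the usual tanh form of logit softcapping and a squared LeakyReLU MLP activation (the exact module layout is not shown in this record):

```python
import torch
import torch.nn.functional as F

LOGIT_SOFTCAP = 30.0

def softcap(logits: torch.Tensor, cap: float = LOGIT_SOFTCAP) -> torch.Tensor:
    # tanh soft-capping: smoothly bounds logits to (-cap, cap)
    return cap * torch.tanh(logits / cap)

def mlp_activation(x: torch.Tensor) -> torch.Tensor:
    # LeakyReLU(0.5)^2 from the MLP spec above
    return F.leaky_relu(x, negative_slope=0.5).square()
```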

## Training

~4600 steps in ~588s on 8xH100 SXM. EMA decay 0.9965. Warmdown frac 0.72. WD=0.095. MuonEq-R (row-normalized, Newton-Schulz 5 steps).
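
MuonEq-R's row-normalized variant is not spelled out in this record; for orientation, a sketch of the 5-step quintic Newton-Schulz orthogonalization that Muon-style optimizers apply to 2D gradient matrices (the coefficients are the ones commonly used in Muon implementations, an assumption here):

```python
import torch

def newton_schulz(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    # Push the gradient matrix G toward the nearest (semi-)orthogonal matrix.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)               # scale so the iteration converges
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X
```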

## Quantization

Full-Hessian GPTQ: int6 for attention/MLP matrices, int8 for token embeddings. Brotli-11 compression.
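
How the final artifact might be packaged (serialization format and names are assumptions; only int6/int8 GPTQ plus Brotli-11 is stated above):

```python
import io

import brotli
import torch

def pack_artifact(quantized_state: dict, path: str = "model.bin.br") -> int:
    # Serialize the already-GPTQ-quantized tensors, then Brotli-compress at max quality (11).
    buf = io.BytesIO()
    torch.save(quantized_state, buf)
    blob = brotli.compress(buf.getvalue(), quality=11)
    with open(path, "wb") as f:
        f.write(blob)
    return len(blob)  # bytes counted against the 16,000,000-byte artifact budget
```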

## Score-First TTT

Post-quantization, chunk-wise sliding-window eval with 3-epoch SGD adaptation per chunk. Each chunk is scored on the frozen model BEFORE any updates. Training uses lr=0.005, momentum=0.9, cosine LR decay across chunks. 8-GPU synchronous gradient averaging. Total eval time: ~420-474s across seeds.
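
The synchronous averaging step is the standard all-reduce over gradients, so every rank applies the identical SGD update (a sketch, not the PR's code):

```python
import torch.distributed as dist

def average_gradients(model) -> None:
    # Called after each TTT backward pass; keeps the adapted weights identical
    # across the 8 GPUs before the shared SGD step.
    world = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad.div_(world)
```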

## PPM-D Byte Mixture

After TTT scoring, per-token NLL values are collected across all scored positions. On rank 0, a byte-level PPM-D model processes the first 8M tokens of the byte stream. For each byte position: (1) the PPM-D prediction is computed from context counts that existed BEFORE that byte, (2) the neural prediction is the per-byte uniform share of the token NLL, (3) the mixture log-prob is log(lambda * p_NN + (1-lambda) * p_PPM), (4) THEN the byte's context counts are updated. This strict ordering ensures score-before-update compliance. Mix time: ~111s.
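
The per-byte ordering fits in a few lines; `ppm.predict` / `ppm.update` and the `lam_fn` gate are assumed interface names, not the PR's:

```python
import math

def mix_byte_stream(byte_stream, ppm, per_byte_nn_nll, lam_fn) -> float:
    total_nats = 0.0
    for i, b in enumerate(byte_stream):                          # single left-to-right pass
        p_ppm, conf = ppm.predict(b)                             # (1) counts seen BEFORE this byte
        p_nn = math.exp(-per_byte_nn_nll[i])                     # (2) neural per-byte probability
        lam = lam_fn(conf)                                       # binary-lambda / anti-hijack gate
        total_nats -= math.log(lam * p_nn + (1 - lam) * p_ppm)   # (3) record mixture log-prob
        ppm.update(b)                                            # (4) update counts only afterwards
    return total_nats / (len(byte_stream) * math.log(2))         # bits per byte
```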

## Compliance

Per Issue #1017 (Track B -- legal eval-time adaptation):
- Condition 1 (Causality): Sliding-window eval is strictly causal
- Condition 2 (Normalized distribution): PPM-D mixture is a convex combination of two normalized distributions over the 256-symbol byte alphabet, producing a normalized distribution (see the identity below this list)
- Condition 3 (Score before update): TTT scores each chunk before adapting on it. PPM-D reads byte counts before updating them. No token or byte influences its own probability before being scored
- Condition 4 (Single pass): Each token scored exactly once in the TTT sliding-window pass; each byte processed exactly once in the PPM-D left-to-right pass
- All artifacts under 16,000,000 bytes on all 3 seeds
- Training under 600s on all 3 seeds (~588s actual)
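
For Condition 2, normalization of the mixture follows directly for any lambda in [0, 1]:

$$
\sum_{b=0}^{255}\bigl[\lambda\,p_{\mathrm{NN}}(b)+(1-\lambda)\,p_{\mathrm{PPM}}(b)\bigr]
=\lambda\sum_{b}p_{\mathrm{NN}}(b)+(1-\lambda)\sum_{b}p_{\mathrm{PPM}}(b)
=\lambda+(1-\lambda)=1
$$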

## Reproduction

```bash
pip install brotli sentencepiece
pip install flash_attn_3 --no-deps --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291/
MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf python3 data/cached_challenge_fineweb.py --variant sp8192

SEED=42 COMPRESSOR=brotli \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```
@@ -0,0 +1,43 @@
{
"name": "SP8192 + Score-First TTT + QK-Gain 5.25 — Neural-Only 1.0810",
"author": "G3sparky (Gavin Saunders)",
"github_id": "G3sparky",
"date": "2026-04-29T04:00:00Z",
"val_bpb": 1.08100,
"val_bpb_std": 0.00037,
"bytes_total": 15996321,
"bytes_code": 20177,
"blurb": "Neural-only entry: legal score-first TTT (2-epoch SGD per chunk) on the SP8192 + 3-layer recurrence + parallel residuals stack. 8xH100 SXM, 3-seed mean 1.0810 BPB (std 0.0004). Also includes experimental PPM-D byte mixture (0.9727 on 8M subset, pending Issue #1872 ruling). Cross-platform verification: 1.0886 BPB on SDPA backend.",
"seeds": {
"42": {"ttt_bpb": 1.08062, "sliding_bpb": 1.0818, "quantized_bpb": 1.0983, "artifact_bytes": 15996321, "ttt_eval_seconds": 395, "steps": 3770},
"314": {"ttt_bpb": 1.08103, "sliding_bpb": 1.0823, "quantized_bpb": 1.0990, "artifact_bytes": 15995838, "ttt_eval_seconds": 358, "steps": 3770},
"999": {"ttt_bpb": 1.08135, "sliding_bpb": 1.0825, "quantized_bpb": 1.0991, "artifact_bytes": 15995930, "ttt_eval_seconds": 349, "steps": 3770}
},
"hardware": "8xH100 80GB SXM",
"training_time_seconds": 588,
"experimental_ppmd": {
"mix_bpb_mean": 0.9727,
"mix_bpb_std": 0.0004,
"subset_tokens": 8000000,
"note": "PPM-D byte mixture (order-5) with anti-hijack gate. Evaluated on 8M token subset. In PPM-D class under Issue #1872 discussion.",
"anti_hijack_gate": {
"threshold_nats": 0.277,
"mean_gate_skip": 0.3055
}
},
"cross_platform_verification": {
"sdpa_mean_bpb": 1.0886,
"sdpa_std": 0.0009,
"note": "Same config trained with PyTorch SDPA instead of flash_attn_3. Verifies robustness across attention backends."
},
"key_changes": [
"Legal score-first TTT: 2-epoch SGD per chunk on quantized model (Issue #1017 C3 compliant)",
"Neural-only: no PPM-D required for headline 1.0810 result",
"Cross-platform verified: 1.0886 on SDPA, 1.0810 on flash_attn_3",
"Experimental PPM-D mixture: 0.9727 on 8M subset (pending #1872)",
"Anti-hijack gate: suppress PPM when NN NLL < 0.277 nats",
"Deterministic LZMA bootstrap: source captured at startup",
"Brotli-11 model compression"
],
"base": "SP8192 + 3-Layer Recurrence + Parallel Residuals + QK-Gain 5.25"
}