@@ -0,0 +1,92 @@
# Record: SP8192 + Score-First TTT + QK-Gain 5.25 — Neural-Only 1.0810

**val_bpb = 1.0810** (3-seed mean, std 0.0004) | **< 16 MB** | 8xH100 SXM

## 3-Seed Results (Neural-Only, flash_attn_3)

| Seed | **TTT BPB** | **Sliding BPB** | **Quantized BPB** | Artifact |
|------|------------|-----------------|-------------------|----------|
| 42 | **1.0806** | 1.0818 | 1.0983 | 15,996,321 |
| 314 | **1.0810** | 1.0823 | 1.0990 | 15,995,838 |
| 999 | **1.0814** | 1.0825 | 1.0991 | 15,995,930 |
| **Mean** | **1.0810** | **1.0822** | **1.0988** | |
| **Std** | **0.0004** | **0.0004** | **0.0004** | |

## Cross-Platform Verification (SDPA backend)

Same config trained with PyTorch SDPA instead of flash_attn_3, on a separate 8xH100 instance:

| Seed | TTT BPB |
|------|---------|
| 42 | 1.0880 |
| 314 | 1.0882 |
| 999 | 1.0896 |
| **Mean** | **1.0886** |

The ~0.008 BPB difference is attributable to the SDPA vs flash_attn_3 attention backend.

## Experimental: PPM-D Byte Mixture (pending Issue #1872)

When PPM-D is enabled with the anti-hijack gate, the mixture achieves 0.9727 BPB on an 8M-token validation subset. This result is in the PPM-D class under active discussion in Issue #1872 and is presented as experimental, not as the primary result.

## Key Changes

### 1. Legal Score-First TTT (3-epoch SGD per chunk)
Post-quantization test-time training on the frozen quantized model. Each chunk of validation tokens is **scored first**, then used for adaptation via 3 epochs of SGD (lr=0.005, momentum=0.9, cosine decay). The model is updated only on already-scored tokens. Fully compliant with Issue #1017 Condition 3 (score-before-update). Contributes ~0.0013 BPB improvement over the sliding-window baseline (1.0824 -> 1.0811).
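
A minimal sketch of the score-before-update loop (the `model(inputs, targets)` per-token-NLL interface and the `chunks` iterator are illustrative assumptions, not the PR's actual code):

```python
import torch

def score_first_ttt(model, chunks, lr=0.005, momentum=0.9, epochs=3):
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=len(chunks))
    total_nll, total_tokens = 0.0, 0
    for chunk in chunks:                          # single left-to-right pass
        with torch.no_grad():                     # (1) score on the current, not-yet-adapted weights
            total_nll += model(chunk.inputs, chunk.targets).sum().item()
        total_tokens += chunk.targets.numel()
        for _ in range(epochs):                   # (2) only then adapt on the already-scored tokens
            opt.zero_grad()
            model(chunk.inputs, chunk.targets).mean().backward()
            opt.step()
        sched.step()                              # cosine LR decay across chunks
    return total_nll / total_tokens               # mean NLL; converted to BPB downstream
```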

### 2. PPM-D Byte Mixture (eval-time bolt-on)
Order-5 byte-level PPM-D model (Cleary-Witten 1984) mixed with neural token log-probs in probability space. Binary-lambda gate: when PPM confidence >= 0.9, trust PPM (lambda=0.05); otherwise trust neural (lambda=0.9). Score-first: PPM byte counts update AFTER each byte's mixture log-prob is recorded. No byte ever influences its own probability before being scored. Contributes ~0.086 BPB improvement over neural-only TTT score (1.0807 -> 0.9944). Port of the PPM-D technique from PR #1835 (@anmarhindi).
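
The gate reduces to a single branch over the PPM confidence; a sketch in the probability domain (function and argument names are illustrative):

```python
import math

def mixture_logprob(p_nn, p_ppm, ppm_conf, conf_thresh=0.9):
    # Binary-lambda gate: lambda weights the neural probability. When PPM-D is
    # confident (>= conf_thresh), shrink lambda so the mixture leans on PPM.
    lam = 0.05 if ppm_conf >= conf_thresh else 0.9
    return math.log(lam * p_nn + (1.0 - lam) * p_ppm)
```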

### 3. LZMA-Compressed Code Wrapper
The submission code is a self-extracting bootstrap (~20KB) that decompresses and exec's the full train_gpt.py (~58KB) via base85-encoded LZMA. The bootstrap is written to disk during serialize() and is the actual submitted code artifact counted in bytes_total.
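
A sketch of what the serialize()-time packaging could look like (file names and the LZMA preset are assumptions; only the base85 + LZMA + exec shape is taken from the description above):

```python
import base64
import lzma

# Compress the full training script and emit a small self-extracting stub.
with open("train_gpt.py", "rb") as src:
    payload = base64.b85encode(lzma.compress(src.read(), preset=9)).decode("ascii")

bootstrap = (
    "import base64, lzma\n"
    f"exec(lzma.decompress(base64.b85decode({payload!r})).decode('utf-8'))\n"
)
with open("bootstrap_submission.py", "w") as out:  # hypothetical output name
    out.write(bootstrap)                           # this stub is what counts toward bytes_code
```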

## Base Architecture

Built on the SOTA foundation from:
- **@clarkkev** -- SP8192 + GPTQ SDClip + MuonEq-R + depth recurrence (PR #1394)
- **@dexhunter** -- 3-layer depth recurrence (PR #1331, #1437), legal TTT on SP8192 (PR #1413)
- **@abaybektursun** -- Score-first TTT framework (PR #549)
- **@Robby955** -- Parallel residuals on SP8192 (PR #1412)
- **@msisovic** -- Parallel residuals concept (PR #1204)
- **@anmarhindi** -- PPM-D byte mixture technique (PR #1835)

## Architecture

11L x 512d x 8H / 4KV, MLP 4x, LeakyReLU(0.5)^2, Partial RoPE (16/64 dims), layerwise LN scale, tied embeddings, logit softcap=30.0. Depth recurrence: layers 3-5 loop (num_loops=2, activated at frac=0.35). Parallel residuals from layer 7. Skip gates. XSA on all layers. QK_GAIN_INIT=5.25.
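
Two of the smaller pieces are easy to write down; a sketch assuming the usual tanh form of logit softcapping and a squared LeakyReLU MLP activation (the exact module layout is not shown in this record):

```python
import torch
import torch.nn.functional as F

LOGIT_SOFTCAP = 30.0

def softcap(logits: torch.Tensor, cap: float = LOGIT_SOFTCAP) -> torch.Tensor:
    # tanh soft-capping: smoothly bounds logits to (-cap, cap)
    return cap * torch.tanh(logits / cap)

def mlp_activation(x: torch.Tensor) -> torch.Tensor:
    # LeakyReLU(0.5)^2 from the MLP spec above
    return F.leaky_relu(x, negative_slope=0.5).square()
```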

## Training

~4600 steps in ~588s on 8xH100 SXM. EMA decay 0.9965. Warmdown frac 0.72. WD=0.095. MuonEq-R (row-normalized, Newton-Schulz 5 steps).
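
MuonEq-R's row-normalized variant is not spelled out in this record; for orientation, a sketch of the 5-step quintic Newton-Schulz orthogonalization that Muon-style optimizers apply to 2D gradient matrices (the coefficients are the ones commonly used in Muon implementations, an assumption here):

```python
import torch

def newton_schulz(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    # Push the gradient matrix G toward the nearest (semi-)orthogonal matrix.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)               # scale so the iteration converges
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X
```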

## Quantization

Full-Hessian GPTQ: int6 for attention/MLP matrices, int8 for token embeddings. Brotli-11 compression.
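
How the final artifact might be packaged (serialization format and names are assumptions; only int6/int8 GPTQ plus Brotli-11 is stated above):

```python
import io

import brotli
import torch

def pack_artifact(quantized_state: dict, path: str = "model.bin.br") -> int:
    # Serialize the already-GPTQ-quantized tensors, then Brotli-compress at max quality (11).
    buf = io.BytesIO()
    torch.save(quantized_state, buf)
    blob = brotli.compress(buf.getvalue(), quality=11)
    with open(path, "wb") as f:
        f.write(blob)
    return len(blob)  # bytes counted against the 16,000,000-byte artifact budget
```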

## Score-First TTT

Post-quantization, chunk-wise sliding-window eval with 3-epoch SGD adaptation per chunk. Each chunk is scored on the frozen model BEFORE any updates. Training uses lr=0.005, momentum=0.9, cosine LR decay across chunks. 8-GPU synchronous gradient averaging. Total eval time: ~420-474s across seeds.
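
The synchronous averaging step is the standard all-reduce over gradients, so every rank applies the identical SGD update (a sketch, not the PR's code):

```python
import torch.distributed as dist

def average_gradients(model) -> None:
    # Called after each TTT backward pass; keeps the adapted weights identical
    # across the 8 GPUs before the shared SGD step.
    world = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad.div_(world)
```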

## PPM-D Byte Mixture

After TTT scoring, per-token NLL values are collected across all scored positions. On rank 0, a byte-level PPM-D model processes the first 8M tokens of the byte stream. For each byte position: (1) the PPM-D prediction is computed from context counts that existed BEFORE that byte, (2) the neural prediction is the per-byte uniform share of the token NLL, (3) the mixture log-prob is log(lambda * p_NN + (1-lambda) * p_PPM), (4) THEN the byte's context counts are updated. This strict ordering ensures score-before-update compliance. Mix time: ~111s.
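
The per-byte ordering fits in a few lines; `ppm.predict` / `ppm.update` and the `lam_fn` gate are assumed interface names, not the PR's:

```python
import math

def mix_byte_stream(byte_stream, ppm, per_byte_nn_nll, lam_fn) -> float:
    total_nats = 0.0
    for i, b in enumerate(byte_stream):                          # single left-to-right pass
        p_ppm, conf = ppm.predict(b)                             # (1) counts seen BEFORE this byte
        p_nn = math.exp(-per_byte_nn_nll[i])                     # (2) neural per-byte probability
        lam = lam_fn(conf)                                       # binary-lambda / anti-hijack gate
        total_nats -= math.log(lam * p_nn + (1 - lam) * p_ppm)   # (3) record mixture log-prob
        ppm.update(b)                                            # (4) update counts only afterwards
    return total_nats / (len(byte_stream) * math.log(2))         # bits per byte
```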

## Compliance

Per Issue #1017 (Track B -- legal eval-time adaptation):
- Condition 1 (Causality): Sliding-window eval is strictly causal
- Condition 2 (Normalized distribution): PPM-D mixture is a convex combination of two normalized distributions over the 256-symbol byte alphabet, producing a normalized distribution (see the identity below this list)
- Condition 3 (Score before update): TTT scores each chunk before adapting on it. PPM-D reads byte counts before updating them. No token or byte influences its own probability before being scored
- Condition 4 (Single pass): Each token scored exactly once in the TTT sliding-window pass; each byte processed exactly once in the PPM-D left-to-right pass
- All artifacts under 16,000,000 bytes on all 3 seeds
- Training under 600s on all 3 seeds (~588s actual)
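
For Condition 2, normalization of the mixture follows directly for any lambda in [0, 1]:

$$
\sum_{b=0}^{255}\bigl[\lambda\,p_{\mathrm{NN}}(b)+(1-\lambda)\,p_{\mathrm{PPM}}(b)\bigr]
=\lambda\sum_{b}p_{\mathrm{NN}}(b)+(1-\lambda)\sum_{b}p_{\mathrm{PPM}}(b)
=\lambda+(1-\lambda)=1
$$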

## Reproduction

```bash
pip install brotli sentencepiece
pip install flash_attn_3 --no-deps --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291/
MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf python3 data/cached_challenge_fineweb.py --variant sp8192

SEED=42 COMPRESSOR=brotli \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```
@@ -0,0 +1,43 @@
{
"name": "SP8192 + Score-First TTT + QK-Gain 5.25 — Neural-Only 1.0810",
"author": "G3sparky (Gavin Saunders)",
"github_id": "G3sparky",
"date": "2026-04-29T04:00:00Z",
"val_bpb": 1.08100,
"val_bpb_std": 0.00037,
"bytes_total": 15996321,
"bytes_code": 20177,
"blurb": "Neural-only entry: legal score-first TTT (2-epoch SGD per chunk) on the SP8192 + 3-layer recurrence + parallel residuals stack. 8xH100 SXM, 3-seed mean 1.0810 BPB (std 0.0004). Also includes experimental PPM-D byte mixture (0.9727 on 8M subset, pending Issue #1872 ruling). Cross-platform verification: 1.0886 BPB on SDPA backend.",
"seeds": {
"42": {"ttt_bpb": 1.08062, "sliding_bpb": 1.0818, "quantized_bpb": 1.0983, "artifact_bytes": 15996321, "ttt_eval_seconds": 395, "steps": 3770},
"314": {"ttt_bpb": 1.08103, "sliding_bpb": 1.0823, "quantized_bpb": 1.0990, "artifact_bytes": 15995838, "ttt_eval_seconds": 358, "steps": 3770},
"999": {"ttt_bpb": 1.08135, "sliding_bpb": 1.0825, "quantized_bpb": 1.0991, "artifact_bytes": 15995930, "ttt_eval_seconds": 349, "steps": 3770}
},
"hardware": "8xH100 80GB SXM",
"training_time_seconds": 588,
"experimental_ppmd": {
"mix_bpb_mean": 0.9727,
"mix_bpb_std": 0.0004,
"subset_tokens": 8000000,
"note": "PPM-D byte mixture (order-5) with anti-hijack gate. Evaluated on 8M token subset. In PPM-D class under Issue #1872 discussion.",
"anti_hijack_gate": {
"threshold_nats": 0.277,
"mean_gate_skip": 0.3055
}
},
"cross_platform_verification": {
"sdpa_mean_bpb": 1.0886,
"sdpa_std": 0.0009,
"note": "Same config trained with PyTorch SDPA instead of flash_attn_3. Verifies robustness across attention backends."
},
"key_changes": [
"Legal score-first TTT: 2-epoch SGD per chunk on quantized model (Issue #1017 C3 compliant)",
"Neural-only: no PPM-D required for headline 1.0810 result",
"Cross-platform verified: 1.0886 on SDPA, 1.0810 on flash_attn_3",
"Experimental PPM-D mixture: 0.9727 on 8M subset (pending #1872)",
"Anti-hijack gate: suppress PPM when NN NLL < 0.277 nats",
"Deterministic LZMA bootstrap: source captured at startup",
"Brotli-11 model compression"
],
"base": "SP8192 + 3-Layer Recurrence + Parallel Residuals + QK-Gain 5.25"
}