# Record: SP4096 + Compressibility Regularization

**val_bpb: 1.11349** (6-seed mean, std 0.00053) | **~15.68 MB** | 8xH100 SXM, 600s | No TTT

## Results

| Seed | Steps | ms/step | Pre-quant BPB | **Sliding BPB** | Artifact (bytes) | Pruning |
|------|-------|---------|---------------|-----------------|----------|---------|
| 314 | 6,699 | 89 | 1.1260 | **1.11410** | 15,665,083 | 0% |
| 42 | 6,664 | 90 | 1.1261 | **1.11418** | 15,667,940 | 0% |
| 999 | 6,659 | 90 | 1.1255 | **1.11348** | 15,697,830 | 0% |
| 1337 | 6,658 | 90 | 1.1253 | **1.11307** | 15,660,616 | 0% |
| 2024 | 6,664 | 90 | 1.1261 | **1.11306** | 15,693,397 | 0% |
| 7 | 6,659 | 90 | 1.1255 | **1.11305** | 15,686,495 | 0% |
| **Mean** | | | | **1.11349** | | |

Exact 6-seed mean: **1.11348911 BPB**. Current merged SOTA (PR #1019) exact 3-seed mean: **1.11473509 BPB**. Welch's t-test: **t = -4.19**, **df = 6.6**, **p = 0.00289** (one-sided).

No TTT, no n-gram cache, no eval-time logit bias. All gains are from training-side changes.

---

## Changes

Three changes to the PR #1019 base:

### 1. SP4096 Tokenizer

Vocabulary size increases from 1024 (SP1024) to 4096 (SP4096). Tokens-per-byte drops from ~0.41 to ~0.30 on the shared validation text (see `data_lineage.md`), so each training step covers more bytes of context. The tied embedding grows from 1024x512 to 4096x512, adding ~1.1MB to the artifact.
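
A back-of-envelope for the embedding cost (a sketch, assuming the extra rows are stored at roughly 6 bits per weight per the INT6 per-row quantization, before the final entropy coding):

```python
# Rough size of the additional tied-embedding rows when going SP1024 -> SP4096.
extra_params = (4096 - 1024) * 512      # 1,572,864 additional weights
extra_bytes = extra_params * 6 / 8      # ~6 bits/weight under INT6 quantization
print(f"{extra_bytes / 1e6:.2f} MB")    # ~1.18 MB pre-compression, close to the ~1.1MB quoted above
```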

SP4096 data from [sproos/parameter-golf-tokenizers](https://huggingface.co/sproos/parameter-golf-tokenizers), tokenized from the same FineWeb documents as the official SP1024 data (identical `docs_sha256`; see `data_lineage.md`).

### 2. WARMDOWN_WD_MULT=2.0

During LR warmdown, effective weight decay increases from 1x to 2x base WD. The mechanism: `group["weight_decay"] = base_wd * (1 + (mult - 1) * (1 - lr_scale))`, applied to all optimizer param groups before each step. Muon and AdamW both consume the updated WD via their standard `p.data.mul_(1.0 - lr * wd)` path.
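
A minimal sketch of that hook, assuming `lr_scale` is the usual warmdown multiplier decaying from 1.0 to 0.0; the function and argument names here are illustrative, only the formula is taken from the record:

```python
WARMDOWN_WD_MULT = 2.0

def ramp_weight_decay(optimizer, base_wds, lr_scale):
    """Call before each optimizer step. As lr_scale decays 1.0 -> 0.0 over warmdown,
    the effective weight decay ramps from base_wd to WARMDOWN_WD_MULT * base_wd."""
    for group, base_wd in zip(optimizer.param_groups, base_wds):
        group["weight_decay"] = base_wd * (1 + (WARMDOWN_WD_MULT - 1) * (1 - lr_scale))
```

Here `base_wds` would be captured once at optimizer construction (e.g. `[g["weight_decay"] for g in optimizer.param_groups]`), and the hook run on both the Muon and AdamW optimizers.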

This produces a more peaked post-quantization weight distribution (entropy 4.72 → 4.58 bits, zeros 8.3% → 11.4%), reducing brotli-compressed artifact size by ~1.5MB.
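
A sketch of how such statistics could be measured, assuming `codes` holds the INT6 quantization codes with 0 as the zero level; this is a diagnostic illustration, not the repo's code:

```python
import torch

def quantized_weight_stats(codes: torch.Tensor):
    """Shannon entropy (bits per weight) and zero fraction of the quantization codes."""
    _, counts = codes.flatten().unique(return_counts=True)
    p = counts.double() / counts.sum()
    entropy_bits = float(-(p * p.log2()).sum())
    zero_frac = float((codes == 0).double().mean())
    return entropy_bits, zero_frac
```

Lower-entropy, zero-heavy code distributions compress better under brotli/lzma, which is the mechanism behind the ~1.5MB saving.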

### 3. Brotli-11 Compression

Both lzma-9 and brotli-11 are computed; the smaller result is saved as the artifact. Brotli-11 was smaller on all 6 seeds. The load path auto-detects format (try lzma first, fall back to brotli).
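
A sketch of the pick-the-smaller save path and the auto-detecting load path (assumes the serialized state is already a `bytes` blob; the exact packaging in the repo may differ):

```python
import lzma
import brotli  # pip install Brotli

def compress_smaller(raw: bytes) -> bytes:
    """Try both codecs, keep whichever is smaller (brotli-11 won on all 6 seeds)."""
    lz = lzma.compress(raw, preset=9)
    br = brotli.compress(raw, quality=11)
    return br if len(br) < len(lz) else lz

def decompress_auto(blob: bytes) -> bytes:
    """Load path: try lzma first, fall back to brotli."""
    try:
        return lzma.decompress(blob)
    except lzma.LZMAError:
        return brotli.decompress(blob)
```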

### Why These Three Stack

WARMDOWN_WD_MULT=2.0 frees ~1.5MB of artifact budget through compression. This headroom absorbs SP4096's +1.1MB embedding cost. All 6 seeds fit under 16MB without selective pruning (0% on all seeds).
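
Back-of-envelope, taking the two deltas at face value: −1.5 MB of compression headroom plus +1.1 MB of embedding cost nets out to roughly −0.4 MB versus the #1019 base, consistent with the largest artifact here (15,697,830 bytes, seed 999) still fitting the 16 MB cap with pruning disabled.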

Without WARMDOWN_WD_MULT, SP4096 requires aggressive selective pruning (zeroing 57.5% of the ±1-quantized weights), which destroys quality: sliding-window BPB degrades from ~1.113 to ~1.136.

---

## Architecture

| Component | Setting |
|-----------|---------|
| Layers | 11 (512d, 8 GQA heads, 4 KV heads) |
| MLP | 3x expansion (hidden 1536) with LeakyReLU(0.5)^2 |
| Attention | XSA on all 11 layers |
| BigramHash | 3072 x dim=112 |
| Tokenizer | **SP4096** |
| Quantization | INT6 per-row, GPTQ with AR self-gen calibration |
| Compression | **Brotli-11 selected when smaller than LZMA-9** |
| Weight Decay | **WARMDOWN_WD_MULT=2.0** (ramps from 1x to 2x during warmdown) |
| WARMDOWN_ITERS | 4000 |

---

## Verification

- Manual BPB recompute matches logged value to 4e-6 (`bpb_verification.md`)
- SP4096 tokenized from same FineWeb documents as SP1024 baseline; `docs_sha256` identical (`data_lineage.md`)

---

## Reproduction

```bash
# Download SP4096 data
python3 -c "
from huggingface_hub import snapshot_download
snapshot_download('sproos/parameter-golf-tokenizers',
allow_patterns=['datasets/fineweb10B_sp4096/*', 'tokenizers/fineweb_4096_bpe.*'],
local_dir='./data')
"

# Run (8xH100 SXM)
VOCAB_SIZE=4096 \
DATA_PATH=./data/datasets/fineweb10B_sp4096 \
TOKENIZER_PATH=./data/tokenizers/fineweb_4096_bpe.model \
BIGRAM_VOCAB_SIZE=3072 BIGRAM_DIM=112 \
WARMDOWN_ITERS=4000 WARMDOWN_WD_MULT=2.0 \
SEED=314 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```
# BPB Verification

Manual byte-counting sanity check on the SP4096 validation shard.

## Method

1. Load SP4096 tokenizer (`fineweb_4096_bpe.model`)
2. Load val shard (`fineweb_val_000000.bin`): 44,848,122 tokens
3. For each target token, compute UTF-8 byte count using the same `build_sentencepiece_luts` logic as `train_gpt.py`
4. Compute BPB = (val_loss / ln2) * (tokens / bytes)
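
A sketch of step 4 using the seed-7 numbers reported below (the helper name is illustrative; `val_loss` is the mean next-token cross-entropy in nats):

```python
import math

def bits_per_byte(val_loss_nats: float, target_tokens: int, utf8_bytes: int) -> float:
    """BPB = (loss / ln 2) * (tokens / bytes)."""
    return (val_loss_nats / math.log(2)) * (target_tokens / utf8_bytes)

print(bits_per_byte(2.59339416, 44_848_121, 150_755_442))  # ~1.113049 (seed 7)
```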

## Results

```
Val tokens: 44,848,122
Target tokens: 44,848,121
Total UTF-8 bytes: 150,755,442
Tokens per byte: 0.29748923

For seed 7 (val_loss = 2.59339416):
Manual BPB: 1.11304910
Reported BPB: 1.11304546
Difference: 0.0000036 (float64 accumulation order)
```

## Conclusion

Manual computation matches the reported BPB to within 3.6e-6; the residual is attributable to accumulation order (sequential loop vs batched GPU reduction), not to the formula. The BPB calculation is correct.
# Data Lineage Verification

SP4096 data from [sproos/parameter-golf-tokenizers](https://huggingface.co/sproos/parameter-golf-tokenizers) is tokenized from the **same FineWeb documents** as the official SP1024 data in [willdepueoai/parameter-golf](https://huggingface.co/datasets/willdepueoai/parameter-golf).

## Cryptographic Hash Match

`docs_selected.source_manifest.json` is **byte-for-byte identical** in both repos:

| Field | Official (willdepueoai) | Sproos |
|-------|------------------------|--------|
| `docs_sha256` | `84386dfa7b339a...d19bc7` | `84386dfa7b339a...d19bc7` |
| `num_docs` | 15,368,808 | 15,368,808 |
| `docs_val` | 50,000 | 50,000 |
| `docs_train` | 15,318,808 | 15,318,808 |
| `docs_bytes` | 48,166,275,520 | 48,166,275,520 |
| `selection_seed` | 1337 | 1337 |
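
A hedged sketch of this byte-for-byte check; the manifest's path inside each repo is an assumption here (`willdepueoai/parameter-golf` is a dataset repo, `sproos/parameter-golf-tokenizers` a model repo):

```python
import hashlib
from huggingface_hub import hf_hub_download

def manifest_sha256(repo_id: str, repo_type: str) -> str:
    """Download docs_selected.source_manifest.json and hash its raw bytes."""
    path = hf_hub_download(repo_id, "docs_selected.source_manifest.json", repo_type=repo_type)
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

official = manifest_sha256("willdepueoai/parameter-golf", repo_type="dataset")
sproos = manifest_sha256("sproos/parameter-golf-tokenizers", repo_type="model")
assert official == sproos  # identical manifest files in both repos
```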

## Val Token Counts

Same 50,000 documents, different tokenizations:

| Tokenizer | Val Tokens | Val Bytes |
|-----------|-----------|-----------|
| Official SP1024 | 62,021,846 | ~151M |
| Sproos SP4096 | 44,847,738 | ~151M |

Byte count is identical (same UTF-8 text). Token count differs because SP4096 has a larger vocabulary and therefore produces fewer tokens per byte.
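
The arithmetic behind those ratios, using the exact byte count from `bpb_verification.md` in place of the rounded ~151M and taking the statement above that both tokenizations cover the same UTF-8 bytes:

```python
val_bytes = 150_755_442               # UTF-8 bytes of the shared 50k val docs
print(62_021_846 / val_bytes)         # SP1024: ~0.411 tokens/byte
print(44_847_738 / val_bytes)         # SP4096: ~0.297 tokens/byte
```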

## Lineage Chain

1. Official FineWeb 50k eval docs selected with `selection_seed=1337`
2. Documents hashed: `docs_sha256 = 84386dfa7b339a...d19bc7`
3. Sproos retokenized the same documents with SP4096 BPE
4. Sproos's manifest references `remote_repo_id = "willdepueoai/parameter-golf"`
---
{
"author": "Joel Pfeiffer",
"github_id": "jpfeiffe",
"name": "SP4096 + Compressibility Regularization (WD=2.0) + Brotli-11",
"blurb": "SP4096 tokenizer + warmdown WD ramp (2x) + brotli-11 compression on the #1019 GPTQ+XSA stack. WD compression frees 1.5MB to absorb SP4096's larger embedding. 6-seed exact mean: 1.11348911 BPB, beating PR #1019's 1.11473509 BPB by 0.00125 (Welch t=-4.19, df=6.6, p=0.00289). No TTT, no eval-time compute.",
"date": "2026-04-09",
"track": "10min_16mb",
"val_loss": 2.59428,
"val_bpb": 1.11348911,
"val_loss_std": 0.00089,
"val_bpb_std": 0.00052927,
"seeds": [314, 42, 999, 1337, 2024, 7],
"seed_results": {
"314": {
"val_loss": 2.59556426,
"val_bpb": 1.11409842,
"artifact_bytes": 15665083,
"steps": 6699,
"step_avg_ms": 89.0
},
"42": {
"val_loss": 2.59603741,
"val_bpb": 1.11417990,
"artifact_bytes": 15667940,
"steps": 6664,
"step_avg_ms": 90.0
},
"999": {
"val_loss": 2.59439602,
"val_bpb": 1.11347544,
"artifact_bytes": 15697830,
"steps": 6659,
"step_avg_ms": 90.0
},
"1337": {
"val_loss": 2.59345526,
"val_bpb": 1.11307168,
"artifact_bytes": 15660616,
"steps": 6658,
"step_avg_ms": 90.0
},
"2024": {
"val_loss": 2.59343677,
"val_bpb": 1.11306375,
"artifact_bytes": 15693397,
"steps": 6664,
"step_avg_ms": 90.0
},
"7": {
"val_loss": 2.59339416,
"val_bpb": 1.11304546,
"artifact_bytes": 15686495,
"steps": 6659,
"step_avg_ms": 90.0
}
},
"comparison_baseline_pr": 1019,
"delta_vs_pr1019_bpb": -0.00124599,
"t_statistic": -4.19,
"welch_df": 6.6,
"p_value_one_sided": 0.00289,
"artifact_bytes_mean": 15678560,
"artifact_bytes_max": 15697830,
"train_steps_mean": 6667,
"step_avg_ms_mean": 89.8,
"hardware": "8xH100 80GB SXM",
"pytorch_version": "2.9.1+cu128",
"flash_attn_version": "2.8.3 (FA3 Hopper kernels)",
"tokenizer": "SP4096 (from sproos/parameter-golf-tokenizers)",
"data_source": "sproos/parameter-golf-tokenizers (docs_sha256 matches official willdepueoai/parameter-golf)",
"no_eval_time_compute": true,
"techniques": [
"SP4096 tokenizer (larger vocabulary, fewer tokens per byte)",
"WARMDOWN_WD_MULT=2.0 (compressibility regularization during warmdown)",
"Brotli-11 compression (replacing LZMA-9)",
"AR self-gen GPTQ calibration (from #1019)",
"XSA on all 11 layers (from #1019)",
"BigramHash 3072x112 (from #1019)",
"Selective pruning disabled (0% on all seeds)"
],
"technique_summary": "SP4096 + WD=2.0 warmdown ramp + Brotli-11 on #1019 GPTQ+XSA base"
}