# Record: SP4096 + Compressibility Regularization

**val_bpb: 1.11349** (6-seed mean, std 0.00053) | **~15.68 MB** | 8xH100 SXM, 600s | No TTT

## Results

| Seed | Steps | ms/step | Pre-quant BPB | **Sliding BPB** | Artifact (bytes) | Pruning |
|------|-------|---------|---------------|-----------------|----------|---------|
| 314 | 6,699 | 89 | 1.1260 | **1.11410** | 15,665,083 | 0% |
| 42 | 6,664 | 90 | 1.1261 | **1.11418** | 15,667,940 | 0% |
| 999 | 6,659 | 90 | 1.1255 | **1.11348** | 15,697,830 | 0% |
| 1337 | 6,658 | 90 | 1.1253 | **1.11307** | 15,660,616 | 0% |
| 2024 | 6,664 | 90 | 1.1261 | **1.11306** | 15,693,397 | 0% |
| 7 | 6,659 | 90 | 1.1255 | **1.11305** | 15,686,495 | 0% |
| **Mean** | | | | **1.11349** | | |

Exact 6-seed mean: **1.11348911 BPB**. Current merged SOTA (PR #1019) exact 3-seed mean: **1.11473509 BPB**. Welch's t-test: **t = -4.19**, **df = 6.6**, **p = 0.00289** (one-sided).

No TTT, no n-gram cache, no eval-time logit bias. All gains are from training-side changes.

---

## Changes

Three changes to the PR #1019 base:

### 1. SP4096 Tokenizer

Vocabulary size increases from 1024 (SP1024) to 4096 (SP4096). Tokens-per-byte drops from ~0.41 to ~0.30 on the shared validation text (see `data_lineage.md`), so each training step covers more bytes of context. The tied embedding grows from 1024x512 to 4096x512, adding ~1.1MB to the artifact.
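
A back-of-envelope for the embedding cost (a sketch, assuming the extra rows are stored at roughly 6 bits per weight per the INT6 per-row quantization, before the final entropy coding):

```python
# Rough size of the additional tied-embedding rows when going SP1024 -> SP4096.
extra_params = (4096 - 1024) * 512      # 1,572,864 additional weights
extra_bytes = extra_params * 6 / 8      # ~6 bits/weight under INT6 quantization
print(f"{extra_bytes / 1e6:.2f} MB")    # ~1.18 MB pre-compression, close to the ~1.1MB quoted above
```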

SP4096 data from [sproos/parameter-golf-tokenizers](https://huggingface.co/sproos/parameter-golf-tokenizers), tokenized from the same FineWeb documents as the official SP1024 data (identical `docs_sha256`; see `data_lineage.md`).

### 2. WARMDOWN_WD_MULT=2.0

During LR warmdown, effective weight decay increases from 1x to 2x base WD. The mechanism: `group["weight_decay"] = base_wd * (1 + (mult - 1) * (1 - lr_scale))`, applied to all optimizer param groups before each step. Muon and AdamW both consume the updated WD via their standard `p.data.mul_(1.0 - lr * wd)` path.
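
A minimal sketch of that hook, assuming `lr_scale` is the usual warmdown multiplier decaying from 1.0 to 0.0; the function and argument names here are illustrative, only the formula is taken from the record:

```python
WARMDOWN_WD_MULT = 2.0

def ramp_weight_decay(optimizer, base_wds, lr_scale):
    """Call before each optimizer step. As lr_scale decays 1.0 -> 0.0 over warmdown,
    the effective weight decay ramps from base_wd to WARMDOWN_WD_MULT * base_wd."""
    for group, base_wd in zip(optimizer.param_groups, base_wds):
        group["weight_decay"] = base_wd * (1 + (WARMDOWN_WD_MULT - 1) * (1 - lr_scale))
```

Here `base_wds` would be captured once at optimizer construction (e.g. `[g["weight_decay"] for g in optimizer.param_groups]`), and the hook run on both the Muon and AdamW optimizers.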

This produces a more peaked post-quantization weight distribution (entropy 4.72 → 4.58 bits, zeros 8.3% → 11.4%), reducing brotli-compressed artifact size by ~1.5MB.
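
A sketch of how such statistics could be measured, assuming `codes` holds the INT6 quantization codes with 0 as the zero level; this is a diagnostic illustration, not the repo's code:

```python
import torch

def quantized_weight_stats(codes: torch.Tensor):
    """Shannon entropy (bits per weight) and zero fraction of the quantization codes."""
    _, counts = codes.flatten().unique(return_counts=True)
    p = counts.double() / counts.sum()
    entropy_bits = float(-(p * p.log2()).sum())
    zero_frac = float((codes == 0).double().mean())
    return entropy_bits, zero_frac
```

Lower-entropy, zero-heavy code distributions compress better under brotli/lzma, which is the mechanism behind the ~1.5MB saving.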

### 3. Brotli-11 Compression

Both lzma-9 and brotli-11 are computed; the smaller result is saved as the artifact. Brotli-11 was smaller on all 6 seeds. The load path auto-detects format (try lzma first, fall back to brotli).
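
A sketch of the pick-the-smaller save path and the auto-detecting load path (assumes the serialized state is already a `bytes` blob; the exact packaging in the repo may differ):

```python
import lzma
import brotli  # pip install Brotli

def compress_smaller(raw: bytes) -> bytes:
    """Try both codecs, keep whichever is smaller (brotli-11 won on all 6 seeds)."""
    lz = lzma.compress(raw, preset=9)
    br = brotli.compress(raw, quality=11)
    return br if len(br) < len(lz) else lz

def decompress_auto(blob: bytes) -> bytes:
    """Load path: try lzma first, fall back to brotli."""
    try:
        return lzma.decompress(blob)
    except lzma.LZMAError:
        return brotli.decompress(blob)
```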

### Why These Three Stack

WARMDOWN_WD_MULT=2.0 frees ~1.5MB of artifact budget through compression. This headroom absorbs SP4096's +1.1MB embedding cost. All 6 seeds fit under 16MB without selective pruning (0% on all seeds).
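
Back-of-envelope, taking the two deltas at face value: −1.5 MB of compression headroom plus +1.1 MB of embedding cost nets out to roughly −0.4 MB versus the #1019 base, consistent with the largest artifact here (15,697,830 bytes, seed 999) still fitting the 16 MB cap with pruning disabled.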

Without WARMDOWN_WD_MULT, SP4096 requires aggressive selective pruning (zeroing 57.5% of the ±1-quantized weights), which destroys quality: sliding-window BPB degrades from ~1.113 to ~1.136.

---

## Architecture

| Component | Setting |
|-----------|---------|
| Layers | 11 (512d, 8 GQA heads, 4 KV heads) |
| MLP | 3x expansion (hidden 1536) with LeakyReLU(0.5)^2 |
| Attention | XSA on all 11 layers |
| BigramHash | 3072 x dim=112 |
| Tokenizer | **SP4096** |
| Quantization | INT6 per-row, GPTQ with AR self-gen calibration |
| Compression | **Brotli-11 selected when smaller than LZMA-9** |
| Weight Decay | **WARMDOWN_WD_MULT=2.0** (ramps from 1x to 2x during warmdown) |
| WARMDOWN_ITERS | 4000 |

---

## Verification

- Manual BPB recompute matches logged value to 4e-6 (`bpb_verification.md`)
- SP4096 tokenized from same FineWeb documents as SP1024 baseline; `docs_sha256` identical (`data_lineage.md`)

---

## Reproduction

```bash
# Download SP4096 data
python3 -c "
from huggingface_hub import snapshot_download
snapshot_download('sproos/parameter-golf-tokenizers',
allow_patterns=['datasets/fineweb10B_sp4096/*', 'tokenizers/fineweb_4096_bpe.*'],
local_dir='./data')
"

# Run (8xH100 SXM)
VOCAB_SIZE=4096 \
DATA_PATH=./data/datasets/fineweb10B_sp4096 \
TOKENIZER_PATH=./data/tokenizers/fineweb_4096_bpe.model \
BIGRAM_VOCAB_SIZE=3072 BIGRAM_DIM=112 \
WARMDOWN_ITERS=4000 WARMDOWN_WD_MULT=2.0 \
SEED=314 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```
# BPB Verification

Manual byte-counting sanity check on the SP4096 validation shard.

## Method

1. Load SP4096 tokenizer (`fineweb_4096_bpe.model`)
2. Load val shard (`fineweb_val_000000.bin`): 44,848,122 tokens
3. For each target token, compute UTF-8 byte count using the same `build_sentencepiece_luts` logic as `train_gpt.py`
4. Compute BPB = (val_loss / ln2) * (tokens / bytes)
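
A sketch of step 4 using the seed-7 numbers reported below (the helper name is illustrative; `val_loss` is the mean next-token cross-entropy in nats):

```python
import math

def bits_per_byte(val_loss_nats: float, target_tokens: int, utf8_bytes: int) -> float:
    """BPB = (loss / ln 2) * (tokens / bytes)."""
    return (val_loss_nats / math.log(2)) * (target_tokens / utf8_bytes)

print(bits_per_byte(2.59339416, 44_848_121, 150_755_442))  # ~1.113049 (seed 7)
```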

## Results

```
Val tokens: 44,848,122
Target tokens: 44,848,121
Total UTF-8 bytes: 150,755,442
Tokens per byte: 0.29748923

For seed 7 (val_loss = 2.59339416):
Manual BPB: 1.11304910
Reported BPB: 1.11304546
Difference: 0.0000036 (float64 accumulation order)
```

## Conclusion

Manual computation matches the reported BPB to within 3.6e-6; the residual is attributable to accumulation order (sequential loop vs batched GPU reduction), not to the formula. The BPB calculation is correct.
# Data Lineage Verification

SP4096 data from [sproos/parameter-golf-tokenizers](https://huggingface.co/sproos/parameter-golf-tokenizers) is tokenized from the **same FineWeb documents** as the official SP1024 data in [willdepueoai/parameter-golf](https://huggingface.co/datasets/willdepueoai/parameter-golf).

## Cryptographic Hash Match

`docs_selected.source_manifest.json` is **byte-for-byte identical** in both repos:

| Field | Official (willdepueoai) | Sproos |
|-------|------------------------|--------|
| `docs_sha256` | `84386dfa7b339a...d19bc7` | `84386dfa7b339a...d19bc7` |
| `num_docs` | 15,368,808 | 15,368,808 |
| `docs_val` | 50,000 | 50,000 |
| `docs_train` | 15,318,808 | 15,318,808 |
| `docs_bytes` | 48,166,275,520 | 48,166,275,520 |
| `selection_seed` | 1337 | 1337 |
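
A hedged sketch of this byte-for-byte check; the manifest's path inside each repo is an assumption here (`willdepueoai/parameter-golf` is a dataset repo, `sproos/parameter-golf-tokenizers` a model repo):

```python
import hashlib
from huggingface_hub import hf_hub_download

def manifest_sha256(repo_id: str, repo_type: str) -> str:
    """Download docs_selected.source_manifest.json and hash its raw bytes."""
    path = hf_hub_download(repo_id, "docs_selected.source_manifest.json", repo_type=repo_type)
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

official = manifest_sha256("willdepueoai/parameter-golf", repo_type="dataset")
sproos = manifest_sha256("sproos/parameter-golf-tokenizers", repo_type="model")
assert official == sproos  # identical manifest files in both repos
```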

## Val Token Counts

Same 50,000 documents, different tokenizations:

| Tokenizer | Val Tokens | Val Bytes |
|-----------|-----------|-----------|
| Official SP1024 | 62,021,846 | ~151M |
| Sproos SP4096 | 44,847,738 | ~151M |

Byte count is identical (same UTF-8 text). Token count differs because SP4096 has a larger vocabulary and therefore produces fewer tokens per byte.
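
The arithmetic behind those ratios, using the exact byte count from `bpb_verification.md` in place of the rounded ~151M and taking the statement above that both tokenizations cover the same UTF-8 bytes:

```python
val_bytes = 150_755_442               # UTF-8 bytes of the shared 50k val docs
print(62_021_846 / val_bytes)         # SP1024: ~0.411 tokens/byte
print(44_847_738 / val_bytes)         # SP4096: ~0.297 tokens/byte
```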

## Lineage Chain

1. Official FineWeb 50k eval docs selected with `selection_seed=1337`
2. Documents hashed: `docs_sha256 = 84386dfa7b339a...d19bc7`
3. Sproos retokenized the same documents with SP4096 BPE
4. Sproos's manifest references `remote_repo_id = "willdepueoai/parameter-golf"`
---
{
"author": "Joel Pfeiffer",
"github_id": "jpfeiffe",
"name": "SP4096 + Compressibility Regularization (WD=2.0) + Brotli-11",
"blurb": "SP4096 tokenizer + warmdown WD ramp (2x) + brotli-11 compression on the #1019 GPTQ+XSA stack. WD compression frees 1.5MB to absorb SP4096's larger embedding. 6-seed exact mean: 1.11348911 BPB, beating PR #1019's 1.11473509 BPB by 0.00125 (Welch t=-4.19, df=6.6, p=0.00289). No TTT, no eval-time compute.",
"date": "2026-04-09",
"track": "10min_16mb",
"val_loss": 2.59428,
"val_bpb": 1.11348911,
"val_loss_std": 0.00089,
"val_bpb_std": 0.00052927,
"seeds": [314, 42, 999, 1337, 2024, 7],
"seed_results": {
"314": {
"val_loss": 2.59556426,
"val_bpb": 1.11409842,
"artifact_bytes": 15665083,
"steps": 6699,
"step_avg_ms": 89.0
},
"42": {
"val_loss": 2.59603741,
"val_bpb": 1.11417990,
"artifact_bytes": 15667940,
"steps": 6664,
"step_avg_ms": 90.0
},
"999": {
"val_loss": 2.59439602,
"val_bpb": 1.11347544,
"artifact_bytes": 15697830,
"steps": 6659,
"step_avg_ms": 90.0
},
"1337": {
"val_loss": 2.59345526,
"val_bpb": 1.11307168,
"artifact_bytes": 15660616,
"steps": 6658,
"step_avg_ms": 90.0
},
"2024": {
"val_loss": 2.59343677,
"val_bpb": 1.11306375,
"artifact_bytes": 15693397,
"steps": 6664,
"step_avg_ms": 90.0
},
"7": {
"val_loss": 2.59339416,
"val_bpb": 1.11304546,
"artifact_bytes": 15686495,
"steps": 6659,
"step_avg_ms": 90.0
}
},
"comparison_baseline_pr": 1019,
"delta_vs_pr1019_bpb": -0.00124599,
"t_statistic": -4.19,
"welch_df": 6.6,
"p_value_one_sided": 0.00289,
"artifact_bytes_mean": 15678560,
"artifact_bytes_max": 15697830,
"train_steps_mean": 6667,
"step_avg_ms_mean": 89.8,
"hardware": "8xH100 80GB SXM",
"pytorch_version": "2.9.1+cu128",
"flash_attn_version": "2.8.3 (FA3 Hopper kernels)",
"tokenizer": "SP4096 (from sproos/parameter-golf-tokenizers)",
"data_source": "sproos/parameter-golf-tokenizers (docs_sha256 matches official willdepueoai/parameter-golf)",
"no_eval_time_compute": true,
"techniques": [
"SP4096 tokenizer (larger vocabulary, fewer tokens per byte)",
"WARMDOWN_WD_MULT=2.0 (compressibility regularization during warmdown)",
"Brotli-11 compression (replacing LZMA-9)",
"AR self-gen GPTQ calibration (from #1019)",
"XSA on all 11 layers (from #1019)",
"BigramHash 3072x112 (from #1019)",
"Selective pruning disabled (0% on all seeds)"
],
"technique_summary": "SP4096 + WD=2.0 warmdown ramp + Brotli-11 on #1019 GPTQ+XSA base"
}