diff --git a/records/track_non_record_16mb/2026-03-31_CustomTokenizer_SplitDigits_WebSymbols/README.md b/records/track_non_record_16mb/2026-03-31_CustomTokenizer_SplitDigits_WebSymbols/README.md new file mode 100644 index 0000000000..0a8cc06131 --- /dev/null +++ b/records/track_non_record_16mb/2026-03-31_CustomTokenizer_SplitDigits_WebSymbols/README.md @@ -0,0 +1,219 @@ +# Custom Tokenizer for Parameter Golf: Web-Content Symbols + split_digits=false + +**Non-Record Submission (Untested - Community Contribution)** +**Author:** Mikeapedia ([@mikeapedia](https://github.com/mikeapedia)) +**Date:** 2026-03-31 + +--- + +> **I have not tested this.** I don't currently have H100 access, so I'm sharing the idea, the tooling, and a ready-to-use pre-tokenized dataset with the community in the hope that someone will pick it up and run with it. The tokenizer is trained, the corpus is retokenized, and everything is on HuggingFace -- all you need is a GPU and two env vars. If you test it, please post results in the PR comments or Discord, even if it's a negative result. That would still tell us whether tokenizer optimization is a viable axis for this competition. + +--- + +## The Short Version + +The competition allows custom tokenizers ("we let you bring your own"), but nobody has tried one yet. I trained a custom SentencePiece BPE tokenizer optimized for FineWeb's web-crawled text, with two key changes from the default: + +1. **`split_digits=false`**: Keep number sequences as single tokens (e.g., "2024" = 1 token instead of 4) +2. **10 `user_defined_symbols`** for common web patterns (URLs, TLDs) + +The tokenizer and pre-tokenized binary shards are uploaded to HuggingFace. The existing `train_gpt.py` supports custom tokenizers out of the box via `TOKENIZER_PATH` and `DATA_PATH` env vars -- no code changes needed. + +--- + +## Motivation + +The default SentencePiece tokenizer treats all text equally. But FineWeb is web-crawled content, and certain patterns appear with high frequency: + +- URLs (`https://www.example.com`) consume many tokens because "https", "://", "www", ".", "com" are each separate BPE merges +- Numbers like years ("2024"), prices ("$199"), and IDs get split digit-by-digit with `split_digits=true` (the default) + +By pre-defining common web patterns as atomic symbols and keeping digit sequences intact, we hypothesize the tokenizer can represent web text more efficiently, potentially improving bpb. + +--- + +## Approach + +### Step 1: Corpus Frequency Analysis + +Before choosing symbols, I analyzed 100,000 FineWeb documents to measure actual pattern frequencies. This was critical because **FineWeb is cleaned text, not raw HTML**. Many HTML-oriented symbols (tags, attributes) that seem obvious are actually very rare. + +**Key findings from `analyze_patterns.py`:** + +| Pattern | Hits per 100K docs | Status | +|---------|-------------------|--------| +| `.com` | 40,236 | Included | +| `https://` | 30,662 | Included | +| `http://` | 13,789 | Included | +| `www.` | 16,555 | Included | +| `.org` | 12,989 | Included | +| `.net` | 4,283 | Included | +| `.html` | 2,145 | Included | +| `.gov` | 1,890 | Included | +| `.edu` | 1,522 | Included | +| `.co.uk` | 978 | Included | +| `` | 156 | **Excluded** | + +The cutoff was ~500 hits per 100K docs. Below that, the symbol wastes a BPE merge slot for negligible frequency savings. 
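+
+Before committing GPU time, the trained tokenizer (Step 3 below) can be sanity-checked directly. Here is a minimal sketch, assuming the model has been downloaded to `data/tokenizers/fineweb_1024_custom.model` as described under "Pre-Built Data on HuggingFace" below; it prints how numbers and URLs come out of the custom tokenizer:
+
+```python
+import sentencepiece as spm
+
+sp = spm.SentencePieceProcessor(model_file="data/tokenizers/fineweb_1024_custom.model")
+
+for sample in ["The 2024 report cost $199.", "Visit https://www.example.com today."]:
+    pieces = sp.encode(sample, out_type=str)
+    print(f"{sample!r} -> {len(pieces)} tokens: {pieces}")
+
+# "https://", "www.", and ".com" should each appear as single pieces (that is what
+# user_defined_symbols guarantees); "2024" is merely *allowed* to stay whole --
+# split_digits=false removes the forced per-digit split, but whether it actually
+# merges into one piece depends on the learned BPE merges.
+```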
+ +### Step 2: Understanding the Budget + +With `vocab_size=1024`: +- 3 control tokens (unk=0, bos=1, eos=2) +- 256 byte fallback tokens +- 10 user_defined_symbols +- **755 BPE merges remaining** (vs 765 with the default tokenizer) + +Each user_defined_symbol costs one BPE merge slot. The 10 symbols we chose are high-frequency enough to justify that cost. + +### Step 3: Tokenizer Training + +```python +spm.SentencePieceTrainer.train( + sentence_iterator=corpus_iterator, + model_prefix="fineweb_1024_custom", + vocab_size=1024, + model_type="bpe", + byte_fallback=True, + split_digits=False, # Key change #1 + user_defined_symbols=[ # Key change #2 + "http://", "https://", "www.", + ".com", ".org", ".net", + ".gov", ".html", ".edu", ".co.uk", + ], + max_sentence_length=16384, + bos_id=1, eos_id=2, unk_id=0, + num_threads=16, + input_sentence_size=50_000_000, + shuffle_input_sentence=True, + train_extremely_large_corpus=True, +) +``` + +Trained on 5,000,000 documents from `docs_selected.jsonl` (per `manifest.json`). + +### Step 4: Retokenization + +The entire FineWeb corpus was retokenized into competition-format binary shards: +- **Format**: 256 x int32 header + N x uint16 LE tokens (same as original) +- **Val**: 63,770,657 tokens (1 shard) +- **Train**: 8,000,000,316 tokens (81 shards) + +The custom tokenizer produces **2.8% more val tokens** than the original (63.8M vs 62.0M), meaning it's slightly less compressive overall. However, the hypothesis is that better token boundaries (keeping numbers intact, treating URLs as units) may improve model learning despite the slightly longer sequences. + +--- + +## Pre-Built Data on HuggingFace + +Everything is uploaded and ready to use: + +**Repository:** [Mikeapedia/parameter-golf-data](https://huggingface.co/datasets/Mikeapedia/parameter-golf-data) + +Contents: +``` +tokenizers/ + fineweb_1024_custom.model # 254KB SentencePiece model + +datasets/ + fineweb10B_sp1024_custom/ + fineweb_val_000000.bin # Validation shard + fineweb_train_000000.bin # Training shards (81 files) + ... + fineweb_train_000080.bin +``` + +### Download + +```bash +# Install HuggingFace CLI +pip install huggingface_hub + +# Download tokenizer +huggingface-cli download Mikeapedia/parameter-golf-data \ + tokenizers/fineweb_1024_custom.model \ + --local-dir ./data + +# Download all shards (~16GB) +huggingface-cli download Mikeapedia/parameter-golf-data \ + --include "datasets/fineweb10B_sp1024_custom/*" \ + --local-dir ./data +``` + +--- + +## How to Use + +The existing `train_gpt.py` already supports custom tokenizers via environment variables. No code changes needed: + +```bash +TOKENIZER_PATH=data/tokenizers/fineweb_1024_custom.model \ +DATA_PATH=data/datasets/fineweb10B_sp1024_custom \ +torchrun --nproc_per_node=8 train_gpt.py +``` + +`train_gpt.py`'s `build_sentencepiece_luts()` (line ~589) dynamically reads byte counts from any SentencePiece model file, so the BPB calculation adjusts automatically. + +--- + +## What BPB Means with a Custom Tokenizer + +BPB (bits per byte) normalizes across tokenizers: + +``` +BPB = (cross_entropy_loss / ln(2)) * (num_tokens / num_bytes) +``` + +A tokenizer that produces more tokens increases the `tokens/bytes` ratio, but if the model learns better representations from cleaner token boundaries, the `loss` term should decrease enough to compensate. The net effect on BPB is what matters -- and that's what we need H100 compute to measure. 
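+
+As a concrete check on that trade-off, the normalization can be written out in a few lines of Python. This is only a sketch of the arithmetic -- the loss and byte count below are placeholders, not measurements; the scored value comes from `train_gpt.py` itself:
+
+```python
+import math
+
+def bits_per_byte(loss_nats_per_token: float, num_tokens: int, num_bytes: int) -> float:
+    """Cross-entropy in nats/token converted to bits/byte, per the formula above."""
+    return (loss_nats_per_token / math.log(2)) * (num_tokens / num_bytes)
+
+VAL_BYTES = 250_000_000      # placeholder -- substitute the real val-set byte count
+BASELINE_LOSS = 1.00         # placeholder nats/token for the default tokenizer
+
+# Default tokenizer: 62.0M val tokens. Custom tokenizer: 63,770,657 val tokens (+2.8%).
+# The custom tokenizer breaks even on BPB exactly when its loss drops by that same 2.8%:
+print(bits_per_byte(BASELINE_LOSS, 62_000_000, VAL_BYTES))
+print(bits_per_byte(BASELINE_LOSS * 62_000_000 / 63_770_657, 63_770_657, VAL_BYTES))  # identical
+```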
+ +--- + +## Scripts Included + +- **`retokenize.py`**: End-to-end pipeline that trains the tokenizer and retokenizes the corpus into binary shards. Supports `--skip-train-tokenizer` and `--skip-retokenize` flags. +- **`analyze_patterns.py`**: Frequency analysis tool that scans FineWeb documents for web patterns and ranks them by occurrence count. + +--- + +## Looking for Someone to Test This + +**I can't test this myself -- no H100 access.** Everything is ready to go: the tokenizer is trained, 82 binary shards are on HuggingFace, and `train_gpt.py` supports it natively with zero code changes. If you have compute and want to try something nobody else has explored yet, here's what to do: + +```bash +# 1. Download (~16GB) +huggingface-cli download Mikeapedia/parameter-golf-data --local-dir ./data + +# 2. Train (identical to normal, just two env vars) +TOKENIZER_PATH=data/tokenizers/fineweb_1024_custom.model \ +DATA_PATH=data/datasets/fineweb10B_sp1024_custom \ +torchrun --nproc_per_node=8 train_gpt.py +``` + +**Questions I'd love answered:** + +1. Does the custom tokenizer improve, hurt, or not affect val_bpb? +2. Does `split_digits=false` help with number-heavy validation passages? +3. Is the 2.8% token count increase a problem for training throughput? + +Please share results in the PR comments or on Discord -- even a negative result tells us something useful about whether tokenizer optimization matters for this competition. + +--- + +## Reproducing the Tokenizer + +If you want to retrain the tokenizer or retokenize with different settings: + +```bash +# Full pipeline (requires docs_selected.jsonl from FineWeb) +python retokenize.py + +# Skip tokenizer training, just retokenize with existing model +python retokenize.py --skip-train-tokenizer + +# Train on fewer shards for testing +python retokenize.py --train-shards 5 +``` + +The tokenizer training takes ~15 minutes on a modern CPU. Retokenization of 82 shards takes ~2 hours. diff --git a/records/track_non_record_16mb/2026-03-31_CustomTokenizer_SplitDigits_WebSymbols/analyze_patterns.py b/records/track_non_record_16mb/2026-03-31_CustomTokenizer_SplitDigits_WebSymbols/analyze_patterns.py new file mode 100644 index 0000000000..2b737a3213 --- /dev/null +++ b/records/track_non_record_16mb/2026-03-31_CustomTokenizer_SplitDigits_WebSymbols/analyze_patterns.py @@ -0,0 +1,225 @@ +"""Analyze docs_selected.jsonl for high-frequency web patterns to inform user_defined_symbols.""" +import json +import sys +from collections import Counter +from pathlib import Path + +SAMPLE_SIZE = 100_000 + +# Patterns to count — grouped by category +PATTERNS = { + # Already in user's list (for comparison) + "http://": "url", + "https://": "url", + "www.": "url", + ".com": "tld", + ".org": "tld", + ".net": "tld", + ".io": "tld", + "": "html", + "href=": "html-attr", + "src=": "html-attr", + "class=": "html-attr", + "id=": "html-attr", + # Candidate additions — HTML close tags + "": "html-close", + "
</p>": "html-close",
+    # (representative common tags; anything rare simply falls below the frequency cutoff)
+    "</div>": "html-close",
+    "</a>": "html-close",
+    "</li>": "html-close",
+    "</span>": "html-close",
+    "</td>": "html-close",
+    "</tr>": "html-close",
+    "</table>": "html-close",
+    "</ul>": "html-close",
+    "</ol>": "html-close",
+    "</h1>": "html-close",
+    "</h2>": "html-close",
+    "</h3>": "html-close",
+    "</strong>": "html-close",
+    "</em>": "html-close",
+    "</b>": "html-close",
+    "</i>": "html-close",
+    "</body>": "html-close",
+    "</head>": "html-close",
+    "</html>": "html-close",
+    # Candidate additions — HTML open/self-closing
+    "<p>": "html-open",
+    "<br>": "html-open",
+    "<br/>": "html-open",
+    "<hr>
": "html-open", + "': "punct", + "'>": "punct", + "": "comment", +} + +def main(): + jsonl_path = Path(__file__).parent / "docs_selected.jsonl" + if not jsonl_path.exists(): + print(f"ERROR: {jsonl_path} not found", file=sys.stderr) + sys.exit(1) + + counts = Counter() + doc_counts = Counter() # how many docs contain each pattern (for coverage) + total_bytes = 0 + + print(f"Sampling {SAMPLE_SIZE:,} docs from {jsonl_path.name}...") + with open(jsonl_path, "r", encoding="utf-8") as f: + for i, line in enumerate(f): + if i >= SAMPLE_SIZE: + break + try: + doc = json.loads(line) + text = doc.get("text", "") + except json.JSONDecodeError: + continue + + total_bytes += len(text.encode("utf-8")) + seen = set() + for pattern in PATTERNS: + c = text.count(pattern) + if c > 0: + counts[pattern] += c + if pattern not in seen: + doc_counts[pattern] += 1 + seen.add(pattern) + + if (i + 1) % 10_000 == 0: + print(f" processed {i+1:,} docs...") + + print(f"\nAnalyzed {min(i+1, SAMPLE_SIZE):,} docs, {total_bytes:,} bytes total") + print(f"Average doc size: {total_bytes / min(i+1, SAMPLE_SIZE):,.0f} bytes\n") + + # Sort by total occurrences descending + print(f"{'Pattern':<25} {'Category':<12} {'Total Hits':>12} {'Docs w/ Pattern':>16} {'Hits/Doc':>10}") + print("-" * 80) + + already_in_list = { + "http://", "https://", "www.", ".com", ".org", ".net", ".io", + "", "href=", "src=", "class=", "id=" + } + + # Print patterns already in the user's list + print("\n=== ALREADY IN YOUR LIST ===") + for pattern, total in counts.most_common(): + if pattern in already_in_list: + cat = PATTERNS[pattern] + docs = doc_counts[pattern] + ratio = total / min(i+1, SAMPLE_SIZE) + print(f" {pattern:<23} {cat:<12} {total:>12,} {docs:>16,} {ratio:>10.1f}") + + # Print candidate additions sorted by frequency + print("\n=== CANDIDATE ADDITIONS (sorted by total hits) ===") + candidates = [(p, c) for p, c in counts.most_common() if p not in already_in_list] + for pattern, total in candidates: + cat = PATTERNS[pattern] + docs = doc_counts[pattern] + ratio = total / min(i+1, SAMPLE_SIZE) + marker = " ***" if total >= 10_000 else "" + print(f" {pattern:<23} {cat:<12} {total:>12,} {docs:>16,} {ratio:>10.1f}{marker}") + + # Summary: top candidates above threshold + print("\n=== TOP CANDIDATES (>= 10K total hits in 100K docs) ===") + top = [(p, c) for p, c in candidates if c >= 10_000] + for pattern, total in top: + cat = PATTERNS[pattern] + bytes_saved = len(pattern.encode("utf-8")) - 1 # bytes saved per occurrence (1 token instead of N bytes) + print(f" {pattern:<23} {total:>12,} hits ({bytes_saved} bytes saved/hit)") + + print(f"\n Total top candidates: {len(top)}") + print(f" Current symbols: {len(already_in_list)}") + print(f" Proposed total: {len(already_in_list) + len(top)}") + print(f" BPE merges remaining: {1024 - 3 - 256 - len(already_in_list) - len(top)}") + + +if __name__ == "__main__": + main() diff --git a/records/track_non_record_16mb/2026-03-31_CustomTokenizer_SplitDigits_WebSymbols/retokenize.py b/records/track_non_record_16mb/2026-03-31_CustomTokenizer_SplitDigits_WebSymbols/retokenize.py new file mode 100644 index 0000000000..9daec224b8 --- /dev/null +++ b/records/track_non_record_16mb/2026-03-31_CustomTokenizer_SplitDigits_WebSymbols/retokenize.py @@ -0,0 +1,329 @@ +"""Train a custom SentencePiece tokenizer and retokenize docs_selected.jsonl into binary shards. 
+ +Usage: + # Full pipeline: train tokenizer + retokenize + uv run data/retokenize.py + + # Only train tokenizer + uv run data/retokenize.py --skip-retokenize + + # Only retokenize (reuse existing .model) + uv run data/retokenize.py --skip-train-tokenizer + + # Limit training shards for faster iteration + uv run data/retokenize.py --train-shards 10 +""" +import argparse +import functools +import json +import os +import random +import sys +import time +from pathlib import Path + +import numpy as np + +# Force unbuffered output so progress is visible when piped +print = functools.partial(print, flush=True) + +SHARD_MAGIC = 20240520 +SHARD_VERSION = 1 +HEADER_INTS = 256 +DEFAULT_SHARD_SIZE = 100_000_000 # tokens per shard (from manifest) +DEFAULT_NUM_VAL_DOCS = 50_000 +DEFAULT_TOKENIZER_TRAIN_DOCS = 5_000_000 +DEFAULT_SHUFFLE_SEED = 1337 + +USER_DEFINED_SYMBOLS = [ + "http://", "https://", "www.", + ".com", ".org", ".net", + ".gov", ".html", ".edu", ".co.uk", +] + + +def iter_jsonl_texts(path: Path, limit: int | None = None): + """Yield document texts from a JSONL file, one per line.""" + count = 0 + with open(path, "r", encoding="utf-8") as f: + for line in f: + if limit is not None and count >= limit: + break + try: + doc = json.loads(line) + text = doc.get("text", "") + if text: + yield text + count += 1 + except json.JSONDecodeError: + continue + + +def train_tokenizer( + input_path: Path, + model_prefix: str, + vocab_size: int, + num_train_docs: int, +): + """Train a SentencePiece BPE tokenizer on the first num_train_docs documents.""" + import sentencepiece as spm + + print(f"Training SentencePiece BPE tokenizer (vocab_size={vocab_size})...") + print(f" Training docs: {num_train_docs:,}") + print(f" user_defined_symbols: {USER_DEFINED_SYMBOLS}") + print(f" split_digits: False") + print(f" max_sentence_length: 16384") + print(f" Output: {model_prefix}.model") + + # Create an iterator that yields text strings for training + doc_iter = iter_jsonl_texts(input_path, limit=num_train_docs) + + t0 = time.time() + spm.SentencePieceTrainer.train( + sentence_iterator=doc_iter, + model_prefix=model_prefix, + model_type="bpe", + vocab_size=vocab_size, + # User's requested settings + split_digits=False, + user_defined_symbols=USER_DEFINED_SYMBOLS, + max_sentence_length=16384, + # Required for byte-level coverage (BPB scoring uses sp.is_byte()) + byte_fallback=True, + character_coverage=0.9995, + # Match original tokenizer control token IDs + unk_id=0, + bos_id=1, + eos_id=2, + pad_id=-1, + # Performance + num_threads=os.cpu_count() or 4, + train_extremely_large_corpus=True, + ) + elapsed = time.time() - t0 + print(f" Tokenizer trained in {elapsed:.1f}s") + + # Verify + sp = spm.SentencePieceProcessor(model_file=f"{model_prefix}.model") + actual_vocab = sp.vocab_size() + print(f" Actual vocab size: {actual_vocab}") + if actual_vocab != vocab_size: + print(f" WARNING: vocab size mismatch! 
Expected {vocab_size}, got {actual_vocab}") + sys.exit(1) + + # Verify user_defined_symbols are single tokens + for sym in USER_DEFINED_SYMBOLS: + tokens = sp.encode(sym, out_type=str) + token_strs = [t.replace("\u2581", "_") for t in tokens] # safe for Windows console + if len(tokens) != 1 or tokens[0].replace("\u2581", "") != sym: + encoded_ids = sp.encode(sym) + print(f" Note: '{sym}' encodes as {token_strs} (ids: {encoded_ids})") + + print(" Tokenizer training complete.") + return sp + + +class ShardWriter: + """Accumulates token IDs and writes shards in the competition binary format.""" + + def __init__(self, output_dir: Path, prefix: str, shard_size: int): + self.output_dir = output_dir + self.prefix = prefix + self.shard_size = shard_size + self.shard_idx = 0 + self.buffer = [] + self.buffer_len = 0 + self.total_tokens = 0 + + def add_tokens(self, ids: list[int]): + self.buffer.extend(ids) + self.buffer_len += len(ids) + while self.buffer_len >= self.shard_size: + self._flush_shard(self.shard_size) + + def _flush_shard(self, count: int): + tokens = np.array(self.buffer[:count], dtype=" 0: + tokens = np.array(self.buffer, dtype="= max_train_tokens: + # Stop after enough tokens for the requested number of shards + break + if (i + 1) % 100_000 == 0: + elapsed = time.time() - t0 + rate = (i + 1) / elapsed + remaining = (len(train_texts) - i - 1) / rate if rate > 0 else 0 + print(f" train: {i+1:,}/{len(train_texts):,} docs ({elapsed:.0f}s, ~{remaining:.0f}s remaining)") + train_writer.close() + train_elapsed = time.time() - t0 + print(f" Train complete: {train_writer.total_tokens:,} tokens in {train_writer.shard_idx} shard(s) ({train_elapsed:.1f}s)") + + # Summary + total = val_writer.total_tokens + train_writer.total_tokens + print(f"\n Summary:") + print(f" Total tokens: {total:,}") + print(f" Val tokens: {val_writer.total_tokens:,} ({val_writer.shard_idx} shards)") + print(f" Train tokens: {train_writer.total_tokens:,} ({train_writer.shard_idx} shards)") + print(f" Avg tokens/doc (val): {val_writer.total_tokens / len(val_texts):,.1f}") + + +def build_parser() -> argparse.ArgumentParser: + parser = argparse.ArgumentParser(description="Train custom tokenizer and retokenize corpus") + parser.add_argument( + "--input", type=Path, + default=Path(__file__).parent / "docs_selected.jsonl", + help="Path to docs_selected.jsonl", + ) + parser.add_argument( + "--output-dir", type=Path, + default=Path(__file__).parent / "datasets" / "fineweb10B_sp1024_custom", + help="Output directory for binary shards", + ) + parser.add_argument( + "--tokenizer-prefix", type=str, + default=str(Path(__file__).parent / "tokenizers" / "fineweb_1024_custom"), + help="Output prefix for tokenizer model (without .model extension)", + ) + parser.add_argument("--vocab-size", type=int, default=1024) + parser.add_argument("--train-docs", type=int, default=DEFAULT_TOKENIZER_TRAIN_DOCS, + help="Number of docs for tokenizer training") + parser.add_argument("--shard-size", type=int, default=DEFAULT_SHARD_SIZE, + help="Tokens per shard") + parser.add_argument("--num-val-docs", type=int, default=DEFAULT_NUM_VAL_DOCS) + parser.add_argument("--shuffle-seed", type=int, default=DEFAULT_SHUFFLE_SEED) + parser.add_argument("--train-shards", type=int, default=None, + help="Limit number of training shards to produce") + parser.add_argument("--skip-train-tokenizer", action="store_true", + help="Skip tokenizer training, reuse existing .model") + parser.add_argument("--skip-retokenize", action="store_true", + help="Only train tokenizer, 
skip retokenization") + return parser + + +def main(): + args = build_parser().parse_args() + + if not args.input.exists(): + print(f"ERROR: {args.input} not found. Run:") + print(f" uv run data/cached_challenge_fineweb.py --with-docs --train-shards 0") + sys.exit(1) + + tokenizer_model_path = Path(f"{args.tokenizer_prefix}.model") + + # Phase A: Train tokenizer + if not args.skip_train_tokenizer: + Path(args.tokenizer_prefix).parent.mkdir(parents=True, exist_ok=True) + train_tokenizer( + input_path=args.input, + model_prefix=args.tokenizer_prefix, + vocab_size=args.vocab_size, + num_train_docs=args.train_docs, + ) + else: + if not tokenizer_model_path.exists(): + print(f"ERROR: --skip-train-tokenizer but {tokenizer_model_path} not found") + sys.exit(1) + print(f"Skipping tokenizer training, using {tokenizer_model_path}") + + # Phase B: Retokenize + if not args.skip_retokenize: + retokenize( + input_path=args.input, + tokenizer_path=tokenizer_model_path, + output_dir=args.output_dir, + shard_size=args.shard_size, + num_val_docs=args.num_val_docs, + shuffle_seed=args.shuffle_seed, + max_train_shards=args.train_shards, + ) + else: + print("Skipping retokenization.") + + print("\nDone! To train with the custom tokenizer:") + print(f" TOKENIZER_PATH={tokenizer_model_path} \\") + print(f" DATA_PATH={args.output_dir} \\") + print(f" torchrun --standalone --nproc_per_node=1 train_gpt.py") + + +if __name__ == "__main__": + main() diff --git a/records/track_non_record_16mb/2026-03-31_CustomTokenizer_SplitDigits_WebSymbols/submission.json b/records/track_non_record_16mb/2026-03-31_CustomTokenizer_SplitDigits_WebSymbols/submission.json new file mode 100644 index 0000000000..b1743f3e94 --- /dev/null +++ b/records/track_non_record_16mb/2026-03-31_CustomTokenizer_SplitDigits_WebSymbols/submission.json @@ -0,0 +1,12 @@ +{ + "track": "non_record_16mb", + "date": "2026-03-31", + "name": "Custom Tokenizer: split_digits=false + Web Symbols", + "author": "Mikeapedia", + "github_id": "mikeapedia", + "blurb": "Custom SentencePiece BPE with web-content user_defined_symbols and split_digits=false. Untested - seeking H100 access to evaluate.", + "val_bpb": null, + "val_loss": null, + "bytes_total": null, + "bytes_code": null +}