
Record: PR #1735 + CaseOps Tokenizer V15 (val_bpb 1.03540, mean of 3 seeds) #1738

Open

alertcat wants to merge 8 commits into openai:main from alertcat:v15-pr1735-caseops

Conversation

@alertcat

Record: PR #1735 + CaseOps Tokenizer (V15) — val_bpb 1.03540

Summary

  • 3-seed mean val_bpb: 1.03540 (std 0.00057) on 8×H100 SXM
  • Improvement: −0.0075 BPB vs current SOTA (PR #1735 "Record: SP8192 + Parallel Pre-Quant TTT", @AjAnubolu, val_bpb 1.0429)
  • Beats record threshold (−0.005 nats = −0.0072 BPB) by 0.00029 BPB
  • All 3 seeds: 1.03484, 1.03618, 1.03519 — all under 1.040
  • Artifact: 15.996 MB (3.8 KB margin under 16MB limit, code LZMA-wrapped)

Innovation

Combined two unmerged frontier techniques for the first time:

  1. PR #1735 (@AjAnubolu) base stack: SP8192 + 3-Layer Recurrence + Parallel Residuals + QK-Gain 5.25 + 8-GPU Parallel Pre-Quant AdamW TTT (21 epochs, epoch-level cosine LR, federated averaging) + GPTQ SDClip + Brotli
  2. PR #1729 (@romeerp): the CaseOps lossless-case tokenizer with a byte sidecar for honest BPB

The integration is non-trivial: PR #1735's train_gpt.py did not support the byte sidecar files (fineweb_val_bytes_*.bin) that are required for honest BPB when the tokenizer adds Unicode private-use control symbols (CaseOps uses \uE001-\uE003 for Title/AllCaps/CapNext).
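
For intuition, here is a minimal, hypothetical sketch of the case-factoring idea. Only the three marker assignments come from this PR; the word-level scheme, function names, and pass-through rule are illustrative assumptions, and the actual CaseOps tokenizer in PR #1729 is more involved.

```python
# Sketch of lossless case-factoring, NOT the actual CaseOps code.
# Marker meanings (\uE001=Title, \uE002=AllCaps, \uE003=CapNext) follow the
# PR text; CAPNEXT is omitted here, and the input is assumed to contain no
# private-use characters of its own.
TITLE, ALLCAPS = "\uE001", "\uE002"

def encode_word(w: str) -> str:
    if len(w) > 1 and w.isupper():
        return ALLCAPS + w.lower()   # "NASA" -> \uE002 + "nasa"
    if w.istitle():
        return TITLE + w.lower()     # "Hello" -> \uE001 + "hello"
    return w                         # lowercase/mixed case passes through

def decode_word(w: str) -> str:
    if w.startswith(ALLCAPS):
        return w[1:].upper()
    if w.startswith(TITLE):
        return w[1:].title()
    return w

for w in ["Hello", "NASA", "hello", "McDonald"]:
    assert decode_word(encode_word(w)) == w  # round-trip is lossless
```

The payoff is that "Hello", "hello", and "HELLO" all share the byte sequence "hello" plus at most one marker, which is where the case-dedup bit savings come from.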

Implementation

Added load_validation_token_bytes() function and threaded sidecar through three eval functions:

  • eval_val()
  • eval_val_sliding()
  • eval_val_ttt()

Each prefers the val_token_bytes sidecar when available and falls back to LUT-based byte counting otherwise (zero behavior change for non-CaseOps datasets).
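
A condensed sketch of that path follows. The function name, shard pattern, and preference order come from this PR; the dtype, the concatenation, and the count_bytes helper are assumptions for illustration.

```python
import glob
import numpy as np

def load_validation_token_bytes(data_dir: str):
    """Load per-token raw-UTF-8 byte counts from sidecar shards, or None."""
    shards = sorted(glob.glob(f"{data_dir}/fineweb_val_bytes_*.bin"))
    if not shards:
        return None  # no sidecar: caller falls back to the LUT
    # uint16 dtype is an assumption; any per-token byte count fits here
    return np.concatenate([np.fromfile(s, dtype=np.uint16) for s in shards])

def count_bytes(positions, tokens, val_token_bytes, byte_lut):
    # Prefer the sidecar (honest raw bytes per token); fall back to the
    # tokenizer LUT, which is exact for non-CaseOps datasets.
    if val_token_bytes is not None:
        return int(val_token_bytes[positions].sum())
    return int(byte_lut[tokens[positions]].sum())
```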

Also fixed load_validation_tokens() to exclude _bytes_ files from glob results (prevents double-counting when sidecar shares the directory with token shards).
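
The fix itself is a one-line filter (sketch; variable names assumed):

```python
# fineweb_val_*.bin also matches fineweb_val_bytes_*.bin, so exclude the
# sidecar shards from the token glob to avoid double-loading val_tokens.
shard_paths = [p for p in sorted(glob.glob(f"{data_dir}/fineweb_val_*.bin"))
               if "_bytes_" not in p]
```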

Disabled experimental TTT EMA (separate ablation showed it hurts monotonically-decreasing TTT).

3-Seed Results

Seed    Sliding val_bpb    Artifact bytes
1337    1.03484            15,996,061
42      1.03618            15,996,195
999     1.03519            15,994,993
Mean    1.03540            15,995,750
Std     0.00057

Compliance (Issue #1017 Track A)

  • Fixed predictor: artifact is fully-quantized int6 GPTQ model; no eval-time adaptation
  • No SLOT, no RLS, no n-gram cache, no ETLB
  • Sliding-window eval: strictly causal, stride 64, single pass
  • Normalized softmax distribution
  • CaseOps preserves byte budget: byte sidecar reports raw original UTF-8 bytes per token, not control-symbol-inflated bytes (see the sketch after this list)
  • Lossless reversibility: CaseOps Title/AllCaps/CapNext encoding is fully reversible (per @romeerp PR Record: CaseOps Tokenizer + Tapered WD - val_bpb 1.0678 (3-seed mean) #1729)
  • Train < 600s (588s used)
  • Eval < 600s
  • Artifact < 16MB (15.996 MB)
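
For reference, honest BPB under the sidecar accounting reduces to the standard definition below (not code from this PR):

```python
import math

def val_bpb(total_loss_nats: float, total_raw_utf8_bytes: int) -> float:
    # Bits-per-byte: summed cross-entropy in nats, converted to bits, divided
    # by raw original UTF-8 bytes (sidecar), never by byte counts inflated
    # with the \uE001-\uE003 control symbols.
    return total_loss_nats / (math.log(2) * total_raw_utf8_bytes)
```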

Attribution

V15 contribution (this PR): non-trivial integration adding byte sidecar support to PR #1735's eval functions, enabling CaseOps tokenizer to drop in without breaking BPB accounting.

Reproduction

See README.md in records/track_10min_16mb/2026-04-19_SP8192_PreQuantTTT_CaseOps_V15/.

AjAnubolu and others added 8 commits April 19, 2026 00:46
Adds exponential moving average over the 21-epoch pre-quant AdamW TTT phase.
Final model uses EMA weights instead of last-epoch weights.

Patches pre_quant_adamw_ttt() in 4 sites:
- Add TTT_EMA_ENABLED, TTT_EMA_DECAY env vars
- Initialize ttt_ema_state dict before epoch loop
- Update EMA after each epoch all_reduce sync
- Replace model weights with EMA before final eval/quantization

Compliance: inherits PR openai#1735's status (pre-quant TTT framework).
EMA is fixed averaging, not val-loss-based selection.

Expected delta: -0.001 to -0.003 BPB
Artifact size impact: ~+1KB (negligible vs 16MB limit)
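
A hedged sketch of the epoch-level EMA this commit describes; TTT_EMA_ENABLED, TTT_EMA_DECAY, and ttt_ema_state come from the message, while the loop body, the default decay, and the helpers (train_one_epoch, all_reduce_sync) are assumptions:

```python
import os
import torch

ema_enabled = os.environ.get("TTT_EMA_ENABLED", "1") == "1"
ema_decay = float(os.environ.get("TTT_EMA_DECAY", "0.9"))  # default assumed

ttt_ema_state = None  # initialized before the epoch loop
for epoch in range(21):
    train_one_epoch(model)  # one pre-quant AdamW TTT epoch (assumed helper)
    all_reduce_sync(model)  # federated averaging across 8 GPUs (assumed helper)
    if ema_enabled:
        with torch.no_grad():
            sd = model.state_dict()
            if ttt_ema_state is None:
                ttt_ema_state = {k: v.detach().clone() for k, v in sd.items()}
            else:
                for k, v in sd.items():
                    if v.is_floating_point():
                        # fixed averaging, not val-loss-based selection
                        ttt_ema_state[k].mul_(ema_decay).add_(v, alpha=1 - ema_decay)

# Replace model weights with EMA before final eval/quantization
if ema_enabled and ttt_ema_state is not None:
    model.load_state_dict(ttt_ema_state)
```
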
Adds byte sidecar loading to enable CaseOps lossless-case tokenizer (PR openai#1729).
Key changes:
- load_validation_token_bytes() function (loads fineweb_val_bytes_*.bin)
- ValidationData.val_token_bytes field with sidecar fallback to LUT
- eval_val/eval_val_sliding/eval_val_ttt prefer sidecar when available
- TTT_EMA_ENABLED default 1 -> 0 (V14 lesson: EMA hurts monotonic-decrease TTT)

V14 EMA result: 1.0427 (worse than baseline due to monotonic TTT loss).
V15 hypothesis: CaseOps gives -0.005 to -0.012 BPB by saving bits via case dedup,
landing in 1.030-1.038 range (50% chance of breaking record at 1.0357).

Pattern fineweb_val_*.bin was matching both fineweb_val_000000.bin (real
val tokens) and fineweb_val_bytes_000000.bin (sidecar). This double-loaded
val_tokens and made the sidecar appear only half as long as the token stream.

Fix: filter out '_bytes_' from glob results in load_validation_tokens.
load_validation_token_bytes is unaffected (it uses _bytes_ pattern explicitly).
@alertcat
Author

Acknowledging the tight margin (0.00030 BPB) over the 0.0072 BPB record threshold: with a 3-seed std of 0.00057, the result is only borderline statistically significant. Happy to run additional seeds if reviewers prefer a larger sample for confidence.
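
To quantify that caveat, a one-sample t-test of the three seeds against the threshold value 1.03570 (= 1.0429 - 0.0072) looks like this:

```python
from scipy.stats import ttest_1samp

seeds = [1.03484, 1.03618, 1.03519]
t, p = ttest_1samp(seeds, popmean=1.03570)  # df = 2
print(t, p)  # t ≈ -0.74, two-sided p ≈ 0.54
```

The mean clears the threshold, but three seeds cannot establish that with confidence, hence the offer of extra runs.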

sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 19, 2026
…ad, MP-SGD TTT 4-phase

- PR openai#1698 (GDN FLA, claimed 1.00995): BPB bug confirmed by dexhunter
  (~1.189 actual) + artifact size violation; effectively dead
- New technique: CaseOps bijective tokenizer (PR openai#1729/openai#1736/openai#1738) —
  reversible case-factoring with byte sidecar; stronger legality than
  casefold; await Issue openai#1604 ruling
- PR openai#1735 (pre-quant TTT 21ep) flagged illegal by dexhunter; PR openai#1738
  builds on it, both likely void
- PR openai#1727 (MP-SGD TTT 4 phases, 1.07217): appears legal, stackable
- Merged SOTA 1.0810 Day 10 plateau; 11 days to deadline

https://claude.ai/code/session_012mo6412sGQRVjF7TDmfx31
kilojoules added a commit to kilojoules/parameter-golf that referenced this pull request Apr 21, 2026
PR openai#1738's packed train_gpt.py crashes on pytorch 2.5.1 with 'FlashAttention
only supports fp16, bf16, and fp8_e4m3' because q/k/v can arrive as fp32 after
torch.compile passes. Replace with packed variant that includes the bf16 cast
around flash_attn_3_func. Same byte_count category (~25 KB), no WD_TAPER,
same functional code path. Matches the binary actually used to produce the
train_seed43/44/45 logs. Artifact stays under 16 MB.
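
The cast the commit describes is roughly the following (sketch; flash_attn_3_func's exact signature and import path are assumptions):

```python
import torch
# from flash_attn_interface import flash_attn_3_func  # import path assumed

def attn_with_bf16_guard(q, k, v, causal=True):
    # torch.compile passes can leave q/k/v in fp32, which FlashAttention
    # rejects ("FlashAttention only supports fp16, bf16, and fp8_e4m3"),
    # so cast to bf16 right before the kernel call.
    if q.dtype == torch.float32:
        q, k, v = (t.to(torch.bfloat16) for t in (q, k, v))
    return flash_attn_3_func(q, k, v, causal=causal)
```
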
kilojoules added a commit to kilojoules/parameter-golf that referenced this pull request Apr 21, 2026
….00025 (p<0.001)

Previous logs came from a functionally-equivalent packed variant that differed
by 37 bytes (contained dead WD_TAPER code). This commit replaces all three
seed logs with runs produced by the exact train_gpt.py committed in this folder
(Code size: 24,893 bytes in all logs). New stats:

- seed 43: 1.02846 (15,999,201 bytes)
- seed 44: 1.02812 (15,993,435 bytes)
- seed 45: 1.02861 (15,999,551 bytes)
- mean 1.02840, std 0.00025
- t-test vs 1.03040: t=13.8, df=2, p<0.001
- Delta vs PR openai#1738 = 0.00700 nats

Mean shifted up 0.00073 vs earlier logs (different vast.ai machine, same
pytorch 2.5.1+cu124) but std halved, so statistical confidence is stronger.