
Record: PR #1735 + CaseOps Tokenizer V15 (val_bpb 1.03540, mean of 3 seeds) #1738

Open

alertcat wants to merge 8 commits into openai:main from alertcat:v15-pr1735-caseops

Conversation

@alertcat

Record: PR #1735 + CaseOps Tokenizer (V15) — val_bpb 1.03540

Summary

  • 3-seed mean val_bpb: 1.03540 (std 0.00057) on 8×H100 SXM
  • Improvement: −0.0075 BPB vs current SOTA (PR #1735 "Record: SP8192 + Parallel Pre-Quant TTT", @AjAnubolu, val_bpb 1.0429)
  • Beats record threshold (−0.005 nats = −0.0072 BPB) by 0.00029 BPB
  • All 3 seeds: 1.03484, 1.03618, 1.03519 — all under 1.040
  • Artifact: 15.996 MB (3.8 KB margin under 16MB limit, code LZMA-wrapped)

Innovation

Combined two unmerged frontier techniques for the first time:

  1. PR #1735 (@AjAnubolu) base stack: SP8192 + 3-Layer Recurrence + Parallel Residuals + QK-Gain 5.25 + 8-GPU Parallel Pre-Quant AdamW TTT (21 epochs, epoch-level cosine LR, federated averaging) + GPTQ SDClip + Brotli
  2. PR #1729 (@romeerp): the CaseOps lossless-case tokenizer with a byte sidecar for honest BPB

The integration is non-trivial: PR #1735's train_gpt.py did not support the byte sidecar files (fineweb_val_bytes_*.bin) that are required for honest BPB when the tokenizer adds Unicode private-use control symbols (CaseOps uses \uE001-\uE003 for Title/AllCaps/CapNext).
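
For intuition, here is a minimal, hypothetical sketch of the case-factoring idea. Only the three marker assignments come from this PR; the word-level scheme, function names, and pass-through rule are illustrative assumptions, and the actual CaseOps tokenizer in PR #1729 is more involved.

```python
# Sketch of lossless case-factoring, NOT the actual CaseOps code.
# Marker meanings (\uE001=Title, \uE002=AllCaps, \uE003=CapNext) follow the
# PR text; CAPNEXT is omitted here, and the input is assumed to contain no
# private-use characters of its own.
TITLE, ALLCAPS = "\uE001", "\uE002"

def encode_word(w: str) -> str:
    if len(w) > 1 and w.isupper():
        return ALLCAPS + w.lower()   # "NASA" -> \uE002 + "nasa"
    if w.istitle():
        return TITLE + w.lower()     # "Hello" -> \uE001 + "hello"
    return w                         # lowercase/mixed case passes through

def decode_word(w: str) -> str:
    if w.startswith(ALLCAPS):
        return w[1:].upper()
    if w.startswith(TITLE):
        return w[1:].title()
    return w

for w in ["Hello", "NASA", "hello", "McDonald"]:
    assert decode_word(encode_word(w)) == w  # round-trip is lossless
```

The payoff is that "Hello", "hello", and "HELLO" all share the byte sequence "hello" plus at most one marker, which is where the case-dedup bit savings come from.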

Implementation

Added load_validation_token_bytes() function and threaded sidecar through three eval functions:

  • eval_val()
  • eval_val_sliding()
  • eval_val_ttt()

Each prefers the val_token_bytes sidecar when available and falls back to LUT-based byte counting otherwise (zero behavior change for non-CaseOps datasets).
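
A condensed sketch of that path follows. The function name, shard pattern, and preference order come from this PR; the dtype, the concatenation, and the count_bytes helper are assumptions for illustration.

```python
import glob
import numpy as np

def load_validation_token_bytes(data_dir: str):
    """Load per-token raw-UTF-8 byte counts from sidecar shards, or None."""
    shards = sorted(glob.glob(f"{data_dir}/fineweb_val_bytes_*.bin"))
    if not shards:
        return None  # no sidecar: caller falls back to the LUT
    # uint16 dtype is an assumption; any per-token byte count fits here
    return np.concatenate([np.fromfile(s, dtype=np.uint16) for s in shards])

def count_bytes(positions, tokens, val_token_bytes, byte_lut):
    # Prefer the sidecar (honest raw bytes per token); fall back to the
    # tokenizer LUT, which is exact for non-CaseOps datasets.
    if val_token_bytes is not None:
        return int(val_token_bytes[positions].sum())
    return int(byte_lut[tokens[positions]].sum())
```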

Also fixed load_validation_tokens() to exclude _bytes_ files from glob results (prevents double-counting when sidecar shares the directory with token shards).
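
The fix itself is a one-line filter (sketch; variable names assumed):

```python
# fineweb_val_*.bin also matches fineweb_val_bytes_*.bin, so exclude the
# sidecar shards from the token glob to avoid double-loading val_tokens.
shard_paths = [p for p in sorted(glob.glob(f"{data_dir}/fineweb_val_*.bin"))
               if "_bytes_" not in p]
```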

Disabled experimental TTT EMA (separate ablation showed it hurts monotonically-decreasing TTT).

3-Seed Results

Seed    Sliding val_bpb    Artifact bytes
1337    1.03484            15,996,061
42      1.03618            15,996,195
999     1.03519            15,994,993
Mean    1.03540            15,995,750
Std     0.00057

Compliance (Issue #1017 Track A)

  • Fixed predictor: artifact is fully-quantized int6 GPTQ model; no eval-time adaptation
  • No SLOT, no RLS, no n-gram cache, no ETLB
  • Sliding-window eval: strictly causal, stride 64, single pass
  • Normalized softmax distribution
  • CaseOps preserves byte budget: byte sidecar reports raw original UTF-8 bytes per token, not control-symbol-inflated bytes (see the sketch after this list)
  • Lossless reversibility: CaseOps Title/AllCaps/CapNext encoding is fully reversible (per @romeerp PR Record: CaseOps Tokenizer + Tapered WD - val_bpb 1.0678 (3-seed mean) #1729)
  • Train < 600s (588s used)
  • Eval < 600s
  • Artifact < 16MB (15.996 MB)
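
For reference, honest BPB under the sidecar accounting reduces to the standard definition below (not code from this PR):

```python
import math

def val_bpb(total_loss_nats: float, total_raw_utf8_bytes: int) -> float:
    # Bits-per-byte: summed cross-entropy in nats, converted to bits, divided
    # by raw original UTF-8 bytes (sidecar), never by byte counts inflated
    # with the \uE001-\uE003 control symbols.
    return total_loss_nats / (math.log(2) * total_raw_utf8_bytes)
```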

Attribution

V15 contribution (this PR): non-trivial integration adding byte sidecar support to PR #1735's eval functions, enabling CaseOps tokenizer to drop in without breaking BPB accounting.

Reproduction

See README.md in records/track_10min_16mb/2026-04-19_SP8192_PreQuantTTT_CaseOps_V15/.

AjAnubolu and others added 8 commits April 19, 2026 00:46
Adds exponential moving average over the 21-epoch pre-quant AdamW TTT phase.
Final model uses EMA weights instead of last-epoch weights.

Patches pre_quant_adamw_ttt() in 4 sites:
- Add TTT_EMA_ENABLED, TTT_EMA_DECAY env vars
- Initialize ttt_ema_state dict before epoch loop
- Update EMA after each epoch all_reduce sync
- Replace model weights with EMA before final eval/quantization

Compliance: inherits PR openai#1735's status (pre-quant TTT framework).
EMA is fixed averaging, not val-loss-based selection.

Expected delta: -0.001 to -0.003 BPB
Artifact size impact: ~+1KB (negligible vs 16MB limit)
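
A hedged sketch of the epoch-level EMA this commit describes; TTT_EMA_ENABLED, TTT_EMA_DECAY, and ttt_ema_state come from the message, while the loop body, the default decay, and the helpers (train_one_epoch, all_reduce_sync) are assumptions:

```python
import os
import torch

ema_enabled = os.environ.get("TTT_EMA_ENABLED", "1") == "1"
ema_decay = float(os.environ.get("TTT_EMA_DECAY", "0.9"))  # default assumed

ttt_ema_state = None  # initialized before the epoch loop
for epoch in range(21):
    train_one_epoch(model)  # one pre-quant AdamW TTT epoch (assumed helper)
    all_reduce_sync(model)  # federated averaging across 8 GPUs (assumed helper)
    if ema_enabled:
        with torch.no_grad():
            sd = model.state_dict()
            if ttt_ema_state is None:
                ttt_ema_state = {k: v.detach().clone() for k, v in sd.items()}
            else:
                for k, v in sd.items():
                    if v.is_floating_point():
                        # fixed averaging, not val-loss-based selection
                        ttt_ema_state[k].mul_(ema_decay).add_(v, alpha=1 - ema_decay)

# Replace model weights with EMA before final eval/quantization
if ema_enabled and ttt_ema_state is not None:
    model.load_state_dict(ttt_ema_state)
```
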
Adds byte sidecar loading to enable CaseOps lossless-case tokenizer (PR openai#1729).
Key changes:
- load_validation_token_bytes() function (loads fineweb_val_bytes_*.bin)
- ValidationData.val_token_bytes field with sidecar fallback to LUT
- eval_val/eval_val_sliding/eval_val_ttt prefer sidecar when available
- TTT_EMA_ENABLED default 1 -> 0 (V14 lesson: EMA hurts monotonic-decrease TTT)

V14 EMA result: 1.0427 (worse than baseline due to monotonic TTT loss).
V15 hypothesis: CaseOps gives -0.005 to -0.012 BPB by saving bits via case dedup,
landing in 1.030-1.038 range (50% chance of breaking record at 1.0357).

Pattern fineweb_val_*.bin was matching both fineweb_val_000000.bin (real
val tokens) and fineweb_val_bytes_000000.bin (sidecar). This double-loaded
val_tokens and made the sidecar appear only half as long as the token stream.

Fix: filter out '_bytes_' from glob results in load_validation_tokens.
load_validation_token_bytes is unaffected (it uses _bytes_ pattern explicitly).
@alertcat
Author

Acknowledging the tight margin (0.00030 BPB) over the 0.0072 BPB record threshold: with a 3-seed std of 0.00057, the result is only borderline statistically significant. Happy to run additional seeds if reviewers prefer a larger sample for confidence.
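
To quantify that caveat, a one-sample t-test of the three seeds against the threshold value 1.03570 (= 1.0429 - 0.0072) looks like this:

```python
from scipy.stats import ttest_1samp

seeds = [1.03484, 1.03618, 1.03519]
t, p = ttest_1samp(seeds, popmean=1.03570)  # df = 2
print(t, p)  # t ≈ -0.74, two-sided p ≈ 0.54
```

The mean clears the threshold, but three seeds cannot establish that with confidence, hence the offer of extra runs.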

sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 19, 2026
…ad, MP-SGD TTT 4-phase

- PR openai#1698 (GDN FLA, claimed 1.00995): BPB bug confirmed by dexhunter
  (~1.189 actual) + artifact size violation; effectively dead
- New technique: CaseOps bijective tokenizer (PR openai#1729/openai#1736/openai#1738) —
  reversible case-factoring with byte sidecar; stronger legality than
  casefold; await Issue openai#1604 ruling
- PR openai#1735 (pre-quant TTT 21ep) flagged illegal by dexhunter; PR openai#1738
  builds on it, both likely void
- PR openai#1727 (MP-SGD TTT 4 phases, 1.07217): appears legal, stackable
- Merged SOTA 1.0810 Day 10 plateau; 11 days to deadline

https://claude.ai/code/session_012mo6412sGQRVjF7TDmfx31
kilojoules added a commit to kilojoules/parameter-golf that referenced this pull request Apr 21, 2026
PR openai#1738's packed train_gpt.py crashes on pytorch 2.5.1 with 'FlashAttention
only supports fp16, bf16, and fp8_e4m3' because q/k/v can arrive as fp32 after
torch.compile passes. Replace with packed variant that includes the bf16 cast
around flash_attn_3_func. Same byte_count category (~25 KB), no WD_TAPER,
same functional code path. Matches the binary actually used to produce the
train_seed43/44/45 logs. Artifact stays under 16 MB.
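
The cast the commit describes is roughly the following (sketch; flash_attn_3_func's exact signature and import path are assumptions):

```python
import torch
# from flash_attn_interface import flash_attn_3_func  # import path assumed

def attn_with_bf16_guard(q, k, v, causal=True):
    # torch.compile passes can leave q/k/v in fp32, which FlashAttention
    # rejects ("FlashAttention only supports fp16, bf16, and fp8_e4m3"),
    # so cast to bf16 right before the kernel call.
    if q.dtype == torch.float32:
        q, k, v = (t.to(torch.bfloat16) for t in (q, k, v))
    return flash_attn_3_func(q, k, v, causal=causal)
```
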
kilojoules added a commit to kilojoules/parameter-golf that referenced this pull request Apr 21, 2026
….00025 (p<0.001)

Previous logs came from a functionally-equivalent packed variant that differed
by 37 bytes (contained dead WD_TAPER code). This commit replaces all three
seed logs with runs produced by the exact train_gpt.py committed in this folder
(Code size: 24,893 bytes in all logs). New stats:

- seed 43: 1.02846 (15,999,201 bytes)
- seed 44: 1.02812 (15,993,435 bytes)
- seed 45: 1.02861 (15,999,551 bytes)
- mean 1.02840, std 0.00025
- t-test vs 1.03040: t=13.8, df=2, p<0.001
- Delta vs PR openai#1738 = 0.00700 nats

Mean shifted up 0.00073 vs earlier logs (different vast.ai machine, same
pytorch 2.5.1+cu124) but std halved, so statistical confidence is stronger.