Record: PR #1735 + CaseOps Tokenizer V15 (val_bpb 1.03540, mean of 3 seeds) #1738
Open
alertcat wants to merge 8 commits into openai:main from
Conversation
Adds an exponential moving average over the 21-epoch pre-quant AdamW TTT phase. The final model uses EMA weights instead of last-epoch weights. Patches pre_quant_adamw_ttt() in 4 sites:
- Add TTT_EMA_ENABLED, TTT_EMA_DECAY env vars
- Initialize the ttt_ema_state dict before the epoch loop
- Update the EMA after each epoch's all_reduce sync
- Replace model weights with the EMA before final eval/quantization
Compliance: inherits PR openai#1735's status (pre-quant TTT framework). The EMA is fixed averaging, not val-loss-based selection.
Expected delta: -0.001 to -0.003 BPB. Artifact size impact: ~+1 KB (negligible vs the 16 MB limit).
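A minimal sketch of the EMA bookkeeping described in this commit, assuming a plain PyTorch parameter loop. The TTT_EMA_ENABLED/TTT_EMA_DECAY names and the "init before the loop, update after each synced epoch, swap in before final eval" ordering come from the commit; `run_one_epoch`, the default decay, and the wrapper itself are illustrative:

```python
import os
import torch

# Env switches named in the commit; the default decay value here is illustrative.
TTT_EMA_ENABLED = int(os.environ.get("TTT_EMA_ENABLED", "1"))
TTT_EMA_DECAY = float(os.environ.get("TTT_EMA_DECAY", "0.9"))

def pre_quant_ttt_with_ema(model: torch.nn.Module, num_epochs: int, run_one_epoch):
    """Hypothetical wrapper around the pre-quant AdamW TTT epoch loop.

    run_one_epoch(model) stands in for one TTT epoch, including the
    all_reduce weight sync mentioned in the commit.
    """
    ema_state = None
    if TTT_EMA_ENABLED:
        # Initialize the EMA state from the current weights before the epoch loop.
        ema_state = {n: p.detach().clone().float() for n, p in model.named_parameters()}

    for _ in range(num_epochs):
        run_one_epoch(model)
        if TTT_EMA_ENABLED:
            # Update the EMA once the epoch's weights are synced across ranks.
            with torch.no_grad():
                for n, p in model.named_parameters():
                    ema_state[n].mul_(TTT_EMA_DECAY).add_(p.float(), alpha=1.0 - TTT_EMA_DECAY)

    if TTT_EMA_ENABLED:
        # Replace the live weights with the EMA weights before final eval/quantization.
        with torch.no_grad():
            for n, p in model.named_parameters():
                p.copy_(ema_state[n].to(p.dtype))
    return model
```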
Adds byte sidecar loading to enable the CaseOps lossless-case tokenizer (PR openai#1729). Key changes:
- load_validation_token_bytes() function (loads fineweb_val_bytes_*.bin)
- ValidationData.val_token_bytes field with sidecar fallback to LUT
- eval_val/eval_val_sliding/eval_val_ttt prefer the sidecar when available
- TTT_EMA_ENABLED default 1 -> 0 (V14 lesson: EMA hurts monotonic-decrease TTT)
V14 EMA result: 1.0427 (worse than baseline due to the monotonic TTT loss). V15 hypothesis: CaseOps gives -0.005 to -0.012 BPB by saving bits via case dedup, landing in the 1.030-1.038 range (50% chance of breaking the record at 1.0357).
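A rough sketch of what the sidecar loader could look like, assuming the fineweb_val_bytes_*.bin shards are flat arrays of per-token byte counts. The filename pattern and the fallback-to-LUT behavior come from the commit; the dtype and the None-return convention are assumptions:

```python
from pathlib import Path
from typing import Optional

import numpy as np

def load_validation_token_bytes(data_dir: str) -> Optional[np.ndarray]:
    """Load per-token byte counts from the fineweb_val_bytes_*.bin sidecar shards.

    Returns None when no sidecar is present, so callers can fall back to
    LUT-based byte counting (uint16 per-token counts are assumed here; the
    real on-disk format may differ).
    """
    shards = sorted(Path(data_dir).glob("fineweb_val_bytes_*.bin"))
    if not shards:
        return None
    return np.concatenate([np.fromfile(shard, dtype=np.uint16) for shard in shards])
```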
The pattern fineweb_val_*.bin matched both fineweb_val_000000.bin (real val tokens) and fineweb_val_bytes_000000.bin (the sidecar). This double-loaded val_tokens and made the sidecar appear 50% too short relative to the inflated token count. Fix: filter '_bytes_' out of the glob results in load_validation_tokens(). load_validation_token_bytes() is unaffected (it uses the _bytes_ pattern explicitly).
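The filter itself is small; a sketch assuming path-based shard discovery (the helper name here is illustrative, in the PR the filter lives inside load_validation_tokens()):

```python
from pathlib import Path

def find_val_token_shards(data_dir: str) -> list[Path]:
    """Token-shard discovery with the fix applied: the broad fineweb_val_*.bin
    pattern also matches the byte sidecar, so anything containing '_bytes_'
    is dropped before the shards are loaded as token data."""
    shards = sorted(Path(data_dir).glob("fineweb_val_*.bin"))
    return [p for p in shards if "_bytes_" not in p.name]
```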
Author
Acknowledging the tight margin (0.00030 BPB) vs the 0.0072 BPB record threshold. The 3-seed std of 0.00057 makes this borderline statistically significant. Happy to run additional seeds if reviewers prefer a larger sample size for confidence.
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request on Apr 19, 2026
…ad, MP-SGD TTT 4-phase
- PR openai#1698 (GDN FLA, claimed 1.00995): BPB bug confirmed by dexhunter (~1.189 actual) plus an artifact size violation; effectively dead
- New technique: CaseOps bijective tokenizer (PR openai#1729/openai#1736/openai#1738): reversible case-factoring with a byte sidecar; stronger legality than casefold; awaiting the Issue openai#1604 ruling
- PR openai#1735 (pre-quant TTT, 21 epochs) flagged illegal by dexhunter; PR openai#1738 builds on it, so both are likely void
- PR openai#1727 (MP-SGD TTT, 4 phases, 1.07217): appears legal, stackable
- Merged SOTA 1.0810 at the Day 10 plateau; 11 days to deadline
https://claude.ai/code/session_012mo6412sGQRVjF7TDmfx31
kilojoules added a commit to kilojoules/parameter-golf that referenced this pull request on Apr 21, 2026
PR openai#1738's packed train_gpt.py crashes on PyTorch 2.5.1 with 'FlashAttention only supports fp16, bf16, and fp8_e4m3' because q/k/v can arrive as fp32 after torch.compile passes. Replace it with a packed variant that includes the bf16 cast around flash_attn_3_func. Same byte_count category (~25 KB), no WD_TAPER, same functional code path. Matches the binary actually used to produce the train_seed43/44/45 logs. The artifact stays under 16 MB.
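A hedged sketch of the workaround described here. flash_attn_3_func is the name used in the commit, but its import path and keyword arguments are not shown, so the attention kernel is passed in rather than imported:

```python
import torch

def call_flash_attn_bf16(flash_attn_3_func, q, k, v, **kwargs):
    """torch.compile can leave q/k/v in fp32, which FlashAttention rejects
    ('FlashAttention only supports fp16, bf16, and fp8_e4m3'), so cast the
    inputs to bf16 right at the call site before invoking the kernel."""
    if q.dtype not in (torch.float16, torch.bfloat16):
        q, k, v = (t.to(torch.bfloat16) for t in (q, k, v))
    return flash_attn_3_func(q, k, v, **kwargs)
```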
kilojoules added a commit to kilojoules/parameter-golf that referenced this pull request on Apr 21, 2026
….00025 (p<0.001)
Previous logs came from a functionally equivalent packed variant that differed by 37 bytes (it contained dead WD_TAPER code). This commit replaces all three seed logs with runs produced by the exact train_gpt.py committed in this folder (code size: 24,893 bytes in all logs). New stats:
- seed 43: 1.02846 (15,999,201 bytes)
- seed 44: 1.02812 (15,993,435 bytes)
- seed 45: 1.02861 (15,999,551 bytes)
- mean 1.02840, std 0.00025
- t-test vs 1.03040: t=13.8, df=2, p<0.001
- delta vs PR openai#1738: 0.00700 BPB
The mean shifted up 0.00073 vs the earlier logs (different vast.ai machine, same PyTorch 2.5.1+cu124), but the std halved, so the statistical confidence is stronger.
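The summary statistics can be checked from the three per-seed values with a one-sample t-test; a small verification snippet, assuming SciPy is available (the 1.03040 reference value is taken verbatim from the commit):

```python
import numpy as np
from scipy import stats

# Per-seed val_bpb values from the commit message.
seed_bpb = np.array([1.02846, 1.02812, 1.02861])
print(seed_bpb.mean(), seed_bpb.std(ddof=1))  # roughly 1.02840 and 0.00025

# One-sample t-test against the reference value used in the commit (df = n - 1 = 2).
t_stat, p_value = stats.ttest_1samp(seed_bpb, popmean=1.03040)
print(t_stat, p_value)  # t comes out near -13.8
```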
Record: PR #1735 + CaseOps Tokenizer (V15) — val_bpb 1.03540
Summary
Innovation
Combined two unmerged frontier techniques for the first time:
- PR #1735's pre-quant AdamW TTT (21-epoch test-time training before quantization)
- the CaseOps lossless-case tokenizer with its byte sidecar (PR #1729)
The integration is non-trivial: PR #1735's train_gpt.py did not support the byte sidecar files (fineweb_val_bytes_*.bin) that are required for honest BPB when using a tokenizer that adds Unicode private-use control symbols (CaseOps uses \uE001-\uE003 for Title/AllCaps/CapNext).
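For intuition only, a toy sketch of reversible case-factoring with private-use markers; the real CaseOps rules live in PR #1729 and will differ, and only the three code points above are taken from this PR:

```python
# Toy illustration, not the CaseOps implementation: lowercase each word and
# record its original casing with a private-use marker, so the mapping stays invertible.
TITLE, ALLCAPS, CAPNEXT = "\uE001", "\uE002", "\uE003"

def factor_word(word: str) -> str:
    if len(word) > 1 and word.isupper():
        return ALLCAPS + word.lower()
    if word[:1].isupper() and word[1:].islower():
        return TITLE + word.lower()
    if word.islower() or not word.isalpha():
        return word
    # Mixed case: mark each uppercase letter individually.
    return "".join(CAPNEXT + c.lower() if c.isupper() else c for c in word)

def restore_word(word: str) -> str:
    if word.startswith(ALLCAPS):
        return word[1:].upper()
    if word.startswith(TITLE):
        return word[1:2].upper() + word[2:]
    out, cap = [], False
    for c in word:
        if c == CAPNEXT:
            cap = True
            continue
        out.append(c.upper() if cap else c)
        cap = False
    return "".join(out)

for w in ["Hello", "WORLD", "McDonald", "plain"]:
    assert restore_word(factor_word(w)) == w
```

The point of factoring the case out is that lowercased word forms collapse onto shared tokens, which is where the case-dedup bit savings claimed above would come from; the byte sidecar then keeps BPB accounting honest despite the extra marker symbols.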
Implementation

Added a load_validation_token_bytes() function and threaded the sidecar through three eval functions:
- eval_val()
- eval_val_sliding()
- eval_val_ttt()
Each prefers the val_token_bytes sidecar when available and falls back to LUT-based byte counting otherwise (zero behavior change for non-CaseOps datasets), as sketched below.

Also fixed load_validation_tokens() to exclude _bytes_ files from glob results (prevents double-counting when the sidecar shares the directory with the token shards).

Disabled the experimental TTT EMA (a separate ablation showed it hurts monotonically-decreasing TTT).
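A rough sketch of that preference logic inside the eval paths, assuming per-token losses in nats and a per-token-id byte LUT as the fallback; all names here are illustrative, not the functions in train_gpt.py:

```python
import math
from typing import Optional

import numpy as np

def val_bpb(per_token_nats: np.ndarray,
            token_ids: np.ndarray,
            token_byte_lut: np.ndarray,
            val_token_bytes: Optional[np.ndarray] = None) -> float:
    """Bits per byte over the validation stream.

    Prefer the byte sidecar (exact per-token byte counts, which matter once
    CaseOps markers change how many raw bytes a token stands for); fall back
    to the LUT keyed by token id when no sidecar was loaded, leaving
    non-CaseOps datasets unchanged.
    """
    byte_counts = val_token_bytes if val_token_bytes is not None else token_byte_lut[token_ids]
    total_bits = per_token_nats.sum() / math.log(2)
    return float(total_bits / byte_counts.sum())
```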
3-Seed Results
Mean val_bpb 1.03540 across 3 seeds (std 0.00057).
Compliance (Issue #1017 Track A)
Attribution
V15 contribution (this PR): non-trivial integration adding byte sidecar support to PR #1735's eval functions, enabling CaseOps tokenizer to drop in without breaking BPB accounting.
Reproduction
See the README.md in records/track_10min_16mb/2026-04-19_SP8192_PreQuantTTT_CaseOps_V15/.