Record: V21 + N-gram Tilt + LeakyReLU 0.3 — val_bpb 1.05851 (3-seed mean)#1967
Record: V21 + N-gram Tilt + LeakyReLU 0.3 — val_bpb 1.05851 (3-seed mean)#1967ndokutovich wants to merge 2 commits intoopenai:mainfrom
Conversation
3-seed mean val_bpb = 1.05851479 (std 0.000762, seeds 42/0/1234) on track_10min_16mb. Stack: - PR openai#1945 (alertcat) V21 base = PR openai#1908 + AWQ-Lite + AsymLogit Rescale - PR openai#1953 (andrewbaggio1) TTT/QK env knobs (TTT_LR=0.75, QK_GAIN=5.25, no_qv mask) - PR openai#1948 (TimS-ml + lijuncheng16) LeakyReLU squared slope 0.3 - PR openai#1145 (AnirudhRahul, valerio-endorsed) closed-form n-gram tilt with Σ P=1 Z renormalization Compliance: causal hints, single-pass, Σ P=1 by construction, no SLOT, no n-gram cache, no Pre-Quant TTT. System deps: gcc + lrzip auto-installed by setup.sh; PyTorch 2.9.1 + Triton + Flash Attn 3. One-command reproduction: bash setup.sh SEED={42,0,1234} bash run.sh
|
Hi @ndokutovich — thanks for the careful refactor of @AnirudhRahul's closed-form n-gram tilt on V21. Quick technical note for the community thread that I think is worth flagging before reviewer pass. The submission ships The README's "What are the restrictions on evaluation?" section and Issue #1017 §V do not explicitly address whether a per-shard precompute that reads val tokens and feeds an eval-time mechanism counts toward the 600s eval budget. There seem to be two coherent interpretations:
The mechanic itself looks structurally fine: closed-form This seems worth raising in Issue #677 (or pinging an organizer here) for an explicit ruling. If interpretation 1 is intended, this PR is useful precedent for eval setup accounting; if interpretation 2 is intended, downstream PRs that re-use this lineage should include the precompute elapsed time in eval ops. Happy to discuss if I've misread the precompute boundary. |
…006 passes 0.25 cutoff
|
@dexhunter — thanks for the careful audit and for explicitly working through the C1–C4 mechanics. Really appreciate the structural validation. Agreed that the timing-accounting boundary is the load-bearing question and that an explicit ruling in #677 is the right venue. To make this more visible to reviewers, I just pushed an update to the README that adds an explicit Welch's two-sample t-test section vs the merged top row (PR #1855), per the chronological-frontier policy adopted in PR #1902 — that's an orthogonal compliance point, but it documents the headline statistical case in the same place. On the timing interpretation, our position is interpretation 1 (setup analog), with the following reasoning:
In your framing of interpretation 2 (per-position precondition for scoring), the same logic would apply to model weight dequantization — every scored token requires the dequantized weights, but the dequant pass is universally treated as setup and excluded from eval ops. We read that as the established precedent for "compute that materializes a static, position-indexed structure used by scoring" being setup-side. That said — fully agree this should be ruled on rather than asserted. Happy to open or co-sign an Issue #677 thread; would be useful for the broader leaderboard to have an explicit wording on which "feeds-eval-but-is-position-static-causal" structures sit on which side of the timer. For my own submission, if interpretation 2 is the intended reading, I'll re-run with `NGRAM_HINT_PRECOMPUTE_OUTSIDE=0` and report the strict-accounting numbers — the val_bpb is unchanged, but the per-seed eval ops would land around 743s, which would make this submission non-conforming under that reading. Thanks again for raising this cleanly. |
|
Independent reproduction at seed=42 (8×H100 SXM,
Final eval ops time = 581.6s (within 600s budget). Phase 1=241s, Phase 2=185s, Phase 3=42s; N-gram tilt precompute 172s outside-timer. Stack worked exactly as documented — The −0.0009 BPB delta vs your reported s42 is within seed/env noise (your 3-seed std 0.000762; container/package version differences likely account for it). Final number sits ~+0.00064 above the record threshold (PR #1855 1.06108 − 0.005 = 1.0561), so single-seed reproduction by itself doesn't cross the bar — but it does confirm the V21 + N-gram Tilt + LeakyReLU 0.3 stack is mechanically reproducible from the recipe in this PR. Posting for the maintainer review queue. cc @cocohearts. Thanks for the clean recipe. |
… breakthrough NULL/NEUTRAL RESULTS (within ±0.0005 noise): - S37 GPTQ_BATCHES=32: 1.05884 (null) - S38 TTT_BETA2=0.995: 1.05884 (null) - S44 GLOBAL_TTT_LR=0.01: 1.05913 (within noise) - S46 GLOBAL_TTT_EPOCHS=2: 1.05902 (null) NEGATIVE RESULTS: - S36 lzma compressor: rejected - S36v2 LQER_TOP_K=2: 1.05912 - S41 openai#1965 bundle: 1.05916 - S42 LQER 8/5 + EMA 0.997: 1.05912 (EMA contaminated) - S43 LQER 8/5 isolated: 1.05925 - S52 LeakyReLU 0.3: 1.05977 (PR openai#1948 doesn't transfer to PR openai#1797) - S53 WARMDOWN_FRAC=0.95 + MIN_LR=0.05: 1.05950 (best pre-quant 1.06061 but bigger quant tax) INFRASTRUCTURE FIXES: - S39 lrzip -k flag bug, S40 SSH disconnect, S45 NCCL crash - S47/S49/S51 LeakyReLU integration bugs BREAKTHROUGH: - S54 n-gram tilt port from PR openai#1145/openai#1967: 1.05692 single seed (seed 314) - Pre-quant: 1.06057, Quantized: 1.06917, Final: 1.05692 - Eval: 503.4s under 600s cap, Size: 15,944,666 bytes under 16MB cap - Hint precompute outside timer: 173s (legal path) - Mode B with fused_log_softmax_dual_gather kernel - Hints fired on 13M of 47M tokens (27%) - Delta from current-env baseline: -0.00208 BPB Validating seeds 42, 1234 next.
|
@yaowubarbara — thank you for the independent reproduction. Really appreciate you running this through and posting the per-stage breakdown. Two notes: 1. Reproducibility evidence base. Following the precedent set by PR #1855 (where @okezue's 3-seed reproduction was combined with the original 3-seed run for a 6-sample Welch test in PR #1902 by @cocohearts), your seed=42 result extends the available evidence. Combined view:
4-sample mean 1.05807, sample std (n−1) ~0.00103. Welch's two-sample t-test vs PR #1855 (n=6, mean 1.060755, std 0.000933): ``` Passes the p < 0.25 progression cutoff (PR #1902) by ~96× margin. 2. On the "1.0561 bar" framing. The 0.005-BPB-margin formulation pre-dates the chronological-frontier policy that @cocohearts adopted in PR #1902. Under the current README policy (one-sided Welch's t-test, p < 0.25 vs the previous frontier row, with the statistical-significance requirement waived for systems-only progressions), the threshold is statistical-significance-based rather than absolute-margin-based. The 1.0585 → 1.0561 single-seed gap you flagged is below 1×std for a single-seed comparison and disappears at 3+ seeds against the merged top. Thanks again for the clean reproduction — happy to incorporate it into the README evidence table per the #1902 precedent if that's useful. cc @cocohearts |
… competition closed - Merged SOTA dropped from 1.0810 → 1.0611 (codemath3000, PR openai#1855) with all organizer pending branches now in main (CaseOps + SmearGate BOS fix + lrzip) - New target was ≤1.0561; competition closes today (April 30) - PR openai#1967 (ndokutovich, 1.05851): best clean legal open PR, timing question pending - PR openai#1991 (joshuaswanson, 0.94290): Byte-PPM Mixer; Issue openai#1872 open, no ruling - PR openai#1992 / openai#1972: ILLEGAL (PreQuantTTT 21ep) - PR openai#731 (Hedge Mixer, 1.0400): seeds 1337/2024 never filed; competition closing - Session 25 lessons + final Competition Strategy update added to CLAUDE.md https://claude.ai/code/session_01QKHz6Vfu2DFZdc7GiuKSBQ
…y -0.00092 Per-seed: - seed 314: 1.05692 (eval 503.4s, size 15,944,666) - seed 42: 1.05738 (eval 494.7s, size 15,949,464) - seed 1234: 1.05846 (eval 396.5s, size TBD) Mean: 1.05759, std: 0.000651 Beats current SOTA openai#1967 by 0.00092 BPB. All 3 seeds compliant: train ≤ 600s, eval ≤ 600s, artifact ≤ 16MB.
Final attempt to overtake PR openai#1953 (1.05855) and PR openai#1967 (1.05851). Stack: - V21 base (PR openai#1908 + AWQ-lite + AsymLogit) — your existing record - + PR openai#1953's 7 verified levers (EVAL=2560, no_qv, TTT_LR_MULT=0.75, QK_GAIN=5.25) - + EVAL_SEQ_LEN=2816 (intermediate safe value, ~5% eval timing risk) - All other hparams identical to V21 Safety: EVAL_SEQ_LEN=2816 vs PR openai#1953's 2560 = ~10% eval time penalty. Expected eval times: 470s/485s/564s (PR openai#1953 was 430/441/513). Seed 1234 has thinnest margin (564s of 600s cap = 36s buffer). Expected V22 BPB: 1.0578-1.0586 (3-seed mean) P(beat PR openai#1953 1.05855): ~50% P(beat PR openai#1967 1.05851): ~30-35% (timing-pending PR ahead)
|
On the timer-boundary question raised by @dexhunter — there's an existing merged precedent that addresses this exact compute pattern.
The A2 record explicitly accounts ~33s of n-gram precompute inside the 600s eval budget, on the same family of mechanism ( Reading interpretation #1 ("setup analog") through that precedent: model deserialization, Per ndokutovich's own admission (#1967 comment 4354356627), This shouldn't need a fresh #677 ruling — A2 already covers it. Happy to be shown a contrary precedent. |
|
I believe that neither this PR nor #2018's n-gram is causal (C1): Both PRs ship byte-identical copies of online_ngram_state.c and online_ngram_tilt.py. In both, the Within-word expert (online_ngram_state.c, online_ngram_state_process_chunk): const uint16_t tok = tokens[i];
const uint8_t is_boundary = boundary_lut[tok];
const uint8_t is_new_word = starts_new_word_lut[tok];
...
if (!is_boundary && !is_new_word && st->within_len > 0U) {
...
within_valid[i] = 1U;
within_top_token[i] = ...;
within_top_prob[i] = ...;
}is_boundary and is_new_word are functions of tokens[i]. They gate whether within_valid[i] is set and whether within_gate = within_valid & (within_top_prob >= np.float32(within_tau))
within_gain = np.where(within_gate, _expected_gain(...), -np.inf)
...
hint_ids[any_gate] = hint_per_expert[rows[any_gate], best_idx[any_gate]]So the chosen hint at position t — and therefore the per-token tilted NLL via apply_tilt_to_ptl_torch — is a The same thing happens with the world-level expert and is_word_start. This is the same C1 violation diagnosed in PR #1420's kernel and acknowledged in the #1420 thread. PR #1514 Also quick empirical evidence, === POSITIONS WHERE WITHIN GATE FIRES (as shipped) === Casual experts has no way of 100% identifying continuation tokens just based on the previous tokens. |
|
Appreciate the note, @leon2k2k2k. Please see my reply in the updated PR #2018 |
…ixer Self-contained reference for byte-level NN scoring without the C1/C2 leak in PR openai#2039 / openai#1967 / openai#2018 / openai#2041. Shows ~-0.097 BPB legitimate gain on spec 250 seed_0 (1M val tokens), independent of include_space leak. Files: README, proper_ppm_mixer_rigorous.py (canonical), byte_bpb_proper.py (NN-only baseline), show_big_gains.py (inspection), test_byte0_3way.py (5-config leak validation).
Audits every CaseOps-lineage record-track PR (merged + unmerged) since 2026-04-18 for whether val docs are also in the training set. Working set: 34 PRs (31 from chronological seed list + 3 discovered ancestors: openai#1908, openai#1923, openai#2007). Boundary nodes openai#1493 / openai#1626 (pre-CaseOps). Verdicts: - CLEAN (8): openai#1729, openai#1851, openai#1868, openai#1908, openai#2019, openai#2027, openai#2031, openai#2068 - LEAK (25): openai#1736 (our research baseline) → openai#1769 → openai#1787 → openai#1797 → openai#1855 → V21 family (openai#1945, openai#1923, openai#1953, openai#1967) → openai#2018 → openai#2118 (current claimed frontier 1.04350), plus siblings. - INHERIT (1): openai#2050 (eval-only on frozen openai#1915) Code-level evidence (not README claims): - Every shipped prepare_caseops_data.py is byte-identical: SHARD_TOKENS=10_000_000, default=10_000 for --val-docs - NO PR overrides --val-docs (searched all .sh files in all 34 PRs) - cached_challenge_fineweb.py downloads from romeerp/parameter-golf-caseops-v1 HF dataset whose manifest pins docs_val=50000, docs_train=8181945, sums match → CLEAN by construction - PR openai#2018's DATASET_AUDIT.md is gold-standard explicit leak description - PR openai#2118's submission.json admits "--val-docs=10000 train shards + 50k val eval" Three signposts: - Leak introduced: PR openai#1736 by @dexhunter (Apr 19) — first prepare_caseops_data.py default invocation - Leak fixed: PR openai#1851 by @aquariouseworkman (Apr 27) — switched to HF dataset - Leak re-introduced: PR openai#1855 by @codemath3000 (same day) — rebuilt locally The merged-leaderboard SOTA (openai#1851/openai#1868 at 1.06128/1.06141) is CLEAN. The unmerged frontier (openai#2118 at 1.04350) is LEAK. The 0.018 bpb gap is inflated by val memorization; spec 301 was designed to measure how much remains under clean data. Files: caseops-memory-leakage/README.md — overview, methodology, takeaways caseops-memory-leakage/verdicts.md — 34-row master table with evidence caseops-memory-leakage/family-tree.md — ASCII trees with [C]/[L] annotations
val_bpb = 1.05851479 (3-seed mean, std 0.000762, seeds 42 / 0 / 1234) on `track_10min_16mb`.
cc @cocohearts @valerio-oai for record review.
Per-seed
All three seeds: train ≤ 600,000 ms, eval ops ≤ 600,000 ms, artifact ≤ 16,000,000 bytes.
Stack
The static n-gram hint table is built in a single L→R causal pass over val tokens during `validate()` setup (env flag `NGRAM_HINT_PRECOMPUTE_OUTSIDE=1`, default). Setting the flag to 0 reproduces the inline build path with identical val_bpb.
Compliance
Δ vs neighbors (3-seed)
System dependencies
Reproduction
```
bash setup.sh # apt + pip + Flash Attn 3 + CASEOPS data prep
SEED=42 bash run.sh
SEED=0 bash run.sh
SEED=1234 bash run.sh
```
Credits