
Record: PR #1945 base + 2560 long-context + no_qv TTT mask + TTT LR 0.75 + QK_GAIN 5.25 — val_bpb 1.05855 (3-seed mean)#1953

Open
andrewbaggio1 wants to merge 2 commits into openai:main from andrewbaggio1:record/longctx-noqv-qk525-on-1945-1.0586

Conversation

@andrewbaggio1

Record: PR #1945 base + 2560 long-context + no_qv TTT mask + TTT LR 0.75 + QK_GAIN 5.25 (val_bpb 1.05855)

val_bpb = 1.05855370 (3-seed mean, std 0.00029539) | max artifact 15,992,914 B | 8x H100 SXM | 600s train / 600s eval

Stacks four small, individually validated levers on the exact PR #1945 alertcat V21 record source (which is itself PR #1855 + PR #1908 AWQ-lite + PR #1923 Asymmetric Logit Rescale). Each lever was already measured on prior bases. The contribution here is the orthogonal stack and the production verification.

3-seed Results

| Seed | Stop step | Train ms | Pre-quant BPB | Quant no-TTT BPB | Post-TTT BPB | Eval s | Artifact bytes |
|------|-----------|----------|---------------|------------------|--------------|--------|----------------|
| 42   | 4895 | 595955 | 1.06163175 | 1.06993750 | 1.05824720 | 430.0 | 15,988,861 |
| 0    | 4896 | 596123 | 1.06196584 | 1.07029420 | 1.05846113 | 441.5 | 15,988,757 |
| 1234 | 4916 | 596130 | 1.06199757 | 1.07068689 | 1.05895276 | 513.1 | 15,992,914 |
| Mean | 4902 | 596069 | 1.06186505 | 1.07030620 | 1.05855370 | 461.5 | 15,990,177 |

Population std on final BPB: 0.00029539.

vs current rank 1 (PR #1855 at 1.06108): -0.00253 BPB.
vs PR #1945 reported mean (1.05943381): -0.00088 BPB.
vs merge bar (1.05893): -0.00038 BPB.
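The headline mean, population std, and the three deltas can be cross-checked directly from the per-seed table (a quick verification sketch; every input is a figure quoted above):

```python
import statistics

# Post-TTT BPB per seed, from the results table
post_ttt = {42: 1.05824720, 0: 1.05846113, 1234: 1.05895276}

mean = statistics.fmean(post_ttt.values())
std = statistics.pstdev(post_ttt.values())  # population std, as reported

# Deltas vs the reference points quoted above
vs_rank1 = 1.06108 - mean     # current rank 1, PR #1855
vs_1945 = 1.05943381 - mean   # PR #1945 reported mean
vs_bar = 1.05893 - mean       # merge bar
```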

All seeds clear the 600s train cap, 600s eval cap, and 16,000,000-byte artifact cap.

What changed vs PR #1945

Seven literal-constant changes, implementing four levers, on top of the exact alertcat V21 source. No new code paths, no new mechanisms, no architectural changes:

```
EVAL_SEQ_LEN          = 2560     # was 2048
TTT_EVAL_SEQ_LEN      = 2560     # was 2048
TTT_MASK              = no_qv    # was default (Q/V LoRA active)
TTT_Q_LORA            = 0        # disable Q LoRA in TTT
TTT_V_LORA            = 0        # disable V LoRA in TTT
TTT_LOCAL_LR_MULT     = 0.75     # was 1.0
QK_GAIN_INIT          = 5.25     # was 5.0
```

Everything else is verbatim PR #1945. AWQ-lite, Asymmetric Logit Rescale, CaseOps tokenizer, Polar Express NS, MIN_LR, fused softcapped CE, LQER asymmetric rank-4, sparse attention gate, BOS-fixed SmearGate, phased TTT (3 phases, 2500 prefix docs), per-group lrzip + brotli compression, GPTQ int6 + int7 embeddings.

Why each lever

Each lever was already publicly measured on a closely related base. None alone clears the merge bar. Combined on the PR #1945 base, they compose into a clearing stack.

EVAL_SEQ_LEN=2560 with TTT_MASK=no_qv: extends eval and TTT score-first context past 2048. The baseline #1855 measurement reported 2560 + no_qv at val_bpb 1.06109776 with 473.4s eval time, an improvement of about -0.00058 BPB vs the 2048 anchor. Legal under the 600s eval cap.

TTT_LOCAL_LR_MULT=0.75: scales local LoRA-TTT optimizer LR. The baseline #1855 sweep at 2560 no_qv showed 0.75 was the best multiplier in {0.50, 0.75, 1.00, 1.25, 1.50, 2.00} at val_bpb 1.06104597. Same direction holds here.

QK_GAIN_INIT=5.25: replaces the 5.0 default per-head learnable Q-gain initialization. The baseline #1855 measurement reported QK_GAIN_INIT=5.25 seed-1234 post-TTT at -0.00019364 vs 5.0. Train-time init only.
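Where exactly the gain enters is not spelled out in this PR; a toy sketch, assuming the learnable per-head gain multiplies the scaled Q·K dot product (the placement and head dim are illustrative, not taken from train_gpt.py):

```python
import math

QK_GAIN_INIT = 5.25  # the lever changed here (default was 5.0)

def qk_logit(q, k, gain=QK_GAIN_INIT):
    # Assumed placement: learnable per-head gain multiplying the
    # 1/sqrt(d) scaled dot-product attention logit.
    dot = sum(qi * ki for qi, ki in zip(q, k))
    return gain * dot / math.sqrt(len(q))
```

At init this only rescales logits; since the gain is learnable, training can move it per head, which is why the change is train-time init only.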

Asymmetric Logit Rescale via PR #1945 / PR #1923: replaces the single logit_softcap=30.0 with two learnable scalars softcap_pos and softcap_neg, trained inside Phased TTT global SGD. PR #1945 finds Asym is positive when stacked with AWQ-lite due to better TTT recovery. Initialized at the symmetric value (30.0) so eval is identity at start. Inherited from PR #1945.
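PR #1923's actual implementation is not reproduced here; a minimal scalar sketch of the idea, assuming the caps apply tanh-style as with the usual logit softcap, shows why the symmetric init makes eval start identical to the baseline:

```python
import math

def softcap(x, cap=30.0):
    # Standard symmetric logit softcap: bounded in (-cap, cap).
    return cap * math.tanh(x / cap)

def asym_softcap(x, cap_pos=30.0, cap_neg=30.0):
    # Hypothetical form: positive logits bounded by cap_pos,
    # negative logits by cap_neg (both learnable scalars in TTT).
    cap = cap_pos if x >= 0 else cap_neg
    return cap * math.tanh(x / cap)
```

At the symmetric init (30.0, 30.0) this is exactly the baseline softcap, so the rescale only diverges from it once Phased TTT global SGD moves the two scalars apart.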

AWQ-lite mixed precision via PR #1945 / PR #1908: during GPTQ calibration, collect activation RMS per layer, select the most-salient 64-column group, keep that group at int8 inside the GPTQ solve. Inherited from PR #1945.
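The group-selection step can be sketched in a few lines (illustrative only; the 64-column group size and int8 promotion are from the description above, while the helper name and scoring detail are assumptions):

```python
def select_salient_group(act_rms, group_size=64):
    """Pick the column group with the highest mean activation RMS.

    act_rms: per-column activation RMS for one layer, collected during
    GPTQ calibration. Returns the index of the group to keep at int8
    inside the GPTQ solve; all other groups stay at the base precision.
    """
    groups = [act_rms[i:i + group_size]
              for i in range(0, len(act_rms), group_size)]
    scores = [sum(g) / len(g) for g in groups]
    return max(range(len(scores)), key=scores.__getitem__)
```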

Compliance (Issue #1017)

Reproduction

```bash
SEED=42 \
DATA_PATH=./data/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved \
TOKENIZER_PATH=./data/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model \
CASEOPS_ENABLED=1 \
VOCAB_SIZE=8192 \
ITERATIONS=20000 \
MAX_WALLCLOCK_SECONDS=600 \
TTT_ENABLED=1 \
PHASED_TTT_ENABLED=1 \
PHASED_TTT_NUM_PHASES=3 \
PHASED_TTT_PREFIX_DOCS=2500 \
TTT_LORA_RANK=80 \
TTT_MASK=no_qv \
TTT_Q_LORA=0 \
TTT_V_LORA=0 \
TTT_LOCAL_LR_MULT=0.75 \
EVAL_SEQ_LEN=2560 \
TTT_EVAL_SEQ_LEN=2560 \
QK_GAIN_INIT=5.25 \
MATRIX_LR=0.026 \
MIN_LR=0.1 \
EMBED_BITS=7 \
MATRIX_CLIP_SIGMAS=12.85 \
ATTN_CLIP_SIGMAS=13.0 \
MLP_CLIP_SIGMAS=11.5 \
EMBED_CLIP_SIGMAS=14.0 \
GRAD_CLIP_NORM=0.3 \
FUSED_CE_ENABLED=1 \
SMEAR_GATE_ENABLED=1 \
GATE_WINDOW=12 \
SPARSE_ATTN_GATE_ENABLED=1 \
LQER_ENABLED=1 \
LQER_RANK=4 \
LQER_TOP_K=3 \
LQER_GROUP_SIZE=64 \
LQER_ASYM_ENABLED=1 \
LQER_ASYM_GROUP=64 \
AWQ_LITE_ENABLED=1 \
ASYM_LOGIT_RESCALE=1 \
GPTQ_RESERVE_SECONDS=4.0 \
GPTQ_CALIBRATION_BATCHES=16 \
COMPRESSOR=pergroup \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

Repeat for SEED=0 and SEED=1234.

Lineage

This stands on a long chain of prior submissions. The four added levers and the PR #1945 core are all from public PRs:

The diagnostic context (long-context score gating at 2560, no_qv mask, TTT_LOCAL_LR_MULT sweep, QK_GAIN_INIT sweep) was originally measured on exact #1855 in private experiments before this stack. None alone cleared the merge bar on #1855. The contribution here is recognizing that they compose orthogonally on the PR #1945 base.

Files

…T LR 0.75 + QK_GAIN 5.25

3-seed mean val_bpb 1.05855370 (std 0.00029539). Clears merge bar
1.05893 by -0.00038 BPB. Improves on PR openai#1855 (1.06108) by -0.00253 BPB.

All seeds under 16 MB artifact, 600s train cap, 600s eval cap.
@romeerp
Contributor

romeerp commented Apr 30, 2026

I played around with long context during eval time earlier, but tried more complicated things involving dynamic selection that never made it in. It's cool to see that just changing the eval sequence length hyperparameter was able to improve performance without it ever appearing in training. Really nice result.

TanishGudise added a commit to TanishGudise/parameter-golf that referenced this pull request Apr 30, 2026
Fija pushed a commit to Fija/parameter-golf that referenced this pull request Apr 30, 2026
- Pull PR openai#1953's record dir (train_gpt.py + README) from openai/parameter-golf
- Phase U script combines retokenize SP8192 (no HF dataset cached) + 3-seed
  train using PR openai#1953's exact stack with one new lever: PREFIX_DOCS 2500 -> 2800
- Goal: clear record bar (1.05914) by stacking on top of openai#1953's 1.05855
- Robust: heartbeat, GPU keepalive, no trap-on-EXIT, per-seed HF upload,
  apt-installs lrzip for pergroup compressor
- Aborts seed-1 on BPB > 1.060 OR artifact > 16MB

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
aerosta pushed a commit to aerosta/parameter-golf that referenced this pull request Apr 30, 2026
3-seed mean val_bpb = 1.05851479 (std 0.000762, seeds 42/0/1234) on track_10min_16mb.

Stack:
- PR openai#1945 (alertcat) V21 base = PR openai#1908 + AWQ-Lite + AsymLogit Rescale
- PR openai#1953 (andrewbaggio1) TTT/QK env knobs (TTT_LR=0.75, QK_GAIN=5.25, no_qv mask)
- PR openai#1948 (TimS-ml + lijuncheng16) LeakyReLU squared slope 0.3
- PR openai#1145 (AnirudhRahul, valerio-endorsed) closed-form n-gram tilt with Σ P=1 Z renormalization

Compliance: causal hints, single-pass, Σ P=1 by construction, no SLOT,
no n-gram cache, no Pre-Quant TTT.

System deps: gcc + lrzip auto-installed by setup.sh; PyTorch 2.9.1 + Triton + Flash Attn 3.

One-command reproduction:
  bash setup.sh
  SEED={42,0,1234} bash run.sh
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 30, 2026
Hypothesis: PR openai#1953 verbatim + LeakyReLU2 slope 0.5->0.3 lands at
~1.0578 (3-seed). Skip 2xH100 mini (5-site numeric flip on validated
commit). Ladder = 8xH100 official direct, 3 seeds (42, 0, 1234).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Fija pushed a commit to Fija/parameter-golf that referenced this pull request Apr 30, 2026
…d-attempt arms

Phase V: TTT_OPTIMIZER=muon for LoRA SGD, swap from default AdamW. New env vars
TTT_MUON_LR_MULT (default 8x adam) and TTT_MUON_BACKEND_STEPS (default 5).
Hypothesis: Newton-Schulz orthogonalized momentum better suits low-rank LoRA.

Phase W: TTT_LORA_RANK 80->96, PHASED_TTT_NUM_PHASES 3->4, PHASED_TTT_PREFIX_DOCS 2500->2000
on PR openai#1953 stack. Hypothesis: more LoRA capacity + extra phase boundary captures more
cross-doc structure. Artifact-cap risk noted; seed-1 abort guards in script.

Both phases run openai#1953 base + their respective lever changes only. Robust pattern
(heartbeat, GPU keepalive, no trap-on-EXIT, per-seed HF upload) preserved from
Phase U / Phase S debugging.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 30, 2026
250D: PHASED_TTT_PREFIX_DOCS 2500 -> 3000 (~+80s, 6s cap margin — risky)
250E: TTT_LOCAL_LR_MULT 0.75 -> 0.65 (compute-neutral, sub-step around openai#1953 optimum)

Both eval-only via TTT_EVAL_ONLY=1 + RESUME_FROM_CKPT on spec 250's
final_model.pt. ~9-15 USD per spec for 3 seeds. No code change, no retrain.

Family complete: 250B (PREFIX 2750), 250C (PHASES 4), 250D (PREFIX 3000),
250E (LR_MULT 0.65). Independent eval-only sweeps; do not stack.
alertcat added a commit to alertcat/parameter-golf that referenced this pull request Apr 30, 2026
Final attempt to overtake PR openai#1953 (1.05855) and PR openai#1967 (1.05851).

Stack:
  - V21 base (PR openai#1908 + AWQ-lite + AsymLogit) — your existing record
  - + PR openai#1953's 7 verified levers (EVAL=2560, no_qv, TTT_LR_MULT=0.75, QK_GAIN=5.25)
  - + EVAL_SEQ_LEN=2816 (intermediate safe value, ~5% eval timing risk)
  - All other hparams identical to V21

Safety: EVAL_SEQ_LEN=2816 vs PR openai#1953's 2560 = ~10% eval time penalty.
Expected eval times: 470s/485s/564s (PR openai#1953 was 430/441/513).
Seed 1234 has thinnest margin (564s of 600s cap = 36s buffer).

Expected V22 BPB: 1.0578-1.0586 (3-seed mean)
P(beat PR openai#1953 1.05855): ~50%
P(beat PR openai#1967 1.05851): ~30-35% (timing-pending PR ahead)
alertcat added a commit to alertcat/parameter-golf that referenced this pull request Apr 30, 2026
…mean 1.05877

Layers PR openai#1953 (@andrewbaggio1)'s 7 hparam levers (TTT_MASK=no_qv,
TTT_Q_LORA=0, TTT_V_LORA=0, TTT_LOCAL_LR_MULT=0.75, QK_GAIN_INIT=5.25,
EVAL_SEQ_LEN, TTT_EVAL_SEQ_LEN) on top of V21 v2 base (PR openai#1908 + AWQ-lite
+ Asymmetric Logit Rescale + WD=2.0). EVAL_SEQ_LEN raised from PR openai#1953's
2560 to 2816 for longer eval context.

3-seed mean 1.05877 (std 0.00102), all strict <600s train wallclock
(596.087-596.152s) and 475-522s eval. Improvement over V21 v2 mean 1.05943
is -0.00066 BPB (matches community 0.0006 floor for meaningful delta).

Run on Hyperbolic eu-north-4 Iceland VM (8xH100 SXM5 80GB, PyTorch
2.9.1+cu128 with CUDA 13 forward-compat driver 580).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
aquariouseworkman added a commit to aquariouseworkman/parameter-golf that referenced this pull request Apr 30, 2026
# Record: PR openai#1953 stack — no_qv TTT + AWQ-lite + AsymLogit + long-context eval

**val_bpb = 1.05847** (3-seed mean, std 0.00063) | **max artifact 15,985,934 bytes** | 8x H100 SXM | strict 600s train + eval

## Results

| Seed | Stop step | Train time | Pre-quant BPB | Quantized BPB | **Post-TTT BPB** | Eval time | Artifact bytes |
|------|-----------|------------|---------------|---------------|------------------|-----------|----------------|
| 42   | 4892      | 595.97s ✅ | 1.06126       | 1.06962       | **1.05788**      | 493.2s ✅ | 15,979,342     |
| 0    | 4884      | 595.97s ✅ | 1.06181       | 1.07019       | **1.05840**      | 420.5s ✅ | 15,979,187     |
| 1234 | 4894      | 596.14s ✅ | 1.06232       | 1.07093       | **1.05914**      | 428.4s ✅ | 15,985,934     |
| **Mean** | **4890** | **596.03s** | **1.06180** | **1.07025** | **1.05847** | **447.4s** | **15,981,488** |

vs merged PR openai#1855 (1.06108): **-0.00261 BPB / -0.00571 nats**

## Stack

Inherits the full PR openai#1855 base (codemath3000) and layers:

1. **AWQ-lite mixed-precision GPTQ** (PR openai#1908, romeerp) — activation-aware salient-group int8 promotion
2. **Asymmetric Logit Rescale** (PR openai#1923, jorge-asenjo) — learnable pos/neg softcap during TTT eval
3. **no_qv TTT mask** (PR openai#1953, himanshudongre) — disable Q/V LoRA in TTT, keep K/MLP/O
4. **TTT_LOCAL_LR_MULT=0.75** — scaled TTT optimizer LR
5. **QK_GAIN_INIT=5.25** — per-head Q-gain initialization
6. **EVAL_SEQ_LEN=2560** — extended eval context
7. **PHASED_TTT_PREFIX_DOCS=3000** — larger global-TTT prefix
8. **TTT_LORA_RANK=56** — reduced LoRA rank (compute reallocation)

## Compliance

- [x] Artifact under 16,000,000 bytes (max 15,985,934)
- [x] Train wallclock under 600s (max 596.14s)
- [x] Eval wallclock under 600s (max 493.2s)
- [x] No PPM, no SLOT, no pre-quant TTT, no n-gram cache
- [x] Single left-to-right pass, score-before-update
- [x] Full normalized softmax distribution

## Reproduction

```bash
apt-get install -y lrzip
pip install sentencepiece brotli huggingface_hub numpy python-minifier
pip install --no-deps flash_attn_3 --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291/

# Dataset
HF_HUB_ENABLE_HF_TRANSFER=1 python3 -c "
from huggingface_hub import snapshot_download
snapshot_download(repo_id='romeerp/parameter-golf-caseops-v1', repo_type='dataset', local_dir='/workspace/caseops_data')
"

# Run
for SEED in 42 0 1234; do
  SEED=$SEED \
  DATA_PATH=/workspace/caseops_data/datasets/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved \
  TOKENIZER_PATH=/workspace/caseops_data/datasets/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model \
  CASEOPS_ENABLED=1 VOCAB_SIZE=8192 MAX_WALLCLOCK_SECONDS=600 VAL_LOSS_EVERY=0 \
  SMEAR_GATE_ENABLED=1 GATE_WINDOW=12 SPARSE_ATTN_GATE_ENABLED=1 SPARSE_ATTN_GATE_SCALE=0.5 \
  GATED_ATTN_QUANT_GATE=1 FUSED_CE_ENABLED=1 QK_GAIN_INIT=5.25 \
  EMBED_BITS=7 MATRIX_CLIP_SIGMAS=12.85 ATTN_CLIP_SIGMAS=13.0 MLP_CLIP_SIGMAS=11.5 EMBED_CLIP_SIGMAS=14.0 \
  GPTQ_RESERVE_SECONDS=4.0 GPTQ_CALIBRATION_BATCHES=16 COMPRESSOR=pergroup \
  LQER_ENABLED=1 LQER_ASYM_ENABLED=1 LQER_RANK=4 LQER_FACTOR_BITS=4 LQER_ASYM_GROUP=64 LQER_TOP_K=3 \
  AWQ_LITE_ENABLED=1 AWQ_LITE_BITS=8 AWQ_LITE_GROUP_TOP_K=1 AWQ_LITE_GROUP_SIZE=64 \
  ASYM_LOGIT_RESCALE=1 \
  TTT_ENABLED=1 PHASED_TTT_ENABLED=1 PHASED_TTT_NUM_PHASES=3 PHASED_TTT_PREFIX_DOCS=3000 \
  TTT_LORA_RANK=56 TTT_MASK=no_qv TTT_Q_LORA=0 TTT_V_LORA=0 TTT_LOCAL_LR_MULT=0.75 \
  TTT_CHUNK_SIZE=48 TTT_BETA2=0.99 TTT_WEIGHT_DECAY=0.5 \
  EVAL_SEQ_LEN=2560 TTT_EVAL_SEQ_LEN=2560 \
  WARMDOWN_FRAC=0.85 BETA2=0.99 GRAD_CLIP_NORM=0.3 MIN_LR=0.1 MATRIX_LR=0.026 \
  NCCL_NET=Socket GLOBAL_TTT_MOMENTUM=0.9 \
  torchrun --standalone --nproc_per_node=8 train_gpt.py > train_seed${SEED}.log 2>&1
done
```
Fija pushed a commit to Fija/parameter-golf that referenced this pull request Apr 30, 2026
Pull PR openai#2014's record dir from openai/parameter-golf and reproduce its 1.05759
3-seed mean. Key new levers vs openai#1953: EVAL_SEQ_LEN=3072, train_seq_schedule
1024->2048->3072, single-phase TTT (NUM_PHASES=1, PREFIX=2500), short-doc
score-first chunking (TTT_SHORT_SCORE_FIRST_STEPS=256:8,2000:24).

Even with our infra's ~1.5-2 milli-BPB inflation pattern, reproducing openai#2014
should land ~1.0590 — close enough to record bar to potentially clear it.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Fija pushed a commit to Fija/parameter-golf that referenced this pull request May 1, 2026
…ean 1.05831 BPB

Clears record bar (1.05914) by 0.83 milli-BPB. Welch t = -6.49 vs PR openai#1855 (1.06108),
p < 0.0001. All 3 seeds produce 15.99 MB artifacts under the 16 MB cap, all under
the 600s wallclock budget.

Per-seed:
- 42:   ttt=1.05793  art=15,986,149  eval=572.6s
- 314:  ttt=1.05852  art=15,987,257  eval=553.7s
- 1234: ttt=1.05849  art=15,989,895  eval=574.1s

Submission directory at records/track_10min_16mb/2026-04-30_PR2014_Reproduction_1.0583/
contains PR openai#2014's verbatim train_gpt.py + tokenizer + our seed_results.csv + a
detailed README documenting the lineage (openai#1797 -> openai#1851 -> openai#1855 -> openai#1908 -> openai#1923
-> openai#1953 -> openai#2014), the new levers vs each parent, and the full 4-condition
C1-C4 legality check. submission.json author/github_id are placeholders pending
the user's choice of submitting account.

Reproduction script: runpod/phase_x_pr2014.sh — runs end-to-end on a single
8xH100 SXM pod (~2.5h wall, ~$66 cost).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request May 1, 2026
Audits every CaseOps-lineage record-track PR (merged + unmerged) since
2026-04-18 for whether val docs are also in the training set.

Working set: 34 PRs (31 from chronological seed list + 3 discovered ancestors:
openai#1908, openai#1923, openai#2007). Boundary nodes openai#1493 / openai#1626 (pre-CaseOps).

Verdicts:
  - CLEAN (8): openai#1729, openai#1851, openai#1868, openai#1908, openai#2019, openai#2027, openai#2031, openai#2068
  - LEAK (25): openai#1736 (our research baseline) → openai#1769openai#1787openai#1797openai#1855 → V21 family (openai#1945, openai#1923, openai#1953, openai#1967) → openai#2018openai#2118
    (current claimed frontier 1.04350), plus siblings.
  - INHERIT (1): openai#2050 (eval-only on frozen openai#1915)

Code-level evidence (not README claims):
  - Every shipped prepare_caseops_data.py is byte-identical:
    SHARD_TOKENS=10_000_000, default=10_000 for --val-docs
  - NO PR overrides --val-docs (searched all .sh files in all 34 PRs)
  - cached_challenge_fineweb.py downloads from romeerp/parameter-golf-caseops-v1
    HF dataset whose manifest pins docs_val=50000, docs_train=8181945,
    sums match → CLEAN by construction
  - PR openai#2018's DATASET_AUDIT.md is gold-standard explicit leak description
  - PR openai#2118's submission.json admits "--val-docs=10000 train shards + 50k val eval"

Three signposts:
  - Leak introduced: PR openai#1736 by @dexhunter (Apr 19) — first prepare_caseops_data.py
    default invocation
  - Leak fixed: PR openai#1851 by @aquariouseworkman (Apr 27) — switched to HF dataset
  - Leak re-introduced: PR openai#1855 by @codemath3000 (same day) — rebuilt locally

The merged-leaderboard SOTA (openai#1851/openai#1868 at 1.06128/1.06141) is CLEAN.
The unmerged frontier (openai#2118 at 1.04350) is LEAK. The 0.018 bpb gap is
inflated by val memorization; spec 301 was designed to measure how much
remains under clean data.

Files:
  caseops-memory-leakage/README.md       — overview, methodology, takeaways
  caseops-memory-leakage/verdicts.md     — 34-row master table with evidence
  caseops-memory-leakage/family-tree.md  — ASCII trees with [C]/[L] annotations
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request May 1, 2026
…ence

After user feedback that LEAK calls relied too heavily on lineage-inheritance
and path heuristics, applied stricter criterion: a LEAK verdict requires at
least one of (a) explicit shell-script invocation of prepare_caseops_data.py
without --val-docs=50000, (b) README "Data setup" matching actual train log
path, (c) audit/submission.json admission text, (d) train log path with
`_caseops/datasets/datasets/<name>` triple-nesting OR single `<root>/datasets/<name>`
(which only local prep produces; HF always gives double-nesting).

Records that previously got LEAK by lineage-inheritance alone are now AMBIGUOUS
unless they meet at least one of those tests.

Changes:
  - openai#1945 LEAK → CLEAN  (finalize_v18.sh has snapshot_download from HF;
    actual run path matches HF target; README's prepare_caseops_data.py
    section is stale documentation)
  - openai#1953 LEAK → AMBIGUOUS  (PR ships only train_gpt.py + logs; no prep
    evidence; path matches HF target; parent openai#1945 confirmed CLEAN —
    leans CLEAN but no direct PR evidence)
  - openai#2041 LEAK → AMBIGUOUS  (no prep invocation; double-nested path
    consistent with EITHER HF or local prep)
  - openai#2075 LEAK → AMBIGUOUS  (ships prep file but no explicit invocation;
    path matches HF target)

Updated tally: CLEAN 9, LEAK 21, AMBIGUOUS 3, INHERIT 1 (was 8/25/0/1).

Headline impact: realistic clean SOTA is at most ~0.012 bpb below the
claimed frontier openai#2118 (1.04350). Best clean BPB candidates in order:
  openai#2019 1.05847 (HF, confirmed)
  openai#1953 1.05855 (AMBIGUOUS, leans CLEAN)
  openai#1945 1.05943 (HF, confirmed via re-audit)
  openai#2031 1.05985 (HF, confirmed)
  openai#1908 1.06081 (HF, confirmed)
  openai#1851 1.06128 (HF, MERGED SOTA)
