{RECORD} CaseOps pre-quant TTT record (1.0354 BPB)#1911
Open
dttdrv wants to merge 3 commits into openai:main from
Conversation
alertcat added a commit to alertcat/parameter-golf that referenced this pull request on Apr 29, 2026
…ams)

After 4 parallel research agents reviewed 30+ open PRs and compliance issues, two new findings:

1. PR openai#1923 (AsymLogit) flagged "empirical negative" by sunnypatneedi 4-29 frontier-scan, BUT only on PR openai#1855 base with default WD=1.0. Never tested on the PR openai#1908 + WD=2.0 combo. V19's specific stack is NOT directly invalidated.
2. PR openai#1925 simon-marcus 1.06049 (3-seed verified, vs PR openai#1855 base 1.06108 = -0.00059 BPB). Just 2 hparam env vars: MATRIX_LR 0.026 -> 0.028, PHASED_TTT_PREFIX_DOCS 2500 -> 3500. Orthogonal axis to AsymLogit (LR/TTT prefix vs logit head).

Adds two new scout scripts:
- run_v19c_stacked_scout.sh: PR openai#1908 + AsymLogit + simon-marcus + WD=2.0 (full stack, recommended first scout)
- run_v19b_simonmarcus_scout.sh: PR openai#1908 + simon-marcus + WD=2.0 (ablation if V19c wins partially)

Decision rule (CaseOps val baseline 0.97651, community floor 0.0006):
- V19c < 0.97591 -> CLEAR WIN, run 3-seed
- V19c 0.97591-0.9755 -> borderline, ablate via V19a/V19b
- V19c > 0.9755 -> abandon stack, try Lead B (PR openai#1884)

Other research findings:
- PR openai#1898 SpinQuant flagged regression vs parent openai#1851 (skip)
- PR openai#1929 SLOT banned per openai#1722 precedent
- PR openai#1911 pre-quant TTT chain banned per openai#1735 precedent
- cocohearts 4-28 PR openai#1902 confirmed PR openai#1855 as official openai#1
- regina-openai + Alex Zhao 48h zero activity
- CaseOps de-facto legal (PR openai#1855 merged into chain)
Summary
This PR submits the CaseOps V15 record stack for
track_10min_16mb:

- 3-seed val_bpb: 1.03540487 (std 0.00056684) over seeds 1337, 42, 999
- Artifact sizes: 15,994,993 to 15,996,195 bytes
- Best seed (1337): 1.03459029 BPB with a 15,996,563-byte artifact, reproduced on 2026-04-28/29

Tagged {RECORD} because it clears the threshold versus PR #1735 ("Record: SP8192 + Parallel Pre-Quant TTT — val_bpb 1.0429 (3-seed mean)") and its 1.04290 BPB result.

I am being explicit about the provenance here: this is a community stack, not a "one weird trick" claim. The core move is combining PR #1735's parallel pre-quant TTT stack with PR #1729's CaseOps tokenizer/byte-sidecar path, as integrated in PR #1738, and independently reproducing it.
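The {RECORD} arithmetic can be checked directly. A quick sketch (the BPB figures are from this PR and PR #1735; the only step is the nats-to-bits conversion):

```python
import math

prev_bpb = 1.04290      # PR #1735, 3-seed mean
this_bpb = 1.03540487   # this PR, 3-seed mean
threshold_nats = 0.005  # record threshold, nats per byte

# nats/byte -> bits/byte: divide by ln 2
threshold_bpb = threshold_nats / math.log(2)
improvement = prev_bpb - this_bpb

print(f"threshold   = {threshold_bpb:.5f} BPB")  # -> 0.00721
print(f"improvement = {improvement:.5f} BPB")    # -> 0.00750
print("RECORD" if improvement > threshold_bpb else "no record")
```

The improvement clears the converted threshold with about 0.00028 BPB to spare, which is why the margin is described as "just over" rather than comfortable.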
Why I Did This
The frontier PRs pointed to two large, mostly orthogonal levers:

- the CaseOps tokenizer/byte-sidecar path (PR #1729), which gives the model a cleaner sequence modeling target, and
- parallel pre-quant TTT (PR #1735), which adapts the full-precision model before export.

Those two ideas should compose. CaseOps makes the sequence modeling problem cleaner; pre-quant TTT spends the remaining time budget adapting the full-precision model to that cleaner target before export. The non-trivial integration work is that CaseOps cannot use naive decoded-token byte counting; it needs byte sidecars for honest BPB.
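To make the honest-BPB point concrete, here is a minimal sketch of the sidecar idea (hypothetical names, not this PR's actual loader): BPB divides the summed NLL in nats by ln 2 times the true byte count recorded in the sidecar, never by byte counts recovered from decoding the token stream.

```python
import math

def bpb_from_sidecar(nll_nats_per_token, bytes_per_token):
    """Bits-per-byte from per-token NLL (nats) and sidecar byte counts.

    Dividing by the sidecar's true byte total keeps BPB honest under a
    case-transforming tokenizer like CaseOps, where decoding the token
    stream alone would mis-count the original bytes."""
    total_nats = sum(nll_nats_per_token)
    total_bytes = sum(bytes_per_token)
    return total_nats / (math.log(2) * total_bytes)

# Toy example: 4 tokens covering 11 original bytes.
nll = [2.3, 1.1, 0.7, 3.0]
raw_bytes = [3, 2, 4, 2]
print(round(bpb_from_sidecar(nll, raw_bytes), 4))  # -> 0.9312
```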
What This PR Adds
This record folder is based on PR #1738's CaseOps V15 integration:
- `load_validation_token_bytes()` loads the byte sidecars
- `eval_val()`, `eval_val_sliding()`, and `eval_val_ttt()` compute BPB against sidecar byte counts
- `_bytes_` files are excluded from token-shard loading to avoid double-counting
- the public CaseOps export `romeerp/parameter-golf-caseops-v1` is used for setup

Results
The previous frontier stack I am comparing against is PR #1735 at
1.04290 BPB. This improves on it by about 0.00750 BPB, just over the 0.005-nat / 0.00721 BPB record threshold.

Independent reproduction from this same record folder:
Reproduction checkpoints:
- 588132 ms, step 4568/20000: val_bpb = 1.08389912
- post-prequant-ttt: val_bpb = 1.02819756
- val_bpb = 1.04801825
- final: val_bpb = 1.03459029, artifact 15,996,563 bytes

Technique Stack
- mlp_mult=4.0

Full Lineage / Credits
I read the upstream PR chain and am intentionally not reducing this to a short credit list. The exact runtime stack is a compressed script, so not every ancestor appears as a neat isolated function anymore, but these are the PRs I traced as leading to this record's components or to the parent PRs used here.
Compliance Notes
This is submitted under the same Track A interpretation as PR #1735 and PR #1738:
The sensitive part is pre-quant TTT on validation chunks. I am not hiding that. I am submitting this consistently with the Track A framing used by PR #1735 / PR #1738: adaptation is part of producing the fixed artifact, and the scorer sees a fixed predictor. If maintainers decide that interpretation is not allowed, this line should be judged consistently with those PRs.
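For readers unfamiliar with the framing, here is a toy numpy sketch of pre-quant TTT (illustrative only; none of this is the PR's actual training code, and the linear model, chunk data, and quantizer are invented for the example): adapt the full-precision weights on held-out chunks with a few SGD steps, then quantize and freeze, so the scorer only ever sees the fixed exported predictor.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy full-precision "model": a single linear map y = W x.
W = rng.normal(size=(4, 4))

# Hypothetical "validation chunks": inputs paired with targets from a
# map T, standing in for the cleaner CaseOps target distribution.
T = rng.normal(size=(4, 4))
chunks = [(x, T @ x) for x in rng.normal(size=(8, 4))]

def loss(W):
    return float(np.mean([np.sum((W @ x - y) ** 2) for x, y in chunks]))

def prequant_ttt(W, chunks, lr=0.05, steps=50):
    """A few SGD epochs on validation-style chunks BEFORE export:
    the adapted weights become part of the fixed artifact."""
    W = W.copy()
    for _ in range(steps):
        for x, y in chunks:
            W -= lr * np.outer(W @ x - y, x)  # grad of 0.5*||Wx - y||^2
    return W

def quantize(W, scale=32):
    """Crude int8-style export: round, clip, dequantize."""
    return np.round(W * scale).clip(-127, 127) / scale

baseline = loss(quantize(W))                        # export without adaptation
adapted = loss(quantize(prequant_ttt(W, chunks)))   # adapt, then export
print(baseline, adapted)  # adapted should be far lower
```

The point of the sketch is the ordering: adaptation happens before quantization and export, and the evaluated artifact is fixed afterwards, which is the Track A framing described above.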
Dependencies / External Data
The challenge README allows packages/imports as long as they do not violate the evaluation, compute, training-time, code-size, or other restrictions, and asks record folders to include dependency/setup notes. I added a
requirements.txt to this record folder for manual setup.

For clarity:

- `romeerp/parameter-golf-caseops-v1` is used as the public CaseOps tokenizer/data export for training setup, before `train_gpt.py` runs.
- The runtime stack depends on `torch`, `numpy`, `sentencepiece`, and `brotli`; it tries FlashAttention 3 when available in the official H100 image and otherwise falls back to the PyTorch attention path.
- `huggingface-hub` and `hf_transfer` are only for fetching the public CaseOps dataset/tokenizer during setup.

So no, I am not relying on an external service at eval time. The only external piece is the documented public dataset/tokenizer setup needed to reproduce the training run, in the same spirit as the repository's normal data download flow.
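The FlashAttention fallback amounts to a backend probe at startup. A sketch (the FA3 module name `flash_attn_interface` is an assumption on my part, not verified against this record's environment):

```python
from importlib.util import find_spec

def pick_attention_backend():
    """Prefer FlashAttention 3 when its package is importable (e.g. in
    the official H100 image); otherwise fall back to PyTorch's built-in
    scaled_dot_product_attention path.

    'flash_attn_interface' is an ASSUMED module name for FA3."""
    if find_spec("flash_attn_interface") is not None:
        return "flash3"
    return "sdpa"  # torch.nn.functional.scaled_dot_product_attention

print(pick_attention_backend())
```

Probing with `find_spec` rather than a bare `import` inside `try/except` keeps startup cheap when the package is absent; either pattern gives the same graceful degradation described above.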
Reproduction
Test Plan
- Seeds 1337, 42, 999 (3-seed mean)
- Seed 1337 reproduction on 2026-04-28/29