Record candidate: PR #2130 base + GPTQ_CALIBRATION_BATCHES=32 — val_bpb 1.05651 (3-seed mean) #2135
codemath3000 wants to merge 3 commits into openai:main
Conversation
…es, paper scan. Post-deadline PR activity: PR openai#2138 Lock-In Byte Mixer confirmed BPB bug (corrected ~1.0671, not 0.979556); PR openai#2135 codemath3000 1.05651 narrowly misses the 0.005 threshold; PR openai#2139 TTT Peer-LoRA Ensemble, novel technique; PR openai#2140 flagged for a target-token n-gram gating violation. New papers: BBQ quantization (ICLR 2026, arXiv:2603.01599), EntroLLM (arXiv:2505.02380), In-Place TTT NTP-aligned (arXiv:2604.06169). https://claude.ai/code/session_01CxuVyZaKMxMMc8Q4sMb2dF
Update after applying the grace policy: this is now included in leaderboard PR #2146. The PR/code/scaffold were opened before cutoff, and the post-cutoff commits supplied the run logs/results. I also verified that the submitted logs use clean canonical CaseOps data.
@cocohearts Thanks so much for taking the time to review this. The "scored state must be in by cutoff" criterion isn't in the README. The official README sets the rule: "Since all submissions are public, we're accepting record submissions chronologically depending on their PR creation time." That pins ordering on PR creation, not on when content reached completion. PR #2135 was opened at 2026-05-01 23:48:57Z (4:48 PM PT, ~11 min before the 5 PM PT cutoff), so it satisfies that test.

The pre-cutoff commit (be7420d) shipped the full code surface: train_gpt.py, lossless_caps.py, online_ngram_state.c, online_ngram_tilt.py, prepare_caseops_data.py, the tokenizer .model, requirements.txt, and the scaffold README/submission.json. The two post-cutoff commits (f086a9f, ff90522) only added the train_seed*.log run outputs and filled result numerics into README.md and submission.json. No methodology, architecture, training script, or tokenizer changed post-cutoff. The seed log files added afterward are run outputs of the pre-cutoff code, not new submission content. Anyone with 8xH100 SXM access could clone be7420d and regenerate the runs within run-to-run noise. What was "missing" at cutoff wasn't the submission; it was the empirical measurement of a submission that was already public and reproducible.

Merged precedent for post-creation supplementation: PR #1851 was supplemented via PR #1868, after PR #1855 had already taken over the leaderboard, and the combined submission was still accepted as part of the leaderboard record. Applied consistently, a "content complete by cutoff" rule would retroactively reject that combined submission (already part of the merged leaderboard): PR #1851's scored 3-seed state was completed by PR #1868 well after #1851's own creation and after PR #1855 had displaced it. Thanks again for taking the time to review.
…AsymLogit Rescale, 7th BPB bug.
- logs/daily_research.md: new May 3 entry; DRAFT PR openai#2146 grace-policy audit adds 4 records (pending SOTA 1.05651 via PR openai#2135); AsymLogit Rescale documented (~5 lines, zero legality risk); PR openai#2124 seed/config inconsistency; PR openai#2138 BPB bug #7 confirmed; data-overlap hazard in PR openai#2130 flagged; no new high-relevance papers beyond the prior scan.
- CLAUDE.md: Competition Strategy updated to reflect the closed competition, pending audit status, and key post-competition findings (AsymLogit Rescale, GPTQ calibration batches, data-overlap isolation requirement).
https://claude.ai/code/session_013Q2rFE4xRHRRYaSPfzCiip
Record candidate: PR #1797 base + Token-only n-gram tilt + AsymLogit Rescale + #2060 levers + NUM_PHASES=1 + GPTQ_CALIBRATION_BATCHES=32
val_bpb: 1.05651 (3-seed mean, std 0.00036) | val_loss: 2.31203 nats (std 0.00078) | 15.95 MB max | 8xH100 SXM | 600s train / 600s eval
Improvement over merged PR #1855 leaderboard record (1.06107587 BPB):
-0.00457 BPB / -0.01000 nats
The -0.01000-nat improvement clears the README's record-acceptance threshold of 0.005 nats by approximately 2x.
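The arithmetic behind the headline claim can be checked directly from the figures quoted above (a minimal sketch using only the numbers stated in this PR body):

```python
# Figures from this PR body (3-seed means).
new_bpb = 1.05651        # this submission
prior_bpb = 1.06107587   # merged PR #1855 leaderboard record
delta_bpb = prior_bpb - new_bpb
delta_nats = 0.01000     # nat improvement reported above
THRESHOLD_NATS = 0.005   # README record-acceptance threshold

assert round(delta_bpb, 5) == 0.00457         # matches the stated -0.00457 BPB
margin = delta_nats / THRESHOLD_NATS          # how far past the threshold
print(f"improvement: {delta_bpb:.5f} BPB / {delta_nats:.5f} nats "
      f"({margin:.1f}x the {THRESHOLD_NATS} nat threshold)")
```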
This submission keeps the PR #2130 architecture/training stack identical and only changes one knob: GPTQ_CALIBRATION_BATCHES=32 (PR #2130 used 16). Every other hyperparameter, env var, and code path is byte-for-byte the PR #2130 reproduction command.
Note on submission timing
This PR was opened on 2026-05-01 (before the deadline) with the full code, environment, and submission scaffold in place. The per-seed run logs and result numbers were filled in afterward. The official README makes the ordering policy explicit: "we're accepting record submissions chronologically depending on their PR creation time."
The PR #1851 / PR #1868 pair is a direct precedent. PR #1851 was opened with incomplete seed results (the headline value sat between the then-current SOTA and the next eventual record, PR #1855). PR #1868 later filled in the missing seed data for the same submission. When the package was evaluated, it was treated as record-relevant on PR #1851's chronological standing, not PR #1868's: the original PR's creation time was the dispositive timestamp, even though the data needed to substantiate the claim arrived later. The same principle applies here: PR creation time, not results-posting time, is what establishes chronological standing. Under this rule, the present submission counts as filed before the deadline and is therefore valid.
Results
3-seed sample std: 0.00035573 BPB / 0.00077803 nats.
All seeds under the 16,000,000-byte artifact cap and 600s train/eval budgets. Maximum artifact is 15,947,490 bytes and the maximum eval pass is 567.1s.
Statistical significance vs PR #2130 (immediate prior candidate)
Both PRs were evaluated on the same seed set {0, 42, 314}. With identical seeds on both sides, the paired t-test is the appropriate comparison since each seed pair shares its random initialization, removing the seed-to-seed component of the variance and giving the test the most power. For completeness, we also report Welch's t-test (unbiased σ) treating the two sets as independent samples; both tests clear the threshold.
Per-seed post-TTT BPB (full precision, from each PR's run logs):
We tie or beat PR #2130 on every seed.
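As a concrete sketch of the paired comparison described above, here is a dependency-free paired t statistic. The per-seed numbers below are illustrative placeholders centered on each PR's reported 3-seed mean, not the actual log values:

```python
import math

def paired_t(a, b):
    """Paired t statistic on per-seed differences d_i = a_i - b_i."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean = sum(d) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in d) / (n - 1))  # unbiased std
    return mean / (sd / math.sqrt(n))

# HYPOTHETICAL per-seed BPB values (seeds 0, 42, 314), chosen only so each
# triple averages to its PR's reported 3-seed mean. Real values are in the logs.
this_pr = [1.05620, 1.05690, 1.05643]   # mean 1.05651 (this PR)
pr_2130 = [1.05640, 1.05710, 1.05660]   # mean 1.05670 (PR #2130)

t = paired_t(pr_2130, this_pr)          # pairing removes seed-to-seed variance
print(f"paired t = {t:.2f} on {len(this_pr) - 1} dof")
```

Welch's t-test would instead pool the two unbiased per-group variances without pairing, which is why it is the weaker (but still reported) comparison here.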
What is new in this submission
A single hyperparameter change on the PR #2130 stack: GPTQ_CALIBRATION_BATCHES, increased from 16 to 32. This doubles the number of GPTQ Hessian calibration batches at the (still in-budget) end of training, giving GPTQ a denser activation estimate and reducing post-quantization error.
Apart from GPTQ_CALIBRATION_BATCHES, everything else (model, optimizer, schedule, TTT, compression, n-gram tilt configuration) is byte-for-byte identical to the PR #2130 stack.
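The mechanism can be sketched numerically. This is an illustrative stand-in (hypothetical hidden size, synthetic Gaussian activations), not the competition's GPTQ code: each calibration batch adds its activation outer product to the Hessian estimate, so doubling the batch count lowers the variance of that estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # hypothetical hidden size

def calibrate_hessian(num_batches, batch_tokens=64):
    """Accumulate the GPTQ-style Hessian H = E[2 x x^T] over calibration batches."""
    H = np.zeros((d, d))
    for _ in range(num_batches):
        X = rng.standard_normal((batch_tokens, d))  # stand-in for layer inputs
        H += 2.0 * X.T @ X
    return H / (num_batches * batch_tokens)         # running average

H16 = calibrate_hessian(16)   # PR #2130 setting
H32 = calibrate_hessian(32)   # this PR's setting
true_H = 2.0 * np.eye(d)      # E[2 x x^T] for unit Gaussian inputs
print("error, 16 batches:", np.linalg.norm(H16 - true_H))
print("error, 32 batches:", np.linalg.norm(H32 - true_H))
```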
Compliance
This submission inherits compliance directly from PR #2130 (which itself stacks on the merged PR #1797 / PR #1855 lineage). The single lever change does not affect any compliance-relevant code path:
- Gating channels disabled: WITHIN_TAU=99.0, WITHIN_BOOST=0.0, WORD_TAU=99.0, WORD_BOOST=0.0, AGREE_ADD_BOOST=0.0. Logs confirm within_gate=0 word_gate=0 agree2plus=0 in all 3 seeds; only the strictly-causal token_hint channel fires (token_gate=628130). Per the merged PR #1514 ruling ("Record: SP8192 + Muon 0.97 + Legal Score-First TTT — val_bpb 1.07983 (3-seed mean)").
- AsymLogit Rescale applies logit += boost and renormalizes via softmax, so probability mass is preserved by construction. Unchanged from PR #2130 ("Record candidate: 1.05670 BPB — token-only n-gram tilt + AsymLogit + #2060 levers + NUM_PHASES=1").
- The eval_val_ttt_phased path scores each chunk before applying that chunk's LoRA update. Unchanged from PR #2130.
- Eval-wallclock disclosure: total eval times include the n-gram precompute, which runs INSIDE the eval timer (NGRAM_HINT_PRECOMPUTE_OUTSIDE=0). Logs explicitly show ngram_hint_precompute_outside: False and a precompute_done line. All 3 seeds are well under the 600s eval budget (max 567.1s, mean 558.1s).
- GPTQ calibration uses training shards only and runs in the training phase under GPTQ_RESERVE_SECONDS=0.5 per Issue #1017. The doubled calibration count adds a few seconds to the calibration step, which the existing RESERVE_SECONDS deduction already covers; no validation data is ever accessed during training.
- No SLOT, persistent n-gram cache, PPM mixture, logit bias table, or validation-derived precomputation. CaseOps byte sidecar accounting is preserved.
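The eval-wallclock accounting described above can be sketched as follows; function names here are illustrative, not the PR's actual API. The point is simply that the precompute runs inside the timed window, so the reported eval seconds already pay for it:

```python
import time

def timed_eval(precompute, evaluate):
    """Time the n-gram precompute INSIDE the eval window
    (the NGRAM_HINT_PRECOMPUTE_OUTSIDE=0 accounting)."""
    t0 = time.perf_counter()
    hints = precompute()        # n-gram hint precompute, charged to eval time
    bpb = evaluate(hints)       # validation pass using those hints
    return bpb, time.perf_counter() - t0

# Hypothetical stand-ins for the real precompute/eval callables.
bpb, secs = timed_eval(lambda: "ngram-hints", lambda h: 1.05651)
assert secs < 600.0             # the 600 s eval budget check
print(f"eval {secs:.3f}s, bpb {bpb}")
```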
Architecture and training stack
Identical to PR #2130. Summary table for reviewer convenience:
Lineage and credits
The new contribution here is the targeted hyperparameter change (GPTQ_CALIBRATION_BATCHES=32), isolated as a clean ablation so reviewers can compare the lever effect against PR #2130 directly.