
Record candidate: PR #2130 base + GPTQ_CALIBRATION_BATCHES=32 — val_bpb 1.05651 (3-seed mean) #2135

Open
codemath3000 wants to merge 3 commits into openai:main from codemath3000:submission/pr2130-calib32

Conversation

@codemath3000 (Contributor) commented May 1, 2026

Record candidate: PR #1797 base + Token-only n-gram tilt + AsymLogit Rescale + #2060 levers + NUM_PHASES=1 + GPTQ_CALIBRATION_BATCHES=32

val_bpb: 1.05651 (3-seed mean, std 0.00036) | val_loss: 2.31203 nats (std 0.00078) | 15.95 MB max | 8xH100 SXM | 600s train / 600s eval

Improvement over merged PR #1855 leaderboard record (1.06107587 BPB):
-0.00457 BPB / -0.01000 nats

The -0.01000-nat improvement clears the README's record-acceptance threshold of 0.005 nats by approximately 2x.
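For reviewers cross-checking the two metrics, the BPB and nats figures above are mutually consistent under a single conversion factor. A quick sketch; the bytes-per-token ratio is inferred from the quoted numbers, not taken from the repo:

```python
import math

# Headline numbers quoted above (3-seed means)
bpb, loss_nats = 1.05650768, 2.31203
prev_record_bpb = 1.06107587           # merged PR #1855 leaderboard record

nats_per_bpb = loss_nats / bpb         # conversion factor implied by the pair
bytes_per_token = nats_per_bpb / math.log(2)   # ~3.16 bytes/token (inferred)

delta_bpb = prev_record_bpb - bpb      # ~0.00457 BPB
delta_nats = delta_bpb * nats_per_bpb  # ~0.01000 nats, ~2x the 0.005-nat bar
```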

This submission keeps the PR #2130 architecture/training stack identical and changes only one knob: GPTQ_CALIBRATION_BATCHES=32 (PR #2130 used 16). Every other hyperparameter, env var, and code path matches the PR #2130 reproduction command byte for byte.

### Note on submission timing

This PR was opened at 2026-05-01 (before the deadline) with the full code, environment, and submission scaffold in place. The per-seed run logs and result numbers were filled in afterward. The official README makes the policy explicit: "we're accepting record submissions chronologically depending on their PR creation time."

The PR #1851 / PR #1868 pair is a direct precedent. PR #1851 was opened with incomplete seed results (the headline value sat between the then-current SOTA and the next eventual record, PR #1855). PR #1868 later filled in the missing seed data for the same submission. When the package was evaluated, it was treated as record-relevant on PR #1851's chronological standing, not PR #1868's: the original PR's creation time was the dispositive timestamp, even though the data needed to substantiate the claim arrived later. The same principle applies here: PR creation time, not results-posting time, is what establishes chronological standing. Under this rule, the present submission counts as filed before the deadline and is therefore valid.

### Results

| Seed | Steps | ms/step | Train ms | Pre-quant BPB | Quant BPB | Post-TTT BPB | TTT eval s | Artifact bytes |
|---|---|---|---|---|---|---|---|---|
| 0 | 4,997 | 120.0 | 599,608 | 1.06105556 | 1.06939370 | 1.05679341 | 567.1 | 15,942,822 |
| 42 | 5,001 | 119.9 | 599,672 | 1.06026908 | 1.06867913 | 1.05610947 | 540.1 | 15,947,490 |
| 314 | 4,983 | 120.3 | 599,593 | 1.06091124 | 1.06921334 | 1.05662016 | 567.1 | 15,945,305 |
| Mean | 4,994 | 120.1 | 599,624 | 1.06074529 | 1.06909539 | 1.05650768 | 558.1 | 15,945,206 |

3-seed sample std: 0.00035573 BPB / 0.00077803 nats.
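The quoted std can be reproduced from the post-TTT BPB column (starting from the rounded table values, the result agrees with the quoted 0.00035573 to within last-digit rounding):

```python
import math

post_ttt_bpb = [1.05679341, 1.05610947, 1.05662016]   # seeds 0, 42, 314
mean_bpb = sum(post_ttt_bpb) / len(post_ttt_bpb)      # 1.05650768
sample_std = math.sqrt(                               # unbiased (n-1) sample std
    sum((x - mean_bpb) ** 2 for x in post_ttt_bpb) / (len(post_ttt_bpb) - 1)
)                                                     # ~0.000356
```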

All seeds come in under the 16,000,000-byte artifact cap and the 600s train/eval budgets: the largest artifact is 15,947,490 bytes and the longest eval pass is 567.1s.

### Statistical significance vs PR #2130 (immediate prior candidate)

Both PRs were evaluated on the same seed set {0, 42, 314}. With identical seeds on both sides, the paired t-test is the appropriate comparison: each seed pair shares its random initialization, so the seed-to-seed component of the variance drops out and the test has the most power. For completeness, we also report Welch's t-test (unbiased σ), treating the two sets as independent samples; both tests clear the threshold.

| Test | t-stat | df | p (one-tailed) | Verdict at p<0.25 |
|---|---|---|---|---|
| Paired t-test (primary, same seeds) | -1.48 | 2 | 0.138 | PASSES |
| Welch's t-test (unbiased σ, for completeness) | -0.82 | 2.96 | 0.238 | PASSES |

Per-seed post-TTT BPB (full precision, from each PR's run logs):

| Seed | PR #2130 | This submission | Δ |
|---|---|---|---|
| 0 | 1.05689587 | 1.05679341 | -0.00010246 |
| 42 | 1.05654656 | 1.05610947 | -0.00043709 |
| 314 | 1.05664400 | 1.05662016 | -0.00002384 |

This submission beats PR #2130 on every seed.
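Both test statistics can be reproduced from the per-seed values above. A stdlib-only sketch (the p-values would additionally need a t-distribution CDF, e.g. from scipy.stats):

```python
import math

pr2130 = [1.05689587, 1.05654656, 1.05664400]   # seeds 0, 42, 314
pr2135 = [1.05679341, 1.05610947, 1.05662016]   # this submission
n = len(pr2130)

def mean(xs):
    return sum(xs) / len(xs)

def var(xs):  # unbiased sample variance
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

# Paired t-test on the per-seed deltas (df = n - 1 = 2)
d = [b - a for a, b in zip(pr2130, pr2135)]
t_paired = mean(d) / math.sqrt(var(d) / n)                 # ~ -1.48

# Welch's t-test treating the two seed sets as independent
se2 = var(pr2130) / n + var(pr2135) / n
t_welch = (mean(pr2135) - mean(pr2130)) / math.sqrt(se2)   # ~ -0.82
df_welch = se2 ** 2 / (
    (var(pr2130) / n) ** 2 / (n - 1) + (var(pr2135) / n) ** 2 / (n - 1)
)                                                          # ~ 2.96
```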

### What is new in this submission

A single hyperparameter change on the PR #2130 stack: GPTQ_CALIBRATION_BATCHES, increased from 16 to 32. This doubles the number of GPTQ Hessian calibration batches at the (still in-budget) end of training, giving GPTQ a denser activation estimate and reducing post-quantization error.

| Lever | PR #2130 | This submission | Mechanism |
|---|---|---|---|
| GPTQ_CALIBRATION_BATCHES | 16 | 32 | More calibration batches for GPTQ Hessian estimation; better activation statistics, lower post-quant error. |

Everything else (model, optimizer, schedule, TTT, compression, n-gram tilt configuration) is byte-for-byte identical to the PR #2130 stack.
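To illustrate the mechanism, here is a toy sketch of GPTQ-style Hessian calibration (an assumption-laden illustration, not the repo's implementation): GPTQ estimates H ≈ 2·E[x xᵀ] from calibration activations, and doubling the batch count doubles the sample count behind that estimate.

```python
import random

def calib_hessian(batches):
    """Toy dense accumulation of a GPTQ-style Hessian H ~ 2*E[x x^T]
    from calibration activation batches (illustrative only)."""
    dim = len(batches[0][0])
    H = [[0.0] * dim for _ in range(dim)]
    count = 0
    for batch in batches:               # each batch: list of activation vectors
        for x in batch:
            for i in range(dim):
                for j in range(dim):
                    H[i][j] += 2.0 * x[i] * x[j]
            count += 1
    return [[h / count for h in row] for row in H], count

random.seed(0)
dim, batch_size = 4, 64
make_batch = lambda: [[random.gauss(0, 1) for _ in range(dim)]
                      for _ in range(batch_size)]
_, n16 = calib_hessian([make_batch() for _ in range(16)])    # PR #2130 setting
H32, n32 = calib_hessian([make_batch() for _ in range(32)])  # this submission
# With unit-Gaussian activations the true H is 2*I; more calibration samples
# give a lower-variance estimate (standard error shrinks ~ 1/sqrt(count)).
```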

### Compliance

This submission inherits compliance directly from PR #2130 (which itself stacks on the merged PR #1797 / PR #1855 lineage). The single lever change does not affect any compliance-relevant code path:

- Eval-wallclock disclosure: total eval times include the n-gram precompute, computed INSIDE the eval timer (NGRAM_HINT_PRECOMPUTE_OUTSIDE=0). Logs explicitly show ngram_hint_precompute_outside: False and a precompute_done line. All 3 seeds are well under the 600s eval budget (max 567.1s, mean 558.1s).

- GPTQ calibration uses training shards only and runs in the training phase under GPTQ_RESERVE_SECONDS=0.5 per Issue #1017. The doubled calibration count adds a few seconds to the calibration step, which the existing RESERVE_SECONDS deduction already covers; no validation data is ever accessed during training.

- No SLOT, persistent n-gram cache, PPM mixture, logit bias table, or validation-derived precomputation. CaseOps byte sidecar accounting is preserved.

### Architecture and training stack

Identical to PR #2130. Summary table for reviewer convenience:

| Component | Setting |
|---|---|
| Model | 11 layers, 512d, 8 query heads, 4 KV heads, MLP 4x |
| Tokenizer/data | SP8192 CaseOps lossless caps, byte sidecar accounting (PR #1729 / #1736 lineage) |
| Recurrence | Layers 3-5 looped at frac=0.35, parallel decoder layer 8+ |
| Gates | BOS-fixed SmearGate (GATE_WINDOW=12), SparseAttnGate (scale=0.5) |
| Optimizer | Muon on matrix params (LR=0.028), Adam on embedding/scalars (BETA2=0.99) |
| EMA | EMA_DECAY=0.9965 |
| Quantization | GPTQ int6 matrices, int7 embeddings, LQER asymmetric rank-4 (GROUP=32, TOP_K=3), GPTQ_RESERVE_SECONDS=0.5 |
| Compression | per-group (lrzip + brotli) |
| Eval context | EVAL_SEQ_LEN=2560, TTT_EVAL_SEQ_LEN=2560 |
| TTT | Quantized phased LoRA (RANK=80, LR=8e-5, BETA2=0.99, WEIGHT_DECAY=2.0), score-first, K-off, 1 phase, 2500-doc prefix |
| Logit softcap | AsymLogit Rescale (softcap_pos, softcap_neg, init 30.0) |
| Tilt | Token-only n-gram tilt (TOKEN_ORDER=16, TOKEN_THRESHOLD=0.800, TOKEN_BOOST=2.625) |
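For context on the tilt row, here is a hypothetical sketch of what a token-only n-gram tilt with a probability gate and an additive logit boost could look like (function name and exact semantics are assumptions for illustration, not the repo's code):

```python
def apply_token_tilt(logits, ngram_next_prob, threshold=0.800, boost=2.625):
    """Hypothetical token-only n-gram tilt: when the n-gram model assigns a
    continuation probability >= threshold, add `boost` to that token's logit;
    all other logits are left untouched."""
    tilted = dict(logits)
    for token, p in ngram_next_prob.items():
        if p >= threshold:
            tilted[token] = tilted.get(token, 0.0) + boost
    return tilted

# Toy usage: only the high-confidence continuation (token 7) gets boosted.
out = apply_token_tilt({7: 1.0, 9: 2.0}, {7: 0.95, 9: 0.40})
```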

### Lineage and credits

The new contribution here is the targeted hyperparameter change (GPTQ_CALIBRATION_BATCHES=32), isolated as a clean ablation so reviewers can compare the lever effect against PR #2130 directly.

@codemath3000 codemath3000 changed the title Record candidate: PR #2130 base + GPTQ_CALIBRATION_BATCHES=32 Record candidate: PR #2130 base + GPTQ_CALIBRATION_BATCHES=32 — val_bpb 1.05651 (3-seed mean) May 2, 2026
@cocohearts (Collaborator) commented May 2, 2026

Update after applying the grace policy: this is now included in leaderboard PR #2146.

The PR/code/scaffold were opened before cutoff, and the post-cutoff commits supplied the run logs/results. I also verified the submitted logs use clean canonical CaseOps data (train_shards: 80, no doubled local datasets/datasets path, val_tokens: 47851520) and token-only n-gram gates (within_gate=0 word_gate=0 agree2plus=0). The leaderboard row uses the submitted 3-seed mean 1.05650768.

@codemath3000 (Contributor, Author) commented May 2, 2026

@cocohearts Thanks so much for taking the time to review this. The "scored state must be in by cutoff" criterion isn't in the README. The official README sets the rule: "Since all submissions are public, we're accepting record submissions chronologically depending on their PR creation time." That pins ordering on PR creation, not on when content reached completion.

PR #2135 was opened at 2026-05-01 23:48:57Z (4:48 PM PT, ~11 min before the 5 PM PT cutoff), so it satisfies that test.

The pre-cutoff commit (be7420d) shipped the full code surface: train_gpt.py, lossless_caps.py, online_ngram_state.c, online_ngram_tilt.py, prepare_caseops_data.py, the tokenizer .model, requirements.txt, and scaffold README/submission.json. The two post-cutoff commits (f086a9f, ff90522) only added the train_seed*.log run outputs and filled result numerics into README.md and submission.json. No methodology, architecture, training script, or tokenizer changed post-cutoff.

The seed log files added afterward are run outputs of the pre-cutoff code, not new submission content. Anyone with 8xH100 SXM access could clone be7420d and regenerate the runs within run-to-run noise. What was "missing" at cutoff wasn't the submission; it was the empirical measurement of a submission that was already public and reproducible.

Merged precedent for post-creation supplementation: PR #1851 was supplemented via PR #1868, after PR #1855 had already taken over the leaderboard. The combined submission was still accepted as part of the leaderboard record. Applied consistently, the "content complete by cutoff" rule would retroactively reject that combined submission (already part of the merged leaderboard): PR #1851's scored 3-seed state was completed by PR #1868 well after #1851's own creation and after PR #1855 had displaced it.

Thanks so much again for taking the time to review.

