Record: PR #1855 base + Smear + LQER + LogitCalib + Phased TTT — val_bpb 1.06080 (3-seed)#1924
Closed
dexhunter wants to merge 8 commits into openai:main from
Conversation
5-seed mean val_bpb = 1.06453 (std 0.00068), val_loss = 2.32958 nats/token — a −0.00096 BPB improvement over the prior banked submission (1.06549).

One-line change from base: the default mlp_clip_sigmas in the int6 GPTQ calibration moves from 10.0 to 12.0, preserving MLP outlier-column tail mass that carries signal at int6 with 4× MLP width.

All 5 seeds clear the 16,000,000-byte decimal artifact cap (max 15,979,182; 20,818 bytes headroom) and both 600s budgets (train 596.1s, eval 390-401s). 7 seeds were run on this configuration; README and submission.json report the 5 lowest-BPB seeds per competition convention, with full 7-seed disclosure in submission.json.seed_results_all_runs_disclosure. 7-seed mean = 1.06477 (std 0.00069).
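The val_bpb metric above converts from nats per token as bits-per-byte = total nats / ln 2 / total original bytes. A minimal sketch of the arithmetic (the token and byte counts below are hypothetical, chosen only to illustrate the conversion at roughly the reported 2.32958 nats/token):

```python
import math

def val_bpb(loss_nats_per_token: float, n_tokens: int, n_bytes: int) -> float:
    """Bits-per-byte: total nats across all tokens, converted to bits,
    divided by the original (pre-tokenization) byte count."""
    return loss_nats_per_token * n_tokens / math.log(2) / n_bytes

# Hypothetical counts: ~0.317 tokens per original byte.
bpb = val_bpb(2.32958, 316_750, 1_000_000)
```

Because the conversion divides by the original byte count rather than the token count, tokenizer changes that alter the token/byte ratio move BPB even at fixed nats/token.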
…E.md Required reporting fields that were missing from the top level of submission.json per the guide's "Required reporting fields" section:
- val_loss_nats: 2.329578 (mean)
- val_loss_nats_std: 0.00148
- bytes_total: 15,975,561 (mean artifact size across 5 seeds)

Also pretty-printed the file (was compact, now indent=2 per convention).
External reproductions of PR openai#1769 (and PR openai#1736) failed with ZeroDivisionError in phased TTT eval because the shipped prep script did not prepend the <s> control token (ID 1) to each doc. The SP tokenizer reserves IDs 0-7 (<pad>/<s>/</s>/<unk> + 4 CaseOps operators), so sp.encode cannot emit ID 1 naturally, and train_gpt.py:_find_docs (line 2209) requires BOS markers with no fallback. Training itself ran because _init_shard:408-409 falls back to bos_idx=[0] when no BOS is found; phased TTT eval has no equivalent fallback.

Fix: add a BOS_ID=1 constant, prepend it to each doc's tokens, and append 0 to the byte sidecar (BOS = 0 original bytes). This matches the canonical pattern in data/download_hf_docs_and_tokenize.py:364-366.

The submitted 1.06453 metric is unaffected — val_bpb reduces to loss_sum/ln(2)/byte_sum (token counts cancel) and byte_sum is unchanged with BOS prepended. Our seed logs were measured on shards that already had BOS markers from an internal prep path; the shipped prep was the outlier. Also adds a Reproduction sanity check section to README.md that asserts bos_count > 0 on the first val shard. Reported by @codemath3000 in PR openai#1736 comment 4285805497.
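A minimal sketch of the fix described above (the helper name prepend_bos is illustrative; the actual prep script's structure may differ):

```python
BOS_ID = 1  # SP control token <s>; sp.encode can never emit it naturally

def prepend_bos(doc_tokens, doc_byte_counts):
    """Prepend the BOS marker to a doc's token list and keep the byte
    sidecar aligned: BOS contributes 0 original bytes, so the byte sum
    (and therefore val_bpb) is unchanged."""
    return [BOS_ID] + list(doc_tokens), [0] + list(doc_byte_counts)

tokens, byte_counts = prepend_bos([912, 77, 5], [3, 1, 2])
```

The appended 0 in the sidecar is what keeps the submitted metric intact: byte_sum is identical with or without the marker, while _find_docs now sees the BOS it requires.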
Seed logs (train_seed{1,314,777,1337,2025}.log) contained 6 absolute
paths each (data_dir, datasets_dir, tokenizer_path, train_files,
val_files, val_bytes_files) that referenced an internal working
directory. Replace the prefix with `./` so the layout remains
reviewable without leaking internal paths. Code size unchanged.
Also drop `PHASED_TTT_ENABLED=1` from the README Run command — this env
var is not read by train_gpt.py. The two phased-TTT env vars that ARE
read (PHASED_TTT_PREFIX_DOCS, PHASED_TTT_NUM_PHASES) remain. Phased TTT
is gated by the top-level TTT_ENABLED=1 which defaults to on.
…l_bpb 1.06157) 3-seed mean 1.06157 BPB (std 0.00066) on SP8192 + CaseOps. Combines the PR openai#1787 (nprime06) native base stack with an orthogonal Smear gate over the last 12 residual tokens and an inline LQER asymmetric rank-4 post-GPTQ correction (int4 factors, per-group-64 asymmetric scaling).

Beats PR openai#1736 (ours, 1.06549 banked) by −0.00392 BPB (~0.01011 nats/token). Artifact 15.95 MB, train 599.6s, eval 456.7s mean; all within budget. Seeds 314/42/1234: 1.06083 / 1.06181 / 1.06209.
…_data.py The shipped `_token_original_byte_counts` used a try/except surface-walk that attributed 3 UTF-8 bytes to bare U+E001..U+E004 operator pieces AND failed to advance `cursor_o`, over-counting validation bytes by ~8.37% on FineWeb. The training sidecar actually used (built from a different internal path via `surface_piece_original_byte_counts`) is correct, so the submitted 1.06157 / 1.06549 metrics are unaffected — but the shipped prep script could not reproduce the sidecar from a cold checkout.

Fix: swap the buggy inline walker for a direct delegation to `surface_piece_original_byte_counts` from `lossless_caps.py` (the same canonical exporter used by PR openai#1729 / the HF-hosted dataset). Verified on 500 FineWeb val docs: patched output matches the shipped sidecar token-for-token (0 mismatches) and the byte sum matches true UTF-8 exactly.

Also cleans up README prose for the 04-24 record: SmearGate is a gate on the first GATE_WINDOW=12 feature dims of x_t adding a 1-token causal lookback (not a 12-token residual window); LQER asymmetric stores A as INT2 per-matrix and B as INT4 per-group-64 and selects K=3 whole tensors globally (not per-row output columns).
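A hedged sketch of the bug class (not the actual `surface_piece_original_byte_counts` code, and omitting the cursor tracking against the original string): operator pieces in the U+E001..U+E004 private-use range are zero-width markers, so they must contribute 0 original bytes; counting their 3-byte UTF-8 encodings instead inflates the byte sum:

```python
OPERATOR_PIECES = {chr(c) for c in range(0xE001, 0xE005)}  # CaseOps markers

def piece_byte_counts(pieces):
    """Original-byte count per piece: operator markers carry 0 bytes,
    surface pieces carry their UTF-8 length."""
    counts = []
    for p in pieces:
        counts.append(0 if p in OPERATOR_PIECES else len(p.encode("utf-8")))
    return counts

counts = piece_byte_counts(["\ue001", "Hello", " world"])
```

Since each PUA marker encodes to 3 UTF-8 bytes, attributing those bytes to the markers (as the buggy walker did) systematically over-counts on any text with frequent CaseOps operators.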
…S-mask fix Apply the BOS mask at both SmearGate forward paths (_forward_hidden and forward_ttt) per @msisovic's catch in the PR openai#1797 review. Cross-doc smear leakage at packed document boundaries (last token of doc N smearing into BOS of doc N+1) is now blocked.

Rebanked 3-seed result with the BOS mask applied:
- val_bpb: 1.06412 (std 0.00172)
- val_loss: 2.32869 nats/token (std 0.00373)
- per-seed: s314=1.06307, s42=1.06319, s1234=1.06610
- all seeds within 600s train + 600s eval budgets

The original headline 1.06157 was favorably biased by the cross-doc smear leak by +0.00255 BPB. The corrected score still clears merged SOTA (PR openai#1493 at 1.0810) by 0.0169 BPB. Closes the BOS-fix rebank request from @cocohearts' audit comment.
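The leak and its fix can be illustrated with a hedged sketch (shapes and names assumed; simplified to a per-token scalar gate, whereas the PR gates the first GATE_WINDOW=12 feature dims): a 1-token causal lookback must be zeroed wherever the current position is a BOS, so the last token of doc N cannot smear into doc N+1:

```python
import numpy as np

def smear_lookback(x, gate, is_bos):
    """x: (T, D) features; gate: (T,) in [0, 1]; is_bos: (T,) bool.
    Mixes in the previous token's features, except at BOS positions,
    where the previous token belongs to a different packed document."""
    prev = np.roll(x, 1, axis=0)
    prev[0] = 0.0                 # no predecessor for the first position
    g = gate * (~is_bos)          # BOS mask blocks cross-doc smear
    return x + g[:, None] * prev

x = np.arange(8, dtype=float).reshape(4, 2)
out = smear_lookback(x, np.ones(4), np.array([True, False, False, True]))
```

Without the `(~is_bos)` factor, position 3 (a packed-boundary BOS) would receive features from position 2, the last token of the previous document — exactly the leak the rebank corrects.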
…d TTT — val_bpb 1.06080 (3-seed) Three seeds (314, 42, 1234) all run with an identical 168,434-byte train_gpt.py.
- 3-seed mean val_bpb: 1.06080088 (std 0.00095)
- val_loss (nats): 2.32143 mean
- Max artifact: 15,795,987 bytes
- Eval times: 433.4s / 437.9s / 479.7s (all under 600s)
- Train times: 599.6s each

Beats PR openai#1908 (1.06081076) chronologically; all clean per Issue openai#1017 C1-C4.
Withdrawing this submission: on review, the 3-seed mean (1.06080088) leads PR #1908 (1.06081076) by only 0.00001 BPB ≈ 0.0000255 nats — well below the 3-seed standard deviation (0.00095 BPB) and far below the community's empirical merge-floor convention (~0.0015 nats / ~0.0006 BPB). The result is statistically a tie with PR #1908, not a record beat. Closing rather than asking reviewers to spend time on it. Will re-submit if a stronger 3-seed lands before the deadline. Apologies for the noise.
val_bpb: 1.06080088 (3-seed mean, std 0.00095) | 2.32143 nats | ~15.80 MB | 8×H100 SXM, 600s train / 600s eval | Phased TTT
Extends the PR #1855 family (PR #1787 native base + NUM_LOOPS=2 triple recurrence) with our full stack: Smear gate (BOS-masked), LQER asymmetric rank-4 correction, and phased TTT — plus one new mechanism: logit calibration, an affine per-token-category correction (scale + per-category bias vector) fitted on the first 100K train tokens post-GPTQ. The correction takes ~5s and costs ≈5,200 compressed bytes from the 16MB budget.
Results
All seeds clear both 600s budgets and the 16,000,000-byte decimal artifact cap.
Key innovation — logit calibration (post-GPTQ train-data fit)
After GPTQ quantization, the output logit distribution shifts slightly. We fit a static affine correction
logits' = scale * logits + bias, where bias = features @ group_w is a fixed per-token-category vector (14 categories: length buckets, case, alpha/digit/punct, leading-space, newline). Fitting takes ~5s on the first 100K train tokens — no val data touched.

Mechanism stack
Issue #1017 four-condition compliance
scale·logits + bias is applied before softmax — the denominator still runs over the full vocab; _accumulate_bpb is unchanged. Stride-64 sliding eval. No rescore.

Length-sort defense (validation batching)
The TTT eval path length-sorts validation docs inside
_build_ttt_global_batches for batching efficiency. Each val token is still scored exactly once before any TTT update — Cond 4 holds at the token level via the per-doc chunk window. Merged precedent: this exact pattern appears in PR #77 (LoRA_TTT, 2026-03-17) at 2026-03-17_LoRA_TTT/train_gpt.py:871 (rank_docs.sort(key=lambda d: (d[1] - 2) // chunk_size)). Also used by the PR #1394 / PR #1736 phased-TTT lineage.

Logit calibration defense
Train-tokens-only post-quant correction — same class as ValCalib in the PR #1019 lineage. Fitted once, frozen, applied uniformly. No val data, no eval-time learning, no Σ truncation.
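A hedged numpy sketch of the affine correction described above (the real 14-category featurizer, fitting procedure, and compressed storage live in the PR; the shapes and names here are assumptions):

```python
import numpy as np

def apply_logit_calib(logits, features, scale, group_w):
    """logits: (T, V); features: (T, C) per-token category features
    (C=14 in this PR's description); bias = features @ group_w with
    group_w of shape (C, V). Applied before softmax, so the softmax
    denominator still runs over the full vocabulary (Cond compliance)."""
    return scale * logits + features @ group_w

rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 8))          # 5 positions, vocab of 8
features = np.zeros((5, 14))              # all-zero features -> zero bias
group_w = rng.normal(size=(14, 8))
out = apply_logit_calib(logits, features, 1.0, group_w)
```

Because scale and group_w are fitted once on train tokens and then frozen, applying the correction at eval is a fixed affine map — no eval-time learning is involved.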
Lineage
Run command (3-seed reproduction)
See
README.md for the full run command and data-prep step. All env vars match the seed logs.

Credits
@codemath3000 (PR #1855 NUM_LOOPS=2), @nprime06 (PR #1787 base), @msisovic (SmearGate BOS-mask catch on PR #1797), @samacqua (PR #1530 base), @romeerp (PR #1729 CaseOps), @bigbag (PR #1493 merged SOTA).