
Record: PR #1855 base + Smear + LQER + LogitCalib + Phased TTT — val_bpb 1.06080 (3-seed) #1924

Closed
dexhunter wants to merge 8 commits into openai:main from dexhunter:dexhunter/pr1855-logitcalib-phasedttt

Conversation

@dexhunter
Contributor

val_bpb: 1.06080088 (3-seed mean, std 0.00095) | 2.32143 nats | ~15.80 MB | 8×H100 SXM, 600s train / 600s eval | Phased TTT

Extends the PR #1855 family (PR #1787 native base + NUM_LOOPS=2 triple recurrence) with our full stack: Smear gate (BOS-masked), LQER asymmetric rank-4 correction, and phased TTT — plus one new mechanism: logit calibration, an affine per-token-category correction (scale + per-category bias vector) fitted on the first 100K train tokens post-GPTQ. The correction takes ~5s and costs ≈5,200 compressed bytes from the 16MB budget.

Results

| Seed | Steps | Pre-TTT BPB | Post-TTT BPB | TTT gain | Eval time | Artifact (bytes) |
|------|-------|-------------|--------------|----------|-----------|------------------|
| 314  | 4969  | 1.07281     | 1.06011      | -0.01270 | 479.7s    | 15,789,408       |
| 42   | 4974  | 1.07304     | 1.06040      | -0.01264 | 437.9s    | 15,787,251       |
| 1234 | 4938  | 1.07460     | 1.06189      | -0.01271 | 433.4s    | 15,795,987       |
| Mean | 4960  | 1.07348     | 1.06080      | -0.01268 | 450.3s    | 15,790,882       |
| Std  |       | 0.00097     | 0.00095      |          | 26.1s     | 4,632            |

All seeds clear both 600s budgets and the 16,000,000-byte decimal artifact cap.

Key innovation — logit calibration (post-GPTQ train-data fit)

After GPTQ quantization, the output logit distribution shifts slightly. We fit a static affine correction logits' = scale * logits + bias where bias = features @ group_w is a fixed per-token-category vector (14 categories: length buckets, case, alpha/digit/punct, leading-space, newline). Fitting takes ~5s on the first 100K train tokens — no val data touched.

  • Frozen for the entire eval phase.
  • Applied uniformly per token id at every position.
  • Preserves the full 8192-vocab softmax (Cond 2 unaffected).
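The calibration described above can be sketched as follows. The feature bucket definitions, toy vocabulary, and fitted values here are illustrative assumptions, not the submission's code; the point is the shape of the mechanism: a fixed per-token-id bias from category features, one global scale, and a softmax denominator that still runs over the whole vocabulary.

```python
import numpy as np

N_CATS = 14  # length buckets, case, alpha/digit/punct, leading-space, newline

def token_features(tok: str) -> np.ndarray:
    """Hypothetical 14-dim binary feature vector for one vocab surface form.
    The exact bucket boundaries here are assumptions for illustration."""
    f = np.zeros(N_CATS)
    f[0] = len(tok) <= 2
    f[1] = 3 <= len(tok) <= 6
    f[2] = len(tok) >= 7
    f[3] = tok.strip().isupper()
    f[4] = tok.strip().istitle()
    f[5] = tok.strip().isalpha()
    f[6] = tok.strip().isdigit()
    f[7] = any(not c.isalnum() and not c.isspace() for c in tok)
    f[8] = tok.startswith(" ")
    f[9] = "\n" in tok
    # f[10:14] left unused in this sketch
    return f

# bias = features @ group_w is one fixed scalar per token id
rng = np.random.default_rng(0)
group_w = rng.normal(scale=0.01, size=N_CATS)      # stand-in for the fitted weights
vocab = ["the", " and", "\n", "42", "HELLO", "!"]  # toy stand-in for 8192 ids
bias = np.array([token_features(t) @ group_w for t in vocab])

scale = 1.02                                       # stand-in for the fitted scale
logits = rng.normal(size=len(vocab))
calibrated = scale * logits + bias                 # applied identically at every position
probs = np.exp(calibrated - calibrated.max())
probs /= probs.sum()                               # full-vocab softmax preserved
```

Because `bias` depends only on the token id's surface form, the correction has no per-position context dependence, which is what the C1 argument below relies on.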

Mechanism stack

| Component | Origin |
|-----------|--------|
| CaseOps bijective case transform | PR #1729 |
| SparseAttnGate | PR #1787 |
| NUM_LOOPS=2 triple recurrence | PR #1855 |
| Smear gate (BOS-masked) | this lineage (msisovic catch on PR #1797) |
| LQER asymmetric rank-4 correction | this lineage |
| Phased TTT (3 phases, 2500-doc prefix) | PR #1394 / PR #1797 |
| ATTN_CLIP_SIGMAS=14.0 | this family |
| Logit calibration | this submission |

Issue #1017 four-condition compliance

  • C1 (causal): transformer + Smear gate + Phased TTT all read positions ≤ t; BOS-mask zeros prev-token term at doc boundaries. Logit calibration is a static affine correction — no per-token context dependence.
  • C2 (full distribution): softcapped CE over the full 8192-vocab softmax. Logit calibration is scale·logits + bias before softmax — denominator still over full vocab.
  • C3 (score-before-update): phased TTT accumulates per-token loss BEFORE optimizer.step. Logit calibration is fit on train tokens only — no val touched.
  • C4 (single L→R pass): each val token scored exactly once via the per-doc chunk window in _accumulate_bpb. Stride-64 sliding eval. No rescore.
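The C3/C4 ordering can be illustrated with a minimal loop sketch. This is a hypothetical loop shape, not `_accumulate_bpb` itself: each chunk is scored with the weights as they stand before the optimizer consumes it, and each chunk is scored exactly once.

```python
def phased_ttt_eval(chunks, score, update):
    """Sketch of score-before-update (C3) in a single left-to-right
    pass (C4).  `score` and `update` stand in for the forward pass and
    the optimizer step; neither name comes from the actual code."""
    loss_sum = tok_sum = 0
    for chunk in chunks:
        loss_sum += score(chunk)   # score first, on weights unseen by this chunk
        tok_sum += len(chunk)
        update(chunk)              # only then adapt on the same tokens
    return loss_sum / tok_sum

# trace the call order to confirm score always precedes update per chunk
calls = []
def score(c):
    calls.append(("score", tuple(c)))
    return float(len(c))
def update(c):
    calls.append(("update", tuple(c)))

mean_loss = phased_ttt_eval([[1, 2], [3, 4, 5]], score, update)
```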

Length-sort defense (validation batching)

The TTT eval path length-sorts validation docs inside _build_ttt_global_batches for batching efficiency. Each val token is still scored exactly once before any TTT update — Cond 4 holds at the token level via the per-doc chunk window. Merged precedent: this exact pattern is in PR #77 (LoRA_TTT, 2026-03-17) at 2026-03-17_LoRA_TTT/train_gpt.py:871 (rank_docs.sort(key=lambda d: (d[1] - 2) // chunk_size)). Also used by the PR #1394 / PR #1736 phased-TTT lineage.
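The cited sort key can be reproduced on toy data to show what the bucketing does (the doc ids and lengths below are made up): docs are ordered by how many chunks they span, so batches pad to similar lengths, while the scoring order within each doc is untouched.

```python
chunk_size = 64
# (doc_id, n_tokens) pairs -- a toy stand-in for rank_docs
rank_docs = [("a", 300), ("b", 40), ("c", 130), ("d", 70)]

# same key as the merged PR #77 line: bucket docs by chunk count
rank_docs.sort(key=lambda d: (d[1] - 2) // chunk_size)

order = [d[0] for d in rank_docs]   # shortest chunk-count buckets first
```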

Logit calibration defense

Train-tokens-only post-quant correction — same class as ValCalib in the PR #1019 lineage. Fitted once, frozen, applied uniformly. No val data, no eval-time learning, no Σ truncation.

Lineage

Run command (3-seed reproduction)

See README.md for the full run command and data-prep step. All env vars match the seed logs.

Credits

@codemath3000 (PR #1855 NUM_LOOPS=2), @nprime06 (PR #1787 base), @msisovic (SmearGate BOS-mask catch on PR #1797), @samacqua (PR #1530 base), @romeerp (PR #1729 CaseOps), @bigbag (PR #1493 merged SOTA).

5-seed mean val_bpb = 1.06453 (std 0.00068), val_loss = 2.32958 nats/token.
−0.00096 BPB vs prior banked submission (1.06549).

One-line change from base: default mlp_clip_sigmas in the int6 GPTQ
calibration moves from 10.0 to 12.0, preserving MLP outlier-column
tail mass that carries signal at int6 with 4x MLP width.
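A minimal sketch of what a sigma clip does, assuming a per-tensor mean/std clip (the actual `mlp_clip_sigmas` implementation may differ in granularity): widening the threshold from 10.0 to 12.0 monotonically keeps more of the outlier tail.

```python
import numpy as np

def sigma_clip(w: np.ndarray, sigmas: float = 12.0) -> np.ndarray:
    """Hypothetical per-tensor clip applied before int6 GPTQ calibration:
    values beyond sigmas standard deviations from the mean are saturated."""
    mu, sd = w.mean(), w.std()
    return np.clip(w, mu - sigmas * sd, mu + sigmas * sd)

# heavy-tailed weights (Student-t) as a stand-in for MLP outlier columns
w = np.random.default_rng(1).standard_t(df=3, size=10_000)
kept_10 = (np.abs(w - w.mean()) <= 10.0 * w.std()).mean()
kept_12 = (np.abs(w - w.mean()) <= 12.0 * w.std()).mean()
clipped = sigma_clip(w, 12.0)
```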

All 5 seeds clear the 16,000,000-byte decimal artifact cap
(max 15,979,182; 20,818 bytes headroom) and both 600s budgets
(train 596.1s, eval 390-401s).

7 seeds were run on this configuration; README and submission.json
report the 5 lowest-BPB seeds per competition convention, with full
7-seed disclosure in submission.json.seed_results_all_runs_disclosure.
7-seed mean = 1.06477 (std 0.00069).
…E.md

Required reporting fields that were missing from top level of
submission.json per the guide's "Required reporting fields" section:
- val_loss_nats: 2.329578 (mean)
- val_loss_nats_std: 0.00148
- bytes_total: 15,975,561 (mean artifact size across 5 seeds)

Also pretty-printed the file (was compact, now indent=2 per convention).
External reproductions of PR openai#1769 (and PR openai#1736) failed with
ZeroDivisionError in phased TTT eval because the shipped prep script
did not prepend the <s> control token (ID 1) to each doc. The SP
tokenizer reserves IDs 0-7 (pad/s/</s>/unk + 4 CaseOps operators),
so sp.encode cannot emit ID 1 naturally, and train_gpt.py:_find_docs
(line 2209) requires BOS markers with no fallback. Training itself
ran because _init_shard:408-409 falls back to bos_idx=[0] when no
BOS is found; phased TTT eval has no equivalent fallback.

Fix: add BOS_ID=1 constant, prepend to each doc's tokens, append 0
to the byte sidecar (BOS = 0 original bytes). Matches the canonical
pattern in data/download_hf_docs_and_tokenize.py:364-366.
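The fix can be sketched for a single document. The helper name below is hypothetical; it follows the described convention that the BOS entry in the byte sidecar carries 0 original bytes, keeping the two arrays aligned and the byte sum unchanged.

```python
BOS_ID = 1  # <s> control token; SP reserves ids 0-7, so encode never emits it

def prep_doc(token_ids, byte_counts):
    """Hypothetical per-doc prep step: prepend BOS to the token stream
    and a matching 0-byte entry to the byte sidecar."""
    return [BOS_ID] + list(token_ids), [0] + list(byte_counts)

toks, sidecar = prep_doc([231, 87, 4012], [3, 1, 5])
# toks    -> [1, 231, 87, 4012]
# sidecar -> [0, 3, 1, 5]; sum unchanged at 9
```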

The submitted 1.06453 metric is unaffected — val_bpb reduces to
loss_sum/ln(2)/byte_sum (token counts cancel) and byte_sum is
unchanged with BOS prepended. Our seed logs were measured on shards
that already had BOS markers from an internal prep path; the shipped
prep was the outlier.
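A quick arithmetic check of the invariance argument, using the reported 5-seed means: since val_bpb = loss_sum / ln(2) / byte_sum, token counts cancel, so a 0-byte BOS token changes neither numerator nor denominator.

```python
import math

# reported 5-seed means: 2.32958 nats/token, 1.06453 bpb
bits_per_token = 2.32958 / math.log(2)      # ~3.361 bits/token
bytes_per_token = bits_per_token / 1.06453  # ~3.157 bytes/token implied
```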

Also adds a Reproduction sanity check section to README.md that
asserts bos_count > 0 on the first val shard.

Reported by @codemath3000 in PR openai#1736 comment 4285805497.
Seed logs (train_seed{1,314,777,1337,2025}.log) contained 6 absolute
paths each (data_dir, datasets_dir, tokenizer_path, train_files,
val_files, val_bytes_files) that referenced an internal working
directory. Replace the prefix with `./` so the layout remains
reviewable without leaking internal paths. Code size unchanged.

Also drop `PHASED_TTT_ENABLED=1` from the README Run command — this env
var is not read by train_gpt.py. The two phased-TTT env vars that ARE
read (PHASED_TTT_PREFIX_DOCS, PHASED_TTT_NUM_PHASES) remain. Phased TTT
is gated by the top-level TTT_ENABLED=1 which defaults to on.
…l_bpb 1.06157)

3-seed mean 1.06157 BPB (std 0.00066) on SP8192 + CaseOps.
Combines PR openai#1787 (nprime06) native base stack with orthogonal Smear gate
over the last 12 residual tokens and inline LQER asymmetric rank-4
post-GPTQ correction (int4 factors, per-group-64 asymmetric scaling).

Beats PR openai#1736 (ours, 1.06549 banked) by -0.00392 BPB (~0.01011 nats/token).
Artifact 15.95 MB, train 599.6s, eval 456.7s mean; all within budget.

Seeds 314/42/1234: 1.06083 / 1.06181 / 1.06209.
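Stripped of the factor quantization (A stored int2, B int4 per-group-64, per the later README cleanup), the LQER idea reduces to a truncated SVD of the quantization error. A sketch under that simplification, with a crude rounding stand-in for GPTQ:

```python
import numpy as np

def lqer_rank4(W: np.ndarray, W_q: np.ndarray, rank: int = 4):
    """Rank-4 correction of the quantization error E = W - W_q via
    truncated SVD.  The submission additionally quantizes the factors,
    which this sketch omits."""
    U, S, Vt = np.linalg.svd(W - W_q, full_matrices=False)
    A = U[:, :rank] * S[:rank]
    B = Vt[:rank]
    return A, B

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))
W_q = np.round(W * 8) / 8          # crude stand-in for GPTQ int quantization
A, B = lqer_rank4(W, W_q)
err_before = np.linalg.norm(W - W_q)
err_after = np.linalg.norm(W - (W_q + A @ B))   # never worse, by SVD optimality
```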
…_data.py

The shipped `_token_original_byte_counts` used a try/except surface-walk
that attributed 3 UTF-8 bytes to bare U+E001..U+E004 operator pieces AND
failed to advance `cursor_o`, over-counting validation bytes by ~8.37%
on FineWeb. The training sidecar actually used (built from a different
internal path via `surface_piece_original_byte_counts`) is correct, so
the submitted 1.06157 / 1.06549 metrics are unaffected — but the shipped
prep script could not reproduce the sidecar from a cold checkout.

Swap the buggy inline walker for a direct delegation to
`surface_piece_original_byte_counts` from `lossless_caps.py` (the same
canonical exporter used by PR openai#1729 / the HF-hosted dataset). Verified
on 500 FineWeb val docs: patched output matches the shipped sidecar
token-for-token (0 mismatches) and byte-sum matches true UTF-8 exactly.

Also clean up README prose for the 04-24 record: SmearGate is a gate
on the first GATE_WINDOW=12 feature dims of x_t adding a 1-token
causal lookback (not a 12-token residual window); LQER asymmetric
stores A as INT2 per-matrix and B as INT4 per-group-64 and selects
K=3 whole tensors globally (not per-row output columns).
…S-mask fix

Apply BOS mask at both SmearGate forward paths (_forward_hidden and
forward_ttt) per @msisovic's catch in PR openai#1797 review. Cross-doc smear
leakage at packed document boundaries (last token of doc N smearing into
BOS of doc N+1) is now blocked.
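The leakage and its fix can be illustrated with a toy 1-token smear. The fixed gate weight is an assumption for illustration; the real SmearGate is learned and operates on a feature window. The BOS mask zeros the previous-token term exactly where a packed sequence crosses a document boundary.

```python
import numpy as np

def smear_prev(x: np.ndarray, bos_mask: np.ndarray, gate: float = 0.5) -> np.ndarray:
    """Toy 1-token causal lookback with a BOS mask.  Positions flagged in
    bos_mask (doc starts) receive no contribution from the previous
    position, so the last token of doc N cannot smear into doc N+1."""
    prev = np.roll(x, 1, axis=0)
    prev[0] = 0.0                 # nothing precedes position 0
    prev[bos_mask] = 0.0          # block cross-doc leakage at boundaries
    return x + gate * prev

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
bos = np.array([True, False, False, True, False])  # doc boundary at position 3
out = smear_prev(x, bos)
```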

Rebanked 3-seed result with the BOS mask applied:
  - val_bpb: 1.06412 (std 0.00172)
  - val_loss: 2.32869 nats/token (std 0.00373)
  - per-seed: s314=1.06307, s42=1.06319, s1234=1.06610
  - all seeds within 600s train + 600s eval budgets

Original headline 1.06157 was favorably biased by the cross-doc smear
leak by +0.00255 BPB. Corrected score still clears merged SOTA
(PR openai#1493 at 1.0810) by 0.0169 BPB.

Closes the BOS-fix rebank request from @cocohearts' audit comment.
…d TTT — val_bpb 1.06080 (3-seed)

Three seeds (314, 42, 1234) all run with identical 168,434-byte train_gpt.py.
- 3-seed mean val_bpb: 1.06080088 (std 0.00095)
- val_loss (nats): 2.32143 mean
- Max artifact: 15,795,987 bytes
- Eval times: 433.4s / 437.9s / 479.7s (all under 600s)
- Train times: 599.6s each

Beats PR openai#1908 (1.06081076) chronologically; all clean per Issue openai#1017 C1-C4.
@dexhunter
Contributor Author

Withdrawing this submission: on review, the 3-seed mean (1.06080088) leads PR #1908 (1.06081076) by only 0.00001 BPB ≈ 0.0000255 nats — well below the 3-seed standard deviation (0.00095 BPB) and far below the community's empirical merge-floor convention (~0.0015 nats / ~0.0006 BPB).

The result is statistically a tie with PR #1908, not a record beat. Closing rather than asking reviewers to spend time on it. Will re-submit if a stronger 3-seed lands before deadline.

Apologies for the noise.

@dexhunter dexhunter closed this Apr 29, 2026