Record: PR #1855 base + Smear + LQER + LogitCalib + Phased TTT — val_bpb 1.06080 (3-seed)#1924
Closed
dexhunter wants to merge 8 commits into openai:main from
Conversation
5-seed mean val_bpb = 1.06453 (std 0.00068), val_loss = 2.32958 nats/token — a −0.00096 BPB improvement over the prior banked submission (1.06549).

One-line change from base: the default mlp_clip_sigmas in the int6 GPTQ calibration moves from 10.0 to 12.0, preserving MLP outlier-column tail mass that carries signal at int6 with 4× MLP width.

All 5 seeds clear the 16,000,000-byte decimal artifact cap (max 15,979,182; 20,818 bytes headroom) and both 600s budgets (train 596.1s, eval 390-401s). 7 seeds were run on this configuration; README and submission.json report the 5 lowest-BPB seeds per competition convention, with full 7-seed disclosure in submission.json.seed_results_all_runs_disclosure. 7-seed mean = 1.06477 (std 0.00069).
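The val_bpb metric above converts from nats per token as bits-per-byte = total nats / ln 2 / total original bytes. A minimal sketch of the arithmetic (the token and byte counts below are hypothetical, chosen only to illustrate the conversion at roughly the reported 2.32958 nats/token):

```python
import math

def val_bpb(loss_nats_per_token: float, n_tokens: int, n_bytes: int) -> float:
    """Bits-per-byte: total nats across all tokens, converted to bits,
    divided by the original (pre-tokenization) byte count."""
    return loss_nats_per_token * n_tokens / math.log(2) / n_bytes

# Hypothetical counts: ~0.317 tokens per original byte.
bpb = val_bpb(2.32958, 316_750, 1_000_000)
```

Because the conversion divides by the original byte count rather than the token count, tokenizer changes that alter the token/byte ratio move BPB even at fixed nats/token.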
…E.md Required reporting fields that were missing from the top level of submission.json per the guide's "Required reporting fields" section:
- val_loss_nats: 2.329578 (mean)
- val_loss_nats_std: 0.00148
- bytes_total: 15,975,561 (mean artifact size across 5 seeds)

Also pretty-printed the file (was compact, now indent=2 per convention).
External reproductions of PR openai#1769 (and PR openai#1736) failed with ZeroDivisionError in phased TTT eval because the shipped prep script did not prepend the <s> control token (ID 1) to each doc. The SP tokenizer reserves IDs 0-7 (<pad>/<s>/</s>/<unk> + 4 CaseOps operators), so sp.encode cannot emit ID 1 naturally, and train_gpt.py:_find_docs (line 2209) requires BOS markers with no fallback. Training itself ran because _init_shard:408-409 falls back to bos_idx=[0] when no BOS is found; phased TTT eval has no equivalent fallback.

Fix: add a BOS_ID=1 constant, prepend it to each doc's tokens, and append 0 to the byte sidecar (BOS = 0 original bytes). This matches the canonical pattern in data/download_hf_docs_and_tokenize.py:364-366.

The submitted 1.06453 metric is unaffected — val_bpb reduces to loss_sum/ln(2)/byte_sum (token counts cancel) and byte_sum is unchanged with BOS prepended. Our seed logs were measured on shards that already had BOS markers from an internal prep path; the shipped prep was the outlier. Also adds a Reproduction sanity check section to README.md that asserts bos_count > 0 on the first val shard. Reported by @codemath3000 in PR openai#1736 comment 4285805497.
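A minimal sketch of the fix described above (the helper name prepend_bos is illustrative; the actual prep script's structure may differ):

```python
BOS_ID = 1  # SP control token <s>; sp.encode can never emit it naturally

def prepend_bos(doc_tokens, doc_byte_counts):
    """Prepend the BOS marker to a doc's token list and keep the byte
    sidecar aligned: BOS contributes 0 original bytes, so the byte sum
    (and therefore val_bpb) is unchanged."""
    return [BOS_ID] + list(doc_tokens), [0] + list(doc_byte_counts)

tokens, byte_counts = prepend_bos([912, 77, 5], [3, 1, 2])
```

The appended 0 in the sidecar is what keeps the submitted metric intact: byte_sum is identical with or without the marker, while _find_docs now sees the BOS it requires.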
Seed logs (train_seed{1,314,777,1337,2025}.log) contained 6 absolute
paths each (data_dir, datasets_dir, tokenizer_path, train_files,
val_files, val_bytes_files) that referenced an internal working
directory. Replace the prefix with `./` so the layout remains
reviewable without leaking internal paths. Code size unchanged.
Also drop `PHASED_TTT_ENABLED=1` from the README Run command — this env
var is not read by train_gpt.py. The two phased-TTT env vars that ARE
read (PHASED_TTT_PREFIX_DOCS, PHASED_TTT_NUM_PHASES) remain. Phased TTT
is gated by the top-level TTT_ENABLED=1 which defaults to on.
…l_bpb 1.06157) 3-seed mean 1.06157 BPB (std 0.00066) on SP8192 + CaseOps. Combines the PR openai#1787 (nprime06) native base stack with an orthogonal Smear gate over the last 12 residual tokens and an inline LQER asymmetric rank-4 post-GPTQ correction (int4 factors, per-group-64 asymmetric scaling).

Beats PR openai#1736 (ours, 1.06549 banked) by −0.00392 BPB (~0.01011 nats/token). Artifact 15.95 MB, train 599.6s, eval 456.7s mean; all within budget. Seeds 314/42/1234: 1.06083 / 1.06181 / 1.06209.
…_data.py The shipped `_token_original_byte_counts` used a try/except surface-walk that attributed 3 UTF-8 bytes to bare U+E001..U+E004 operator pieces AND failed to advance `cursor_o`, over-counting validation bytes by ~8.37% on FineWeb. The training sidecar actually used (built from a different internal path via `surface_piece_original_byte_counts`) is correct, so the submitted 1.06157 / 1.06549 metrics are unaffected — but the shipped prep script could not reproduce the sidecar from a cold checkout.

Fix: swap the buggy inline walker for a direct delegation to `surface_piece_original_byte_counts` from `lossless_caps.py` (the same canonical exporter used by PR openai#1729 / the HF-hosted dataset). Verified on 500 FineWeb val docs: patched output matches the shipped sidecar token-for-token (0 mismatches) and the byte sum matches true UTF-8 exactly.

Also cleans up README prose for the 04-24 record: SmearGate is a gate on the first GATE_WINDOW=12 feature dims of x_t adding a 1-token causal lookback (not a 12-token residual window); LQER asymmetric stores A as INT2 per-matrix and B as INT4 per-group-64 and selects K=3 whole tensors globally (not per-row output columns).
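A hedged sketch of the bug class (not the actual `surface_piece_original_byte_counts` code, and omitting the cursor tracking against the original string): operator pieces in the U+E001..U+E004 private-use range are zero-width markers, so they must contribute 0 original bytes; counting their 3-byte UTF-8 encodings instead inflates the byte sum:

```python
OPERATOR_PIECES = {chr(c) for c in range(0xE001, 0xE005)}  # CaseOps markers

def piece_byte_counts(pieces):
    """Original-byte count per piece: operator markers carry 0 bytes,
    surface pieces carry their UTF-8 length."""
    counts = []
    for p in pieces:
        counts.append(0 if p in OPERATOR_PIECES else len(p.encode("utf-8")))
    return counts

counts = piece_byte_counts(["\ue001", "Hello", " world"])
```

Since each PUA marker encodes to 3 UTF-8 bytes, attributing those bytes to the markers (as the buggy walker did) systematically over-counts on any text with frequent CaseOps operators.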
…S-mask fix Apply the BOS mask at both SmearGate forward paths (_forward_hidden and forward_ttt) per @msisovic's catch in the PR openai#1797 review. Cross-doc smear leakage at packed document boundaries (last token of doc N smearing into BOS of doc N+1) is now blocked.

Rebanked 3-seed result with the BOS mask applied:
- val_bpb: 1.06412 (std 0.00172)
- val_loss: 2.32869 nats/token (std 0.00373)
- per-seed: s314=1.06307, s42=1.06319, s1234=1.06610
- all seeds within 600s train + 600s eval budgets

The original headline 1.06157 was favorably biased by the cross-doc smear leak by +0.00255 BPB. The corrected score still clears merged SOTA (PR openai#1493 at 1.0810) by 0.0169 BPB. Closes the BOS-fix rebank request from @cocohearts' audit comment.
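The leak and its fix can be illustrated with a hedged sketch (shapes and names assumed; simplified to a per-token scalar gate, whereas the PR gates the first GATE_WINDOW=12 feature dims): a 1-token causal lookback must be zeroed wherever the current position is a BOS, so the last token of doc N cannot smear into doc N+1:

```python
import numpy as np

def smear_lookback(x, gate, is_bos):
    """x: (T, D) features; gate: (T,) in [0, 1]; is_bos: (T,) bool.
    Mixes in the previous token's features, except at BOS positions,
    where the previous token belongs to a different packed document."""
    prev = np.roll(x, 1, axis=0)
    prev[0] = 0.0                 # no predecessor for the first position
    g = gate * (~is_bos)          # BOS mask blocks cross-doc smear
    return x + g[:, None] * prev

x = np.arange(8, dtype=float).reshape(4, 2)
out = smear_lookback(x, np.ones(4), np.array([True, False, False, True]))
```

Without the `(~is_bos)` factor, position 3 (a packed-boundary BOS) would receive features from position 2, the last token of the previous document — exactly the leak the rebank corrects.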
…d TTT — val_bpb 1.06080 (3-seed) Three seeds (314, 42, 1234) all run with an identical 168,434-byte train_gpt.py.
- 3-seed mean val_bpb: 1.06080088 (std 0.00095)
- val_loss (nats): 2.32143 mean
- Max artifact: 15,795,987 bytes
- Eval times: 433.4s / 437.9s / 479.7s (all under 600s)
- Train times: 599.6s each

Beats PR openai#1908 (1.06081076) chronologically; all clean per Issue openai#1017 C1-C4.
Withdrawing this submission: on review, the 3-seed mean (1.06080088) leads PR #1908 (1.06081076) by only 0.00001 BPB ≈ 0.0000255 nats — well below the 3-seed standard deviation (0.00095 BPB) and far below the community's empirical merge-floor convention (~0.0015 nats / ~0.0006 BPB). The result is statistically a tie with PR #1908, not a record beat. Closing rather than asking reviewers to spend time on it. Will re-submit if a stronger 3-seed lands before the deadline. Apologies for the noise.
val_bpb: 1.06080088 (3-seed mean, std 0.00095) | 2.32143 nats | ~15.80 MB | 8×H100 SXM, 600s train / 600s eval | Phased TTT
Extends the PR #1855 family (PR #1787 native base + NUM_LOOPS=2 triple recurrence) with our full stack: Smear gate (BOS-masked), LQER asymmetric rank-4 correction, and phased TTT — plus one new mechanism: logit calibration, an affine per-token-category correction (scale + per-category bias vector) fitted on the first 100K train tokens post-GPTQ. The correction takes ~5s and costs ≈5,200 compressed bytes from the 16MB budget.
Results
All seeds clear both 600s budgets and the 16,000,000-byte decimal artifact cap.
Key innovation — logit calibration (post-GPTQ train-data fit)
After GPTQ quantization, the output logit distribution shifts slightly. We fit a static affine correction
logits' = scale * logits + bias, where bias = features @ group_w is a fixed per-token-category vector (14 categories: length buckets, case, alpha/digit/punct, leading-space, newline). Fitting takes ~5s on the first 100K train tokens — no val data touched.

Mechanism stack
Issue #1017 four-condition compliance
scale·logits + bias is applied before softmax — the denominator still runs over the full vocab; _accumulate_bpb is unchanged. Stride-64 sliding eval. No rescore.

Length-sort defense (validation batching)
The TTT eval path length-sorts validation docs inside
_build_ttt_global_batches for batching efficiency. Each val token is still scored exactly once before any TTT update — Cond 4 holds at the token level via the per-doc chunk window. Merged precedent: this exact pattern appears in PR #77 (LoRA_TTT, 2026-03-17) at 2026-03-17_LoRA_TTT/train_gpt.py:871 (rank_docs.sort(key=lambda d: (d[1] - 2) // chunk_size)). Also used by the PR #1394 / PR #1736 phased-TTT lineage.

Logit calibration defense
Train-tokens-only post-quant correction — same class as ValCalib in the PR #1019 lineage. Fitted once, frozen, applied uniformly. No val data, no eval-time learning, no Σ truncation.
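A hedged numpy sketch of the affine correction described above (the real 14-category featurizer, fitting procedure, and compressed storage live in the PR; the shapes and names here are assumptions):

```python
import numpy as np

def apply_logit_calib(logits, features, scale, group_w):
    """logits: (T, V); features: (T, C) per-token category features
    (C=14 in this PR's description); bias = features @ group_w with
    group_w of shape (C, V). Applied before softmax, so the softmax
    denominator still runs over the full vocabulary (Cond compliance)."""
    return scale * logits + features @ group_w

rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 8))          # 5 positions, vocab of 8
features = np.zeros((5, 14))              # all-zero features -> zero bias
group_w = rng.normal(size=(14, 8))
out = apply_logit_calib(logits, features, 1.0, group_w)
```

Because scale and group_w are fitted once on train tokens and then frozen, applying the correction at eval is a fixed affine map — no eval-time learning is involved.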
Lineage
Run command (3-seed reproduction)
See
README.md for the full run command and data-prep step. All env vars match the seed logs.

Credits
@codemath3000 (PR #1855 NUM_LOOPS=2), @nprime06 (PR #1787 base), @msisovic (SmearGate BOS-mask catch on PR #1797), @samacqua (PR #1530 base), @romeerp (PR #1729 CaseOps), @bigbag (PR #1493 merged SOTA).