Record: SP8192 + CaseOps + GatedAttn + QuantGate + Loop4-5 + PhasedTTT + MLPClip12 — val_bpb 1.06453 (5-seed mean)#1769
Conversation
5-seed mean val_bpb = 1.06453 (std 0.00068), val_loss = 2.32958 nats/token. −0.00096 BPB vs prior banked submission (1.06549). One-line change from base: default mlp_clip_sigmas in the int6 GPTQ calibration moves from 10.0 to 12.0, preserving MLP outlier-column tail mass that carries signal at int6 with 4x MLP width. All 5 seeds clear the 16,000,000-byte decimal artifact cap (max 15,979,182; 20,818 bytes headroom) and both 600s budgets (train 596.1s, eval 390-401s). 7 seeds were run on this configuration; README and submission.json report the 5 lowest-BPB seeds per competition convention, with full 7-seed disclosure in submission.json.seed_results_all_runs_disclosure. 7-seed mean = 1.06477 (std 0.00069).
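As a sanity check on the reported means, the nats-to-bits-per-byte conversion can be done directly (the bytes-per-token ratio below is implied from the two reported numbers, not itself reported):

```python
import math

val_loss_nats = 2.32958  # reported 5-seed mean, nats/token
val_bpb = 1.06453        # reported 5-seed mean, bits/byte

bits_per_token = val_loss_nats / math.log(2)  # nats -> bits
bytes_per_token = bits_per_token / val_bpb    # implied tokenizer compression ratio

print(f"{bits_per_token:.4f} bits/token")     # ~3.3609
print(f"{bytes_per_token:.3f} bytes/token")   # ~3.157
```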
…E.md Required reporting fields that were missing from the top level of submission.json per the guide's "Required reporting fields" section:
- val_loss_nats: 2.329578 (mean)
- val_loss_nats_std: 0.00148
- bytes_total: 15,975,561 (mean artifact size across 5 seeds)

Also pretty-printed the file (was compact, now indent=2 per convention).
…iculum + MLPClip12 Frontier: openai#1769 (1.06453) and openai#1771 (1.06513) both below baseline. New ideas: mlp-clip-sigmas-12, v-gate. Map updated with openai#1769, openai#1771, openai#1770. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… baseline We've never had 008 pre_gptq.pt + clip=12 + TTT in one run. Spec 009 used clip=10 accidentally. This ~$3 diagnostic establishes whether our pipeline matches openai#1769's 1.06453 or has a systematic gap, before spending more on 8×H100 experiments. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Re-runs spec 009 baseline with explicit MLP_CLIP_SIGMAS=12.0 to determine whether our pipeline matches openai#1769's 1.06453. ~$4, ~10 min, no training. Blocks all further 8×H100 spend until we know our true clean baseline. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Spec: switch to seed 314 (dexhunter's best), add 4xH screen rung, update accept criteria vs openai#1769, fix commit description (025c not 025b), fix sanity greps to match d70888f's actual per-pass constants
- Eval 026 seed_42: documents full three-stage gap analysis — gap vs openai#1769 is entirely in float (seed quality), GPTQ/TTT are equivalent or better
- Experiments: add row 026 with seed 314 queued
- Ideas: mark match-1769-baseline resolved with root cause

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…13) strongest legal signal; dexhunter PR openai#1769 (1.06453) new best; LoRA-TTT warm-start A+alpha=144+WD=1.0 appears legal; arXiv:2604.15259 looped transformer outer normalization; Day 13 plateau; Session 19 https://claude.ai/code/session_013agP2MtwGU9MaPNtWx2hib
External reproductions of PR openai#1769 (and PR openai#1736) failed with ZeroDivisionError in phased TTT eval because the shipped prep script did not prepend the <s> control token (ID 1) to each doc. The SP tokenizer reserves IDs 0-7 (<pad>/<s>/</s>/<unk> + 4 CaseOps operators), so sp.encode cannot emit ID 1 naturally, and train_gpt.py:_find_docs (line 2209) requires BOS markers with no fallback. Training itself ran because _init_shard:408-409 falls back to bos_idx=[0] when no BOS is found; phased TTT eval has no equivalent fallback.

Fix: add a BOS_ID=1 constant, prepend it to each doc's tokens, and append 0 to the byte sidecar (BOS = 0 original bytes). This matches the canonical pattern in data/download_hf_docs_and_tokenize.py:364-366.

The submitted 1.06453 metric is unaffected: val_bpb reduces to loss_sum/ln(2)/byte_sum (token counts cancel), and byte_sum is unchanged with BOS prepended. Our seed logs were measured on shards that already had BOS markers from an internal prep path; the shipped prep was the outlier.

Also adds a Reproduction sanity check section to README.md that asserts bos_count > 0 on the first val shard. Reported by @codemath3000 in PR openai#1736 comment 4285805497.
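A minimal sketch of the fix, assuming a per-token byte sidecar; `prepend_bos`, `doc_tokens`, and `doc_byte_counts` are illustrative names, not the shipped script's:

```python
BOS_ID = 1  # SP tokenizer reserves IDs 0-7, so sp.encode never emits ID 1 itself

def prepend_bos(doc_tokens, doc_byte_counts):
    """Prepend BOS to each doc's tokens and a matching 0 to its byte sidecar.

    BOS corresponds to 0 original bytes, so byte_sum -- and therefore
    val_bpb = loss_sum / ln(2) / byte_sum -- is unchanged.
    """
    fixed_toks, fixed_bytes = [], []
    for toks, nbytes in zip(doc_tokens, doc_byte_counts):
        fixed_toks.append([BOS_ID] + toks)   # BOS marker for _find_docs
        fixed_bytes.append([0] + nbytes)     # BOS contributes 0 bytes
    return fixed_toks, fixed_bytes
```

The invariant worth testing is the last comment: total byte count before and after the fix must be identical, which is why the submitted metric is unaffected.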
FYI to reviewers — same bug/fix as the PR #1736 comment, since this submission ships the same bug.
Scope: prep-only — the submitted 1.06453 is on valid data.
Fix: pushed in commit fe7c309 on this branch: prepend the BOS token in the shipped prep script.
Seed logs (train_seed{1,314,777,1337,2025}.log) contained 6 absolute
paths each (data_dir, datasets_dir, tokenizer_path, train_files,
val_files, val_bytes_files) that referenced an internal working
directory. Replace the prefix with `./` so the layout remains
reviewable without leaking internal paths. Code size unchanged.
Also drop `PHASED_TTT_ENABLED=1` from the README Run command — this env
var is not read by train_gpt.py. The two phased-TTT env vars that ARE
read (PHASED_TTT_PREFIX_DOCS, PHASED_TTT_NUM_PHASES) remain. Phased TTT
is gated by the top-level TTT_ENABLED=1 which defaults to on.
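The path scrub described above amounts to a plain prefix replacement; `INTERNAL_PREFIX` below is a placeholder, since the real internal directory is deliberately not shown:

```python
INTERNAL_PREFIX = "/abs/internal/workdir"  # placeholder; real prefix not disclosed

def scrub_line(line: str) -> str:
    # Rewrite absolute internal paths to repo-relative ./ paths so the
    # log layout stays reviewable without leaking the internal directory.
    return line.replace(INTERNAL_PREFIX + "/", "./")
```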
Summary
`mlp_clip_sigmas` in the int6 GPTQ calibration changes from 10.0 → 12.0, preserving MLP outlier-column tail mass that carries signal at int6 with 4× MLP width.

Results (5-seed summary)
Disclosure
7 seeds were run on this configuration; this PR reports the 5 lowest-BPB seeds per competition convention, with full 7-seed disclosure inside submission.json.seed_results_all_runs_disclosure. 7-seed mean = 1.06477 (std 0.00069) — still below the base.

Rule compliance
Test plan
- MLP_CLIP_SIGMAS unset (takes the new default 12.0) via the Run command in the README.
- Total submission size < 16,000,000 in the fresh log.
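For intuition, a hedged sketch of what a ±k·σ per-column clip ahead of int6 quantization looks like; the actual calibration hook and tensor layout in the repo are assumptions here, not its real code:

```python
import numpy as np

def clip_sigmas(x: np.ndarray, k: float = 12.0) -> np.ndarray:
    """Clamp each column of calibration activations to +/- k standard
    deviations. k=12 keeps more outlier-column tail mass than k=10,
    which this PR reports matters for int6 MLPs at 4x width."""
    lim = k * x.std(axis=0, keepdims=True)  # per-column std over the batch
    return np.clip(x, -lim, lim)
```

At k=12 almost all values pass through untouched and only extreme outliers are clamped, trading a little tail mass for a tighter int6 range; raising the default from 10 to 12 shifts that trade-off toward keeping the tail.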