Records: SP8192 + LegalTTT 4ep — 1.0729 (Δ -0.0081 vs 04-09, p<1e-7) #1812
EthanNing wants to merge 1 commit into openai:main from …
Conversation
…BPB risk); PR openai#1812 4ep TTT 1.0729; arXiv:2604.21215 validates Triple Loop; Gram-NS CUDA 12.9+ caveat; Session 21 https://claude.ai/code/session_01RwXWBfnCNHi2auKTfvLAdt
haha lol that's a very small diff man
Compliance Statement

This submission conforms to all rules stated in the README.

1. Score-first TTT

README rule: "you are only allowed to test-time train on validation set tokens you've already evaluated your model on, since those tokens have already been graded!"
```python
# train_gpt.py, abridged from lines 1342-1408
base_model.eval()
with torch.no_grad():                               # no autograd during scoring
    ...
    loss_sum += scored_nll.sum()                    # score committed FIRST
if not is_last_chunk and h.ttt_epochs > 0:          # only then enter training
    base_model.train()
    for _ep in range(h.ttt_epochs):                 # 4 epochs over already-scored tokens
        ...
        optimizer.step()                            # parameter update
```

The inner-loop epochs operate strictly on tokens already scored under …

2. Tokenizer

README rule: "Submissions that edit the tokenizer will be examined much more carefully... Tokenizer bugs can unjustly improve your score and will result in disqualification."

This submission does not modify the tokenizer. It uses the official SP8192 vocabulary unchanged; no special tokens, vocab additions, byte-level pre-processing, or casefold logic are introduced. The tokenizer files are loaded read-only from the standard …

3. 16 MB self-contained artifact

README rule: "Any external data your model needs at eval time must be baked into the 16MB limit"; "The artifact must be fully self-contained and reproducible. No external downloads, training dataset access, or network calls are allowed during evaluation."

The submitted …

4. Reproducibility

The submitted seed protocol matches the challenge convention (seeds 42 / 314 / 999); reported …

5. Full stack delta vs PR #1493 (transparent disclosure)

Acknowledgement: A reviewer (…) …

Diff of …
(Items 3 and 4 are conceptually one change — the split MLP weight-decay treatment — implemented as a new param group plus a new knob; they're listed separately for completeness.) Architecture, attention, depth recurrence, parallel residuals, RoPE, embedding, GPTQ bit-widths and clip sigmas, brotli compressor, sliding-window eval, tokenizer, dataset, and seed protocol are all byte-for-byte identical to PR #1493.

Per-rule sanity check on each delta

…
Where the gain comes from

Empirically, both the optimizer changes (items 3-6, weight decay rebalance + shorter Muon momentum warmup) and the TTT changes (items 1-2, more epochs + larger chunks) contribute. The optimizer changes are responsible for the pre-TTT improvement that …

We did not include this breakdown in the original PR description, which created the misleading impression that …

6. Note on the "epoch threshold" misconception

A community research log has flagged this PR as exceeding a "≤3 epoch threshold" supposedly established by PR #1413. No such threshold exists in the README or in any maintainer ruling. PR #1413 (merged) happens to use 3 epochs as its choice; the open PR #1514 uses 5. The official rule constrains the score-first ordering, not the epoch count. For reviewer convenience, the unofficial community compliance guide at Issue #1017 (authored by NoesisGenesis, not by maintainers) decomposes the README's score-first rule into four conditions: strict causal dependence, full normalized distribution, score-before-update, single left-to-right pass. This submission satisfies all four; the implementation in …
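To make the optimizer-side delta concrete (the split MLP weight-decay treatment noted as items 3-4 of the stack delta above), here is a minimal PyTorch sketch of separate parameter groups with different decay; the name-based group selection and the decay values are illustrative assumptions, not the PR's actual settings.

```python
# Hedged sketch: split weight decay via param groups (illustrative names and values).
import torch

def build_param_groups(model: torch.nn.Module, mlp_wd: float = 0.10, other_wd: float = 0.01):
    """Give MLP weight matrices one decay value and everything else another (the 'new knob')."""
    mlp_params, other_params = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        # Assumption: MLP weights are identifiable by name; adapt to the real module names.
        if "mlp" in name and p.ndim >= 2:
            mlp_params.append(p)
        else:
            other_params.append(p)
    return [
        {"params": mlp_params, "weight_decay": mlp_wd},
        {"params": other_params, "weight_decay": other_wd},
    ]

# Usage: the PR pairs this with its Muon-family optimizer; AdamW here is only a stand-in.
# optimizer = torch.optim.AdamW(build_param_groups(base_model), lr=3e-4)
```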
I checked the log, but 4ep may not be the real reason for the performance gain. As shown in seed999.log, the sliding-window bpb is already much lower than the reported score of the Apr-9 SOTA record, which is …
- Adds a 2-line BOS mask in both the forward_logits and forward_ttt SmearGate paths. Before the fix, the last token of doc N smeared into the BOS of doc N+1 — a model-quality bug, not a C1 issue. Identical fix to PR openai#1851 @aquariouseworkman, audit by @cocohearts.
- runpod/phase_g_3seed.sh: full 3-seed driver. Sets the PR openai#1797 stack env vars + the PR openai#1855 9-hparam greedy stack delta: MLP_CLIP_SIGMAS=11.5 EMBED_CLIP_SIGMAS=14.0 WARMDOWN_FRAC=0.85 BETA2=0.99 TTT_BETA2=0.99 TTT_WEIGHT_DECAY=0.5 TTT_LORA_RANK=80 SPARSE_ATTN_GATE_SCALE=0.5 PHASED_TTT_PREFIX_DOCS=2500. Mixers (NGRAM/TEMP) stay OFF — pure neural baseline + bug fix + hparam stack. Auto-runs a Welch t-test vs PR openai#1797 (1.06157±0.00066); a sketch of that comparison follows this list.
- TTT 4-epoch (PR openai#1812) explicitly NOT adopted: that scheme targets the PR openai#1493 SGD-on-whole-model TTT path, not the PR openai#1797 LoRA-phased per-doc-reset path we're on. No clean mapping.
- Legality: all 16/16 unit tests still pass. The BOS fix preserves causality (it only zeroes a gate at positions where the current token is BOS, never references future tokens).
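The driver's significance check can be reproduced offline from per-seed BPB values. A minimal sketch, assuming SciPy is available, that the quoted PR openai#1797 baseline (1.06157 ± 0.00066) is a 3-seed mean ± standard deviation, and with placeholder values for the new run (not measured results):

```python
# Hedged sketch of the Welch t-test comparison (not the actual phase_g_3seed.sh code).
import statistics
from scipy.stats import ttest_ind_from_stats

# Placeholder per-seed BPB values for the new run; substitute the real seed outputs.
new_run_bpb = [1.0600, 1.0602, 1.0598]
new_mean = statistics.mean(new_run_bpb)
new_std = statistics.stdev(new_run_bpb)       # sample standard deviation (ddof=1)

# Baseline quoted above; treating ±0.00066 as a std over 3 seeds is an assumption.
base_mean, base_std, base_n = 1.06157, 0.00066, 3

t_stat, p_value = ttest_ind_from_stats(
    mean1=new_mean, std1=new_std, nobs1=len(new_run_bpb),
    mean2=base_mean, std2=base_std, nobs2=base_n,
    equal_var=False,                          # Welch's t-test (unequal variances)
)
print(f"Welch t = {t_stat:.2f}, two-sided p = {p_value:.2e}")
```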
Phase J (one-time data prep, done):
- train_sp10240_caseops.py: train SentencePiece BPE at vocab=10240 over CaseOps-transformed FineWeb. Reserves U+E001..U+E005 as user-defined symbols (matches the PR openai#1729 / SP8192 reservation set). 96-worker, ~25 min; see the tokenizer-training sketch after this phase plan.
- prepare_caseops_data_parallel.py with --sp pointing at the new model produces SP10240 caseops shards (~27 GB). Uploaded to the private HF dataset hf://FijaEE/parameter-golf-sp10240-caseops (1434 train + 5 val + 5 val_bytes shards).
- Tokenizer model + vocab file committed under tokenizers/ for git clone.

Phase K (TTT params budget tradeoff, ready to run):
- runpod/phase_k_ttt_tradeoff.sh: train the SP8192 V2 baseline once on 8xH100 (~10 min, saves model.bin), then run TTT_EVAL_ONLY=1 for 4 configs reusing the saved artifact:
  - K0: grad=1 prefix=2000 phases=3 ctx=2048 (V2 baseline)
  - K1: grad=2 prefix=2000 phases=3 ctx=2048 (oracle, expected over-budget)
  - K2: grad=2 prefix=1500 phases=1 ctx=2048 (cut prefix+phases)
  - K3: grad=2 prefix=2000 phases=3 ctx=1024 (cut ctx)
- Auto-picks the lowest-BPB config that fits 600s for Phase L.

Phase L (3-seed combo, parametrized by Phase K winner):
- runpod/phase_l_combo.sh: PR openai#1797 V2 stack + SP10240 + LoRA rank 96 + best TTT params from K. Runs 3 seeds (42, 314, 1234), reports a Welch t-test vs PR openai#1797 (1.06157±0.00066) and the 0.005-nat record bar.

Hypothesis (per user observation): the vocab progression 1024→2048→4096→8192 has been monotonically beneficial; no one in the queue has tried sp10240 without PPM-D. PR openai#1814's lowercase-SP10240 single-seed (1.0742) suggests a ~ -0.0015 BPB delta from vocab alone vs PR openai#1797's V2 SP8192 baseline (1.05998 seed-42). Combined with the TTT 2-step bump (PR openai#1812 showed 4-epoch delivered -0.008 BPB on a different stack) and LoRA rank 96, total expected ~1.045-1.055 BPB if Phase K finds a feasible budget.
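For the Phase J tokenizer step, here is a minimal sketch of how a vocab-10240 SentencePiece BPE model could be trained with the reserved private-use symbols. The input path and any flags not mentioned above are illustrative assumptions, not the contents of the committed train_sp10240_caseops.py:

```python
# Hedged sketch of the Phase J tokenizer training step (illustrative input path/flags).
import sentencepiece as spm

# Reserve U+E001..U+E005 as user-defined symbols, matching the SP8192 reservation set.
reserved = [chr(cp) for cp in range(0xE001, 0xE006)]

spm.SentencePieceTrainer.train(
    input="caseops_fineweb_sample.txt",   # assumed path to CaseOps-transformed text
    model_prefix="sp10240_caseops",       # writes sp10240_caseops.model / .vocab
    vocab_size=10240,
    model_type="bpe",
    user_defined_symbols=reserved,
    num_threads=96,                       # matches the 96-worker note above
)
```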
Four post-training specs to stack on 060A's openai#1855 port:
- 060I: port PR openai#1908's activation-aware mixed-bit GPTQ (3-seed validated −0.000265 BPB on openai#1855 itself). 4 env vars + ~100 LOC port.
- 060J: PHASED_TTT_NUM_PHASES 3→4 (low confidence; openai#1727 measured noise on a weaker base, never tested with 2500 prefix).
- 060L: PHASED_TTT_PREFIX_DOCS 2500→3000 (high confidence; codemath3000 greedy-validated 2000→2500 on this exact stack in openai#1855).
- 060M: TTT_EPOCHS 3→4 (highest predicted Δ; PR openai#1812 reported −0.008 on a weaker base; never tested on a phased+SmearGate stack like openai#1855).

All eval-only via RESUME_FROM_CKPT on 060A's seed_42_4h .pt; see the launch sketch after this note. No code change for 060J/L/M. 060K (rank-up) deleted — it rowed against openai#1855's own greedy direction (which decreased rank 96→80).

Idea files: research/ideas/{1908-awq-lite-mixed-bit-gptq,ttt-budget-reinvestment}.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
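Since 060J/L/M are pure env-var overrides on a saved checkpoint, they can be driven by a small launcher. A minimal sketch, assuming the env var names above, a torchrun-style entry point into train_gpt.py, and a hypothetical checkpoint path; none of this is the actual 060-series tooling:

```python
# Hedged sketch: relaunch eval-only variants as env-var overrides on a saved checkpoint.
import os
import subprocess

BASE_ENV = {
    "RESUME_FROM_CKPT": "checkpoints/060A_seed_42_4h.pt",  # assumed path
    "TTT_EVAL_ONLY": "1",                                   # reuse the trained artifact
}

VARIANTS = {
    "060J": {"PHASED_TTT_NUM_PHASES": "4"},
    "060L": {"PHASED_TTT_PREFIX_DOCS": "3000"},
    "060M": {"TTT_EPOCHS": "4"},
}

for name, overrides in VARIANTS.items():
    env = {**os.environ, **BASE_ENV, **overrides}
    print(f"launching {name} with overrides {overrides}")
    subprocess.run(
        ["torchrun", "--standalone", "--nproc_per_node=8", "train_gpt.py"],
        env=env,
        check=True,
    )
```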
Summary
New SOTA on track_10min_16mb: 3-seed mean val_bpb = 1.07290 (std 0.00016),
beating the 2026-04-09 record (1.0810) by 0.00810 nats.
Statistical significance
Delta vs prior 04-09 record
All other components (SP8192, 3-Layer Recurrence, Parallel Residuals, QK-Gain
5.25, MuonEq-R, GPTQ SDClip, Brotli-11) inherited unchanged from
@bigbag's 04-09 SOTA stack.
Compliance
Score-first TTT only. No SLOT, no pre-quant TTT, no n-gram cache, no ETLB,
no tokenizer changes. SP8192 BPE on FineWeb10B default data.
See records/track_10min_16mb/2026-04-25_SP8192_3LayerRecur_LegalTTT_4ep/README.md for full details.
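As a quick reviewer aid, the 16 MB self-contained artifact rule quoted in the compliance statement can be spot-checked with a size probe on the compressed artifact. A minimal sketch; the artifact filename and the MiB interpretation of the limit are assumptions:

```python
# Hedged sketch: check that the submitted artifact fits the 16 MB track budget.
import os

LIMIT_BYTES = 16 * 1024 * 1024   # assuming the limit is binary MiB
artifact = "records/track_10min_16mb/2026-04-25_SP8192_3LayerRecur_LegalTTT_4ep/model.bin.br"  # assumed name

size = os.path.getsize(artifact)
print(f"{artifact}: {size} bytes ({size / LIMIT_BYTES:.1%} of budget)")
assert size <= LIMIT_BYTES, "artifact exceeds the 16 MB limit"
```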