Skip to content

Records: SP8192 + LegalTTT 4ep — 1.0729 (Δ -0.0081 vs 04-09, p<1e-7)#1812

Open
EthanNing wants to merge 1 commit intoopenai:mainfrom
EthanNing:record/2026-04-25-sp8192-legal-ttt-4ep
Open

Records: SP8192 + LegalTTT 4ep — 1.0729 (Δ -0.0081 vs 04-09, p<1e-7)#1812
EthanNing wants to merge 1 commit intoopenai:mainfrom
EthanNing:record/2026-04-25-sp8192-legal-ttt-4ep

Conversation

@EthanNing
Copy link
Copy Markdown

Summary

New SOTA on track_10min_16mb: 3-seed mean val_bpb = 1.07290 (std 0.00016),
beating the 2026-04-09 record (1.0810) by 0.00810 nats.

Statistical significance

  • Welch t = -54.93, df = 3.80
  • One-sided p < 1e-7
  • All 3 seeds VALID (artifact under 16 MB, train/eval under 600 s)

Delta vs prior 04-09 record

  • TTT epochs 3 -> 4 (extra adaptation budget under same score-first protocol)
  • Split MLP weight decay (mlp 0.115 / attn 0.095)

All other components (SP8192, 3-Layer Recurrence, Parallel Residuals, QK-Gain
5.25, MuonEq-R, GPTQ SDClip, Brotli-11) inherited unchanged from
@bigbag's 04-09 SOTA stack.

Compliance

Score-first TTT only. No SLOT, no pre-quant TTT, no n-gram cache, no ETLB,
no tokenizer changes. SP8192 BPE on FineWeb10B default data.

See records/track_10min_16mb/2026-04-25_SP8192_3LayerRecur_LegalTTT_4ep/README.md
for full details.

sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 25, 2026
…BPB risk); PR openai#1812 4ep TTT 1.0729; arXiv:2604.21215 validates Triple Loop; Gram-NS CUDA 12.9+ caveat; Session 21

https://claude.ai/code/session_01RwXWBfnCNHi2auKTfvLAdt
@cocohearts
Copy link
Copy Markdown
Collaborator

haha lol thats a very small diff man

@EthanNing
Copy link
Copy Markdown
Author

EthanNing commented Apr 26, 2026

Compliance Statement

This submission conforms to all rules stated in the openai/parameter-golf README. A brief audit follows, organized by rule area, plus a transparent stack-delta disclosure vs the prior SOTA (PR #1493).

1. Score-first TTT

README rule: "you are only allowed to test-time train on validation set tokens you've already evaluated your model on, since those tokens have already been graded!"

train_gpt.py:eval_val_ttt enforces this per chunk: scoring runs under torch.no_grad() and the score is committed to loss_sum before any optimizer step is taken on the chunk's tokens.

# train_gpt.py, abridged from lines 1342-1408
base_model.eval()
with torch.no_grad():                              # no autograd during scoring
    ...
    loss_sum += scored_nll.sum()                   # score committed FIRST
if not is_last_chunk and h.ttt_epochs > 0:         # only then enter training
    base_model.train()
    for _ep in range(h.ttt_epochs):                # 4 epochs over already-scored tokens
        ...
        optimizer.step()                           # parameter update

The inner-loop epochs operate strictly on tokens already scored under torch.no_grad() earlier in the same chunk. Both ttt_epochs and ttt_chunk_tokens are tunable values (see §5 below); neither is a rule dimension.

2. Tokenizer

README rule: "Submissions that edit the tokenizer will be examined much more carefully... Tokenizer bugs can unjustly improve your score and will result in disqualification."

This submission does not modify the tokenizer. It uses the official SP8192 vocabulary unchanged; no special tokens, vocab additions, byte-level pre-processing, or casefold logic are introduced. The tokenizer files are loaded read-only from the standard openai_parameter_golf/data/tokenizers/ path.

3. 16 MB self-contained artifact

README rule: "Any external data your model needs at eval time must be baked into the 16MB limit"; "The artifact must be fully self-contained and reproducible. No external downloads, training dataset access, or network calls are allowed during evaluation."

The submitted train_gpt_packed.py plus quantized model blob measure under 16,000,000 bytes (verified by pack_submission.py post-train). Eval reads only from the val tokens shipped with the challenge; no sidecar files, no auxiliary downloads, no network calls during evaluation.

4. Reproducibility

The submitted seed protocol matches the challenge convention (seeds 42 / 314 / 999); reported val_bpb is the 3-seed mean. The training pipeline is deterministic up to AllReduce / Muon non-determinism inherent to 8×H100 distributed training; the submitted scripts are sufficient to reproduce the reported number.

5. Full stack delta vs PR #1493 (transparent disclosure)

Acknowledgement: A reviewer (@ghrua) correctly observed that the quantized_sliding_window BPB in our seed999.log is 1.0744, which is already lower than PR #1493's reported 1.0827. This means the post-quant pre-TTT improvement is substantial on its own and the delta is NOT explained by ttt_epochs 3 → 4 alone. We agree with this observation and disclose the complete set of differences below.

Diff of train_gpt.py between this PR and train_gpt_old_sota.py (PR #1493 baseline) yields exactly 5 substantive deltas, all hyperparameter-level except for one optimizer-grouping change:

# Knob OLD (#1493) NEW (this PR) Type Compliance angle
1 TTT_EPOCHS 3 4 int Epoch count, not a rule dimension; PR #1413 (merged) uses 3, PR #1514 (open) uses 5 — 4 sits inside that observed range
2 TTT_CHUNK_TOKENS 32768 65536 int (2×) Chunk size only; score-first ordering preserved (all chunk tokens still scored under torch.no_grad() before any train step on the chunk)
3 MUON_WD_MLP (n/a, single group) 0.115 new knob Optimizer hyperparameter; affects only the training phase
4 Muon param-group split single group, all matrix params share muon_wd=0.095 two groups: attn matrices (muon_wd=0.095) + MLP matrices (muon_wd_mlp=0.115) structural Standard PyTorch optimizer convention; no rule dimension
5 ADAM_WD 0.02 0.005 float Optimizer hyperparameter; training-only
6 MUON_MOMENTUM_WARMUP_FRACTION 0.33 0.22 float Optimizer schedule hyperparameter; training-only

(Items 3 and 4 are conceptually one change — the split MLP weight-decay treatment — implemented as a new param group plus a new knob; they're listed separately for completeness.)

Architecture, attention, depth recurrence, parallel residuals, RoPE, embedding, GPTQ bit-widths and clip sigmas, brotli compressor, sliding-window eval, tokenizer, dataset, and seed protocol are all byte-for-byte identical to PR #1493.

Per-rule sanity check on each delta

  • Tokenizer (rule §2): unchanged. None of the 6 deltas touch tokenization.
  • Score-first TTT (rule §1): preserved. ttt_epochs and ttt_chunk_tokens only change the count of training passes and the granularity of the score-then-train cycle; the score-before-update invariant in eval_val_ttt is unchanged (same torch.no_grad() block, same loss_sum commit point, same in-chunk ordering).
  • 16 MB self-contained (rule §3): unaffected. None of the 6 deltas add files, sidecar data, or runtime network calls. Packed artifact is unchanged in shape.
  • Reproducibility (rule §4): unchanged. Same seed protocol, same single train_gpt.py, deterministic up to the same 8×H100 AllReduce/Muon non-determinism as PR Record: SP8192 + 3-Layer Recurrence + Parallel Residuals + QK-Gain 5.25 + Legal TTT — val_bpb 1.0810 (3-seed mean) #1493.

Where the gain comes from

Empirically, both the optimizer changes (items 3-6, weight decay rebalance + shorter Muon momentum warmup) and the TTT changes (items 1-2, more epochs + larger chunks) contribute. The optimizer changes are responsible for the pre-TTT improvement that @ghrua correctly identified in the sliding-window log line; the TTT changes contribute the post-quant recovery on top.

We did not include this breakdown in the original PR description, which created the misleading impression that ttt_epochs 3 → 4 was the load-bearing change. The actual driver is the combination above; we apologize for the misleading framing and have updated this compliance statement to disclose the full delta.

6. Note on the "epoch threshold" misconception

A community research log has flagged this PR as exceeding a "≤3 epoch threshold" supposedly established by PR #1413. No such threshold exists in the README or in any maintainer ruling. PR #1413 (merged) happens to use 3 epochs as its choice; the open PR #1514 uses 5. The official rule constrains the score-first ordering, not the epoch count.

For reviewer convenience, the unofficial community compliance guide at Issue #1017 (authored by NoesisGenesis, not by maintainers) decomposes the README's score-first rule into four conditions: strict causal dependence, full normalized distribution, score-before-update, single left-to-right pass. This submission satisfies all four; the implementation in eval_val_ttt (lines 1301-1408) provides the relevant code if a line-by-line walkthrough is helpful.

@ghrua
Copy link
Copy Markdown

ghrua commented Apr 26, 2026

I checked the log, but 4ep may not be the real reason for the performance gain. As shown in the seed999.log:

quantized_sliding_window val_loss:2.77537502 val_bpb:1.07443416 eval_time:129816ms

The sliding window bpb is already much lower than the reported score of Apr-9's SOTA record, which is 1.0827. It is worth noting that the sliding window evaluation has no TTT involved.

Fija pushed a commit to Fija/parameter-golf that referenced this pull request Apr 28, 2026
- Adds 2-line BOS mask in both forward_logits and forward_ttt SmearGate
  paths. Before fix, the last token of doc N smeared into the BOS of doc
  N+1 — model-quality bug, not a C1 issue. Identical fix to PR openai#1851
  @aquariouseworkman, audit by @cocohearts.

- runpod/phase_g_3seed.sh: full 3-seed driver. Sets PR openai#1797 stack env
  vars + the PR openai#1855 9-hparam greedy stack delta:
    MLP_CLIP_SIGMAS=11.5 EMBED_CLIP_SIGMAS=14.0 WARMDOWN_FRAC=0.85
    BETA2=0.99 TTT_BETA2=0.99 TTT_WEIGHT_DECAY=0.5 TTT_LORA_RANK=80
    SPARSE_ATTN_GATE_SCALE=0.5 PHASED_TTT_PREFIX_DOCS=2500
  Mixers (NGRAM/TEMP) stay OFF — pure neural baseline + bug fix +
  hparam stack. Auto-runs Welch t-test vs PR openai#1797 (1.06157±0.00066).

- TTT 4-epoch (PR openai#1812) explicitly NOT adopted: that scheme targets the
  PR openai#1493 SGD-on-whole-model TTT path, not the PR openai#1797 LoRA-phased
  per-doc-reset path we're on. No clean mapping.

Legality: all 16/16 unit tests still pass. BOS fix preserves causality
(it only zeroes a gate at positions where current token is BOS, never
references future tokens).
Fija pushed a commit to Fija/parameter-golf that referenced this pull request Apr 28, 2026
Phase J (one-time data prep, done):
- train_sp10240_caseops.py: train SentencePiece BPE at vocab=10240 over
  CaseOps-transformed FineWeb. Reserves U+E001..U+E005 as user-defined
  symbols (matches PR openai#1729 / SP8192 reservation set). 96-worker, ~25 min.
- prepare_caseops_data_parallel.py with --sp pointing at the new model
  produces SP10240 caseops shards (~27 GB). Uploaded to private HF
  dataset hf://FijaEE/parameter-golf-sp10240-caseops (1434 train + 5 val
  + 5 val_bytes shards).
- Tokenizer model + vocab file committed under tokenizers/ for git clone.

Phase K (TTT params budget tradeoff, ready to run):
- runpod/phase_k_ttt_tradeoff.sh: train SP8192 V2 baseline once on 8xH100
  (~10 min, saves model.bin), then run TTT_EVAL_ONLY=1 for 4 configs
  reusing the saved artifact:
    K0: grad=1 prefix=2000 phases=3 ctx=2048   (V2 baseline)
    K1: grad=2 prefix=2000 phases=3 ctx=2048   (oracle, expected over-budget)
    K2: grad=2 prefix=1500 phases=1 ctx=2048   (cut prefix+phases)
    K3: grad=2 prefix=2000 phases=3 ctx=1024   (cut ctx)
  Auto-picks the lowest-BPB config that fits 600s for Phase L.

Phase L (3-seed combo, parametrized by Phase K winner):
- runpod/phase_l_combo.sh: PR openai#1797 V2 stack + SP10240 + LoRA rank 96 +
  best TTT params from K. Runs 3 seeds (42, 314, 1234), reports Welch
  t-test vs PR openai#1797 (1.06157±0.00066) and the 0.005-nat record bar.

Hypothesis (per user observation): vocab progression 1024→2048→4096→8192
has been monotonically beneficial; no one in the queue has tried sp10240
without PPM-D. PR openai#1814's lowercase-SP10240 single-seed (1.0742) suggests
~ -0.0015 BPB delta from vocab alone vs PR openai#1797's V2 SP8192 baseline
(1.05998 seed-42). Combined with TTT 2-step bump (PR openai#1812 showed 4-epoch
delivered -0.008 BPB on a different stack) and LoRA rank 96, total
expected ~1.045-1.055 BPB if Phase K finds a feasible budget.
amrayach added a commit to amrayach/parameter-golf that referenced this pull request Apr 28, 2026
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 28, 2026
Four post-training specs to stack on 060A's openai#1855 port:

- 060I: port PR openai#1908's activation-aware mixed-bit GPTQ (3-seed validated
  −0.000265 BPB on openai#1855 itself). 4 env vars + ~100 LOC port.
- 060J: PHASED_TTT_NUM_PHASES 3→4 (low confidence; openai#1727 measured noise on
  weaker base, never tested with 2500 prefix).
- 060L: PHASED_TTT_PREFIX_DOCS 2500→3000 (high confidence; codemath3000
  greedy-validated 2000→2500 on this exact stack in openai#1855).
- 060M: TTT_EPOCHS 3→4 (highest predicted Δ; PR openai#1812 reported −0.008 on
  weaker base; never tested on phased+SmearGate stack like openai#1855).

All eval-only via RESUME_FROM_CKPT on 060A's seed_42_4h pt. No code change
for 060J/L/M. 060K (rank-up) deleted — rowed against openai#1855's own greedy
direction (which decreased rank 96→80).

Idea files: research/ideas/{1908-awq-lite-mixed-bit-gptq,ttt-budget-reinvestment}.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants