Skip to content

Record: PreQuantTTT + Sliding Window on PR #1855 stack, val_bpb=1.01355 (3-seed)#1958

Closed
okezue wants to merge 1 commit intoopenai:mainfrom
okezue:prequant-ttt-sliding-submission
Closed

Record: PreQuantTTT + Sliding Window on PR #1855 stack, val_bpb=1.01355 (3-seed)#1958
okezue wants to merge 1 commit intoopenai:mainfrom
okezue:prequant-ttt-sliding-submission

Conversation

@okezue
Copy link
Copy Markdown

@okezue okezue commented Apr 30, 2026

RECORD SUBMISSION

val_bpb = 1.01355 (3-seed mean, std 0.00038) on track_10min_16mb. Beats the prior 3-seed merged record (1.0810 from the 2026-04-09 SP8192 + 3-Layer Recurrence + Parallel Residuals + QK-Gain + Legal TTT submission) by 0.0675 BPB, well above the 0.005-nat threshold and clearing p < 0.01 with 3 seeds.

cc @cocohearts @valerio-oai for record review.

Per-seed

seed val_bpb artifact bytes train ms eval ops ms
42 1.01398 15,911,549 599,654 365,878
314 1.01341 15,913,072 599,584 366,030
999 1.01325 15,913,599 599,588 367,711
mean 1.01355 15,912,740
std 0.00038

Stack

  1. PR Record: SP8192 + LQER + Sparse Attn Gate + BOS-Fixed SmearGate + 9-Hparam Greedy Stack — val_bpb 1.06108 (3-seed mean) #1855 SOTA base (@codemath3000): SP8192 CaseOps tokenizer + 36M GPT with looped encoder/decoder + Polar Express NS Muon (PR Record: SP4096 + Polar Express + MuonEq-R + Depth Recurrence — 1.0923 BPB (3-seed) #1344) + LQER + BOS-fixed SmearGate + 9-hyperparameter greedy stack (EMBED_BITS=7, MIN_LR=0.1, MLP_CLIP_SIGMAS=11.5, EMBED_CLIP_SIGMAS=14.0, WARMDOWN_FRAC=0.85, BETA2=0.99, TTT_BETA2=0.99, TTT_WEIGHT_DECAY=0.5, TTT_LORA_RANK=80).
  2. PR {RECORD} CaseOps pre-quant TTT record (1.0354 BPB) #1911 Pre-quantization AdamW TTT: 21 epochs over the full validation set after the legality-grading pre-quantization post-ema eval. Freezes blocks 0-1 and tok_emb.weight. Federated AVG of remaining params across 8 GPUs after each epoch. Cosine LR 5e-4 to 5e-5. Drops BF16 val_bpb from 1.064 (post-EMA) to 0.999.
  3. PR Record: SP8192 + 3-Layer Recurrence + Parallel Residuals + QK-Gain 5.25 + Legal TTT — val_bpb 1.0810 (3-seed mean) #1493 sliding-window stride-64 eval on the post-GPTQ model. Single-pass, strictly causal. Drops post-quant val_bpb from 1.023 to 1.014.
  4. Per-group lrzip compression (PR Record: Per-Layer Adaptive GPTQ Clip + int7 Embeddings + MLR 0.026 — val_bpb 1.07493 (3-seed mean) #1586 / RECORD: SmearGate + Attention Output Gate + Legal TTT | val_bpb=1.07139 #1667 / Record: CaseOps Tokenizer + Tapered WD - val_bpb 1.0678 (3-seed mean) #1729) on the GPTQ tensor blob. Saves ~236 KB versus brotli-11, which is the difference between 16,148,947 bytes (over the 16,000,000 byte decimal cap with brotli) and 15,913,072 bytes (under the cap with pergroup).

Compliance

For every seed:

  • Train ≤ 600,000 ms (capped at 599,654 / 599,584 / 599,588 ms)
  • Eval ops ≤ 600,000 ms (365,878 / 366,030 / 367,711 ms)
  • Artifact ≤ 16,000,000 bytes (15,911,549 / 15,913,072 / 15,913,599)
  • 8xH100 SXM
  • No SLOT, no n-gram cache, no logit bias, no ETLB
  • Standard softmax over the SP8192 alphabet at every scored position
  • Single-pass: each val token contributes exactly one BPB term in the final quantized_sliding_window score
  • Pre-quant TTT runs after the pre-quantization post-ema eval grades the val tokens, satisfying the README rule that TTT may only run on already-evaluated val tokens
  • Sliding-window eval is causal: position t is scored from prefix [t - (seq_len - stride), t), never from t itself or from positions to the right of t

Credits

Reproduction

pip install brotli sentencepiece
pip install flash_attn_3 --no-deps --find-links https://windreamer.github.io/flash-attention3-wheels/cu129_torch291/
apt-get install -y lrzip

MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf python3 data/cached_challenge_fineweb.py \
  --variant sp8192_lossless_caps_caseops_v1_reserved

for SEED in 42 314 999; do
  SEED=$SEED \
    CASEOPS_ENABLED=1 COMPRESSOR=pergroup \
    SMEAR_GATE_ENABLED=1 SPARSE_ATTN_GATE_ENABLED=1 SPARSE_ATTN_GATE_SCALE=0.5 \
    EMBED_BITS=7 MIN_LR=0.1 GPTQ_RESERVE_SECONDS=0.5 \
    MLP_CLIP_SIGMAS=11.5 EMBED_CLIP_SIGMAS=14.0 WARMDOWN_FRAC=0.85 \
    BETA2=0.99 TTT_BETA2=0.99 TTT_WEIGHT_DECAY=0.5 TTT_LORA_RANK=80 LOGIT_SOFTCAP=15 \
    TTT_ENABLED=0 SLIDING_WINDOW_ENABLED=1 EVAL_STRIDE=64 \
    PREQUANT_TTT_ENABLED=1 PREQUANT_TTT_EPOCHS=21 PREQUANT_TTT_LR=5e-4 \
    torchrun --standalone --nproc_per_node=8 train_gpt.py
done

Full README, submission.json, train_gpt.py, lossless_caps.py, and the 3 train logs are in records/track_10min_16mb/2026-04-28_PreQuantTTT_on_SOTA/.

…l_bpb=1.01355 (3-seed)

3-seed mean 1.01355 BPB (std 0.00038) on track_10min_16mb. Beats prior 3-seed merged record (1.0810) by 0.0675 BPB, well above the 0.005 nat threshold.

Stack: PR openai#1855 base (SP8192 + LQER + SmearGate + 9-hp greedy) plus PR openai#1911 pre-quantization AdamW TTT (21 epochs, federated AVG across 8 GPUs, freeze first 2 blocks and tok_emb, cosine 5e-4 to 5e-5) plus PR openai#1493 sliding-window stride-64 eval. Artifact uses lrzip per-group compression (PR openai#1586 line) to fit under the 16,000,000 byte decimal cap.

Per-seed:
  42:  val_bpb=1.01398, artifact=15911549 bytes, train=599654 ms, eval ops=365878 ms
  314: val_bpb=1.01341, artifact=15913072 bytes, train=599584 ms, eval ops=366030 ms
  999: val_bpb=1.01325, artifact=15913599 bytes, train=599588 ms, eval ops=367711 ms

All 3 seeds: train under 600s, eval under 600s, artifact under 16,000,000 bytes. Pre-quant TTT runs after the pre-quantization legality grade on val tokens, satisfying the README rule "you are only allowed to test-time train on validation set tokens you've already evaluated your model on".
@anmarhindi
Copy link
Copy Markdown

Looks like a C3. The pre_quant_adamw_ttt function runs 21 epochs of AdamW directly on the validation token stream, updating most of the model's parameters, before the final quantized_sliding_window eval grades those same tokens. That's score-after-adapt, not score-first.

@okezue
Copy link
Copy Markdown
Author

okezue commented Apr 30, 2026

Withdrawing this submission. Reviewer correctly flagged C3 violation: pre_quant_adamw_ttt runs 21 epochs of AdamW on the full validation token stream before the final quantized_sliding_window eval reports the leaderboard number. Even though a diagnostic pre-quantization post-ema eval grades the val tokens first, the reported grade is from a model that has trained on those tokens. That's score-after-adapt and breaks the rule that the reported per-token BPB must come from a forward pass by a model that has not yet TTT-trained on that token. Reposting a clean legal submission shortly with post-quant LoRA distillation on train data only.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants