# Corrected: PR #2014 stack + LeakyReLU 0.3 + token-only in-timer n-gram TTT (val_bpb 1.05702)

**Corrected 3-seed mean: val_bpb 1.05701907** | **max 15,989,637 bytes** | 8xH100 SXM | 600s train + in-timer eval

This commit corrects the state originally submitted in #2140. The initial #2140 logs accidentally re-enabled the within-word, word-start, and agreement n-gram channels. The corrected run uses the intended PR #2018 posture: token-only n-gram tilt, with the target-token-gated channels disabled.

## Results

| Seed | Train steps | Pre-quant val_bpb | Quantized val_bpb | Post-TTT val_bpb | Train time | Eval time | Artifact bytes | Notes |
|------|------------:|------------------:|------------------:|-----------------:|-----------:|----------:|---------------:|-------|
| 42 | 4,901 | 1.05911087 | 1.06721120 | **1.05590816** | 596.1s | 506.3s | 15,989,637 | token-only in-timer n-gram hints, prefix=2500, chunk=64 |
| 0 | 4,861 | 1.06126113 | 1.07025102 | **1.05838308** | 596.2s | 553.5s | 15,985,432 | token-only in-timer n-gram hints, prefix=2500, chunk=64 |
| 314 | 4,855 | 1.06013753 | 1.06808383 | **1.05676598** | 596.1s | 473.2s | 15,983,433 | token-only in-timer n-gram hints, prefix=2500, chunk=64 |
| **Mean** | **4872.3** | **1.06016984** | **1.06851535** | **1.05701907** | **596.1s** | **511.0s** | **15,989,637 max** | 3 corrected seeds |

Compared with the last merged leaderboard record (#1855, 1.06107587 BPB), this corrected 3-seed mean improves val_bpb by **0.00405680**.
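The headline numbers are internally consistent; a quick sanity check in plain Python, using the per-seed post-TTT values from the table above:

```python
# Per-seed post-TTT val_bpb values from the results table
seeds = [1.05590816, 1.05838308, 1.05676598]
mean_bpb = round(sum(seeds) / len(seeds), 8)   # 1.05701907

# Improvement over the last merged leaderboard record (#1855)
baseline = 1.06107587
delta = round(baseline - mean_bpb, 8)          # 0.00405680
```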

## Summary

This corrected submission starts from the PR #2014 strict-compliance stack and adds two changes:

1. **LeakyReLU-square slope 0.3.** PR #2014 inherited the older LeakyReLU(0.5)^2 MLP slope. This changes the fused/eager LeakyReLU-square path to slope 0.3, following the later PR #1967 lineage.
2. **Token-only in-timer online n-gram tilt during TTT eval.** This ports the in-timer online n-gram tilt approach we introduced in PR #2018 into the PR #2014 progressive-context / short-doc TTT path. Hints are built causally from validation tokens inside the measured TTT eval timer (`NGRAM_HINT_PRECOMPUTE_OUTSIDE=0`) and applied as a scoring-time posterior adjustment to per-token NLL. The within-word, word-start, and agreement channels are disabled (`WITHIN_BOOST=0.0`, `WORD_BOOST=0.0`, `AGREE_ADD_BOOST=0.0`).
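For reference, the activation in change 1 is LeakyReLU followed by squaring; a minimal scalar sketch (the real path is a fused/eager tensor kernel):

```python
def leaky_relu_sq(x: float, slope: float = 0.3) -> float:
    # LeakyReLU(slope) then square; this submission moves slope from 0.5 to 0.3
    y = x if x > 0 else slope * x
    return y * y
```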

The n-gram path does not add model parameters and has no artifact-size cost beyond source files. The run keeps the PR #2014 global prefix phase (`PHASED_TTT_PREFIX_DOCS=2500`) and uses larger TTT chunks to fit hint construction and scoring inside the 600s eval budget.
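One way a scoring-time posterior tilt of this kind can be expressed (a sketch under assumptions; the actual adjustment in `online_ngram_tilt.py` may differ in detail): multiply the hinted token's probability by `exp(boost)` and renormalize, which changes the target NLL as follows.

```python
import math

def tilt_nll(nll_target: float, p_hint: float, target_is_hint: bool,
             boost: float = 2.625) -> float:
    # Tilted distribution: p'(v) ∝ p(v) * exp(boost * [v == hint]).
    # Normalizer Z = 1 + p_hint * (exp(boost) - 1), so the target NLL becomes
    # nll - boost*[target == hint] + log Z.
    # When the target IS the hint, p_hint should equal exp(-nll_target).
    log_z = math.log1p(p_hint * (math.exp(boost) - 1.0))
    return nll_target - (boost if target_is_hint else 0.0) + log_z
```

When the hint already carries all the mass (`p_hint = 1`) the tilt is a no-op; when the hint is correct but under-predicted, the NLL drops; when the hint is wrong, the NLL rises by `log Z`.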

## What changed vs PR #2014

| Component | PR #2014 | This submission |
|-----------|----------|-----------------|
| Base stack | Progressive 3k context growth + ShortDoc TTT + CaseOps + LQER + AWQ-lite | Same |
| LeakyReLU-square slope | 0.5 | 0.3 |
| Eval-time n-gram tilt | off | token-only, causal, in timer |
| N-gram hint timing | n/a | `NGRAM_HINT_PRECOMPUTE_OUTSIDE=0` |
| Global phased TTT prefix | 2500 docs | 2500 docs |
| TTT chunking | 48 / short 24 | 64 / short 32 |
| Artifact | PR #2014 per-group compressed artifact | same compression path, 15,989,637 bytes max |
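The score-first chunked TTT ordering shared by both columns can be sketched as a loop that scores each chunk with the current adapter state before adapting on it (hypothetical `score`/`update` callables):

```python
def score_first_chunks(chunks, score, update):
    # Score each chunk BEFORE any update on it, so no chunk is ever
    # scored by weights that have already seen that chunk.
    losses = []
    for chunk in chunks:
        losses.append(score(chunk))
        update(chunk)
    return losses
```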

## Compliance notes

- **Training cap:** all three corrected seeds stopped under 600s (`596.069s`, `596.182s`, `596.099s`).
- **Eval cap:** all three corrected final TTT evals are under 600s (`506.254s`, `553.458s`, `473.217s`). All use `NGRAM_HINT_PRECOMPUTE_OUTSIDE=0`, so n-gram hint generation is inside the measured eval timer.
- **Artifact cap:** max observed `Total submission size quantized+pergroup` is 15,989,637 bytes, under 16 MB.
- **Score-first TTT:** the LoRA TTT path scores each chunk before any per-doc update. The global prefix SGD phase runs after the prefix docs have already been scored.
- **N-gram causality:** hints are generated by a single left-to-right pass over validation tokens and aligned to target positions. The tilt uses prefix-derived token hint IDs and boosts; it does not inspect future tokens for the scored position.
- **Token-only diagnostic:** all corrected evals report `ngram_tilt:hints total=47853343 gated=628156 token_gate=628156 within_gate=0 word_gate=0 agree2plus=0`.
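The causality property claimed above can be illustrated with a minimal single-pass hint builder (a simplification of what `online_ngram_tilt.py` does; names and gating here are illustrative): counts are updated only after the hint for a position is emitted, so the hint at position `t` depends on `tokens[:t]` alone.

```python
from collections import defaultdict

def causal_token_hints(tokens, order=3, threshold=0.8):
    # context -> counts of the token that followed it so far
    counts = defaultdict(lambda: defaultdict(int))
    hints = []
    for t, tok in enumerate(tokens):
        ctx = tuple(tokens[max(0, t - order):t])
        nxt = counts[ctx]
        total = sum(nxt.values())
        hint = None
        if total:
            top_tok, top_cnt = max(nxt.items(), key=lambda kv: kv[1])
            if top_cnt / total >= threshold:
                hint = top_tok
        hints.append(hint)
        counts[ctx][tok] += 1  # update only AFTER emitting the hint
    return hints
```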

## Key settings

```bash
CASEOPS_ENABLED=1
VOCAB_SIZE=8192
TRAIN_SEQ_LEN=3072
ROPE_TRAIN_SEQ_LEN=3072
TRAIN_SEQ_SCHEDULE=1024@0.100,2048@0.700,3072@1.000
TRAIN_SEQ_SCHEDULE_MODE=wallclock
SEQ_CHANGE_WARMUP_STEPS=32
COMPILE_SHAPE_WARMUP=1
EVAL_SEQ_LEN=3072
EVAL_STRIDE=1536

LEAKY_RELU_SQ_SLOPE=0.3

TTT_ENABLED=1
TTT_EVAL_SEQ_LEN=3072
TTT_BATCH_SIZE=24
TTT_CHUNK_SIZE=64
TTT_SHORT_SCORE_FIRST_ENABLED=1
TTT_SHORT_DOC_LEN=2000
TTT_SHORT_CHUNK_SIZE=32
TTT_SHORT_SCORE_FIRST_STEPS=256:16,2000:32
TTT_LORA_RANK=80
TTT_LORA_LR=0.0001
TTT_LOCAL_LR_MULT=0.75
TTT_MASK=no_qv
TTT_Q_LORA=0
TTT_V_LORA=0
TTT_WEIGHT_DECAY=0.5
TTT_BETA2=0.99
PHASED_TTT_PREFIX_DOCS=2500
PHASED_TTT_NUM_PHASES=1

NGRAM_TILT_ENABLED=1
NGRAM_HINT_PRECOMPUTE_OUTSIDE=0
TOKEN_ORDER=16
TOKEN_THRESHOLD=0.800
TOKEN_BOOST=2.625
WITHIN_TAU=0.450
WITHIN_BOOST=0.0
WORD_ORDER=4
WORD_NORMALIZE=strip_punct_lower
WORD_TAU=0.650
WORD_BOOST=0.0
AGREE_ADD_BOOST=0.0

WARMDOWN_FRAC=0.85
BETA2=0.99
QK_GAIN_INIT=5.25
SPARSE_ATTN_GATE_ENABLED=1
SPARSE_ATTN_GATE_SCALE=0.5
GATED_ATTN_QUANT_GATE=1
SMEAR_GATE_ENABLED=1
GATE_WINDOW=12
FUSED_CE_ENABLED=1
MATRIX_LR=0.026
MIN_LR=0.1
GRAD_CLIP_NORM=0.3
EMBED_BITS=7
EMBED_CLIP_SIGMAS=14.0
MATRIX_CLIP_SIGMAS=12.85
ATTN_CLIP_SIGMAS=13.0
MLP_CLIP_SIGMAS=11.5
LQER_ENABLED=1
LQER_RANK=4
LQER_TOP_K=3
LQER_FACTOR_BITS=4
LQER_ASYM_ENABLED=1
LQER_ASYM_GROUP=64
AWQ_LITE_ENABLED=1
AWQ_LITE_BITS=8
AWQ_LITE_GROUP_TOP_K=1
AWQ_LITE_GROUP_SIZE=64
ASYM_LOGIT_RESCALE=1
COMPRESSOR=pergroup
GPTQ_RESERVE_SECONDS=4.0
GPTQ_CALIBRATION_BATCHES=16
VAL_LOSS_EVERY=0
```
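These settings are plain environment variables; a script in this lineage would typically read them with small typed helpers (hypothetical names; `train_gpt.py` may parse them differently), including the `threshold:value` list format used by `TTT_SHORT_SCORE_FIRST_STEPS`:

```python
import os

def env_float(name: str, default: float) -> float:
    # Typed read of a single knob, e.g. LEAKY_RELU_SQ_SLOPE
    return float(os.environ.get(name, default))

def parse_steps_schedule(spec: str) -> list[tuple[int, int]]:
    # "256:16,2000:32" -> [(256, 16), (2000, 32)]
    return [tuple(int(x) for x in pair.split(":")) for pair in spec.split(",")]
```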

## Files

- `train_gpt.py` — full script for the candidate.
- `online_ngram_tilt.py`, `online_ngram_state.c` — online causal n-gram hint builder and scoring-time tilt helper, from the PR #2018 in-timer n-gram tilt work.
- `train_eval_seed42_corrected_token_only.log` — corrected seed-42 token-only in-timer TTT eval log, using the corrected seed-42 training artifact.
- `train_eval_seed0_corrected_token_only.log` — corrected seed-0 training, quantization, and token-only in-timer TTT eval log.
- `train_eval_seed314_corrected_token_only.log` — corrected seed-314 training, quantization, and token-only in-timer TTT eval log.
- `train_seed42.log`, `eval_seed42_ngram_p0_c64.log`, `eval_seed42_ngram_p2500_c64.log`, `train_eval_seed314.log`, `train_eval_seed0.log` — superseded initial #2140 logs retained for transparency; these used the accidental within-word / word-start / agreement n-gram channels and are not the corrected token-only result.
- `prepare_caseops_data.py`, `lossless_caps.py`, `tokenizers/...model` — CaseOps data/tokenizer helpers from the merged #1855 lineage.
- `submission.json` — structured 3-seed metadata.

## Reproducing

After preparing the CaseOps data and tokenizer, run with the environment above:

```bash
SEED=42 DATA_PATH=./data/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved TOKENIZER_PATH=./data/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model torchrun --standalone --nproc_per_node=8 train_gpt.py
```

For the eval-only sweep used here, load the saved quantized artifact and run with:

```bash
TTT_EVAL_ONLY=1 NGRAM_TILT_ENABLED=1 NGRAM_HINT_PRECOMPUTE_OUTSIDE=0 WITHIN_BOOST=0.0 WORD_BOOST=0.0 AGREE_ADD_BOOST=0.0 PHASED_TTT_PREFIX_DOCS=2500 TTT_CHUNK_SIZE=64 TTT_SHORT_CHUNK_SIZE=32 TTT_SHORT_SCORE_FIRST_STEPS=256:16,2000:32 torchrun --standalone --nproc_per_node=8 train_gpt.py
```

## Credits

This is a stack on top of the recent strict-compliance CaseOps line. Most directly:

- PR #2014 by @simonbissonnette — progressive 3k context growth and short-doc TTT base.
- PR #2018 — in-timer online n-gram tilt during TTT eval.
- PR #1967 / PR #1948 lineage — LeakyReLU-square slope 0.3 evidence.
- PR #1855 by @codemath3000 — merged CaseOps / SparseGate / LQER / per-group compression lineage and data-prep precedent.
- PRs #1787, #1736, #1729, #1667, #1626, #1530, #1344, #493, #478, #315, and #289 for the underlying architecture, optimizer, tokenizer, quantization, compression, and legal score-first TTT components.