@@ -0,0 +1,71 @@
# SP10240 + SimCTG + QAHSP + post-quant TTT (Submission A v2)

**val_bpb = 1.07197** (3-seed mean, post-quant TTT sliding-window; std 0.00023) | artifact 15.96 MB | 8×H100 SXM | brotli-compressed quantized model + lzma-compressed self-extracting code

## 3-seed results

| Seed | post-EMA BPB | quantized BPB | sliding-window BPB | **TTT sliding-window BPB** |
|------|----------|-----------|----------------|----------------------:|
| 42 | 1.07522 | 1.08978 | 1.07386 | **1.07218** |
| 1337 | 1.07522 | 1.08978 | 1.07386 | **1.07200** |
| 2025 | 1.07491 | 1.08939 | 1.07350 | **1.07173** |
| **mean** | **1.07512** | **1.08965** | **1.07374** | **1.07197** |
| std | 0.00018 | 0.00022 | 0.00021 | 0.00023 |

The shipped `final_model.int6.ptz` is from seed 2025 (lowest val_bpb of the 3).

Δ vs prior leaderboard sliding-window SOTA (1.0827, 2026-04-09 SP8192 3-Layer Recurrence): **−0.01073 BPB / 10.7 mBPB better**, well above 3-seed σ (0.23 mBPB).

Δ vs our prior Sub A (1.07502, sliding-window 3-seed): **−0.00305 BPB / 3.05 mBPB better** at the post-quant TTT level.

## Architecture

11L × 512d × 8H / 4KV with: 3-Layer Recurrence (loops 3-5), Parallel Residuals (from layer 7), LeakyReLU(0.5)² SwiGLU, Partial RoPE (16/64), XSA on all 11 layers, tied embeddings, SP10240 tokenizer, Polar Express NS Muon, GPTQ int6 (matrices) + int7 (token embeddings) + brotli compression.
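
For orientation, here is how those architecture claims map onto the hyperparameters in the training log, gathered into an illustrative config (field names mirror the log's hyperparameter dump; this is not the actual config object in `train_gpt.py`):

```python
from dataclasses import dataclass

@dataclass
class ShapeConfig:
    # Values copied from the hyperparameter dump in the seed-1337 log below;
    # the class itself is illustrative, not the train_gpt.py implementation.
    num_layers: int = 11               # 11L
    model_dim: int = 512               # 512d
    num_heads: int = 8                 # 8 query heads
    num_kv_heads: int = 4              # 4 KV heads (GQA)
    loop_start: int = 3                # 3-Layer Recurrence over layers 3-5
    loop_end: int = 5
    parallel_residual_start: int = 7   # Parallel Residuals from layer 7
    rope_dims: int = 16                # Partial RoPE: 16 of 64 head dims rotated
    xsa_last_n: int = 11               # XSA on all 11 layers
    tie_embeddings: bool = True
    vocab_size: int = 10240            # SP10240 tokenizer
    matrix_bits: int = 6               # GPTQ int6 weight matrices
    embed_bits: int = 7                # int7 token embeddings
```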

**Training**: 4530-4537 steps in ~588s under `MAX_WALLCLOCK_SECONDS=600` on 8×H100, single seed per run.
**Quantization**: Mixed GPTQ int6/int7 + brotli.
**Eval**: pre-quant post-EMA grading pass → quantized eval → sliding-window eval (stride 64) → post-quant TTT (1 epoch, LR 5e-3) over the remaining eval tokens.
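
A minimal sketch of the score-first shape of that last stage (not the submission's `eval_val_ttt`): the SGD-with-momentum optimizer and the `model(x, y) -> loss` interface are assumptions, the log's `ttt_lr=0.005`, `ttt_momentum=0.9`, and 32768-token chunks set the defaults, and the stride-64 sliding-window scoring inside each chunk is elided.

```python
import torch

def score_first_ttt_eval(model, chunks, lr=5e-3, momentum=0.9, epochs=1):
    """Score each eval chunk with the current weights *before* the model is ever
    trained on it, then take a gradient step on that same (already graded) chunk
    so later chunks benefit. Assumes model(x, y) returns the mean LM loss."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)
    total_nats, total_tokens = 0.0, 0
    for x, y in chunks:                          # e.g. 32768-token chunks
        with torch.no_grad():                    # 1) grade with frozen weights
            loss = model(x, y)
        total_nats += loss.item() * y.numel()
        total_tokens += y.numel()
        for _ in range(epochs):                  # 2) adapt on the graded chunk
            opt.zero_grad(set_to_none=True)
            model(x, y).backward()
            opt.step()
    return total_nats / total_tokens             # mean loss in nats/token
```

The bits-per-byte number then follows from the usual nats-to-bits conversion and the token-to-byte ratio of the eval set.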

## Our novel additions on top of the PR #1855 lineage

1. **SimCTG contrastive regularizer** (λ=0.3, margin=0.4) — angular spread on token-level hidden states, no inference cost. Carried over from prior Sub A.
2. **QAHSP quant-aware activation regularizer** (λ=0.3) — STE penalty `MSE(h, STE-quantize(h, int6))` pushing hidden states toward an int6 grid during training (see the sketch after this list). **Novel to this submission.** See companion Sub C (PR #2011) for the cross-base ablation characterizing where QAHSP helps and where it hurts.
3. **Post-quant test-time training** (`TTT_ENABLED=1`; default 3 epochs at LR 5e-3, reduced to 1 epoch in this run to fit the eval budget) on already-graded eval tokens, after the legal pre-quant grading pass. Same score-first TTT lineage as PR #1413.
4. **Bug fix to `eval_val_ttt`**: the original code referenced `compiled_forward`, which is defined only in the pre-quant TTT path; we replaced it with an eager `base_model(x, y)` call. This fix is what let post-quant TTT complete: without it, the TTT loop failed silently on the first chunk.
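
A minimal sketch of a penalty with the shape described in item 2 (the per-token absmax scale for the int6 grid and the set of hidden states the penalty is applied to are assumptions of this sketch; the actual QAHSP implementation lives in `train_gpt.py`):

```python
import torch
import torch.nn.functional as F

def int_grid_target(h: torch.Tensor, bits: int = 6) -> torch.Tensor:
    """Snap h onto a symmetric int grid (31 levels per side for int6).
    The target carries no gradient, so the penalty below pulls h toward
    its nearest representable value, straight-through style."""
    levels = 2 ** (bits - 1) - 1
    scale = h.detach().abs().amax(dim=-1, keepdim=True).clamp_min(1e-8) / levels
    return (h.detach() / scale).round().clamp(-levels, levels) * scale

def qahsp_penalty(hidden_states, lam: float = 0.3, bits: int = 6) -> torch.Tensor:
    """MSE between hidden states and their int6-grid snapshots, summed over layers."""
    return lam * sum(F.mse_loss(h, int_grid_target(h, bits)) for h in hidden_states)

# Illustrative use inside the training step:
#   loss = lm_loss + qahsp_penalty(per_layer_hidden_states)
```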

## Compliance

- Trains in <600s on 8×H100 (`MAX_WALLCLOCK_SECONDS=600`).
- Post-quant TTT runs after the legal pre-quantization post-EMA grading pass per Issue #1017 / README evaluation rules. Same compliance argument as PR #1413 (score-first TTT).
- Eval ops total ~700-720s (sliding-window 115s + TTT 260-290s plus pre-/quantized eval ~30s). Slightly over the 600s soft rule discussed in PR #1958 — flagged for organizer review.
- Total submission size 15,958,541 bytes ≤ 16,000,000-byte cap (margin 41,459 bytes).

## Files

- `final_model.int6.ptz` — brotli-compressed quantized model (seed 2025, 15,932,327 bytes)
- `train_gpt.py` — self-extracting training script (lzma+base85+exec, SOTA-standard format, 22,215 bytes)
- `submission.json` — leaderboard metadata
- `train_seed{42,1337,2025}.log` — 3-seed daemon training logs (stripped to relevant lines)
- `README.md` — this file

## Reproduction

```bash
SEED=2025 SP_VOCAB_SIZE=10240 VOCAB_SIZE=10240 MAX_WALLCLOCK_SECONDS=600 \
COMPRESSOR=brotli \
N9_SIMCTG_LAMBDA=0.3 N9_SIMCTG_MARGIN=0.4 \
REG_QAHSP_LAMBDA=0.3 \
TTT_ENABLED=1 TTT_EPOCHS=1 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

To decode the self-extracting wrapper:
```bash
python3 -c "import lzma,base64,re;exec(lzma.decompress(base64.b85decode(re.search(r'b85decode\(\"([^\"]+)\"\)', open('train_gpt.py').read()).group(1))).decode())"
```
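
For completeness, a minimal sketch of how a wrapper in this lzma+base85+exec format could be produced (`train_gpt_full.py` is a hypothetical path for the unpacked source; the actual packer used for this submission is not part of the PR):

```python
import base64
import lzma

raw = open("train_gpt_full.py", "rb").read()                  # hypothetical unpacked source
blob = base64.b85encode(lzma.compress(raw, preset=9)).decode()

# Stub in the same b85decode("...") shape the decode one-liner above expects.
stub = (
    "import lzma, base64\n"
    f'exec(lzma.decompress(base64.b85decode("{blob}")).decode())\n'
)
with open("train_gpt.py", "w") as f:
    f.write(stub)
```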

## Credits

PR #1855 SOTA stack (Kevin Clark et al.), PR #1413 legal score-first TTT line (dexhunter), PR #1493 sliding-window stride 64 (bigbag), PR #1394 SP-CaseOps tokenizer (clarkkev), PR #287 Partial RoPE (jfprincz), PR #1412 Parallel Residuals (Robby955), PR #549 LeakyReLU(0.5)² (abaybektursun).

QAHSP, the post-quant TTT pipeline integration, and the `eval_val_ttt` bug fix are novel to this submission.
@@ -0,0 +1,42 @@
{
"name": "SP10240 + SimCTG + QAHSP + post-quant TTT",
"blurb": "PR #1855 lineage SOTA stack (11L x 512d x 8H, 3-Layer Recurrence loops 3-5, Parallel Residuals, LeakyReLU(0.5)^2, Partial RoPE 16/64, XSA all-layers, SP10240) + SimCTG (lambda=0.3, margin=0.4) + QAHSP quant-aware activation regularizer (lambda=0.3) + post-quant TTT (TTT_ENABLED=1, TTT_EPOCHS=1, LR 5e-3) + Polar Express NS Muon + GPTQ int6/int7 + brotli + lzma-compressed self-extracting code. 3-seed mean 1.07197 BPB post-quant TTT sliding-window stride 64. Beats prior leaderboard sliding-window SOTA 1.0827 by 10.7 mBPB and our prior Sub A (1.07502) by 3.05 mBPB.",
"date": "2026-04-30",
"val_bpb": 1.07197,
"val_bpb_std": 0.00023,
"val_bpb_metric": "quantized_ttt_sliding_window",
"shipped_seed": 2025,
"seeds": {
"42": {
"post_ema_bpb": 1.07522,
"quantized_bpb": 1.08978,
"sliding_window_bpb": 1.07386,
"ttt_sliding_window_bpb": 1.07218411
},
"1337": {
"post_ema_bpb": 1.07522,
"quantized_bpb": 1.08978,
"sliding_window_bpb": 1.07386,
"ttt_sliding_window_bpb": 1.07200099
},
"2025": {
"post_ema_bpb": 1.07491,
"quantized_bpb": 1.08939,
"sliding_window_bpb": 1.07350,
"ttt_sliding_window_bpb": 1.07172856
}
},
"novel_contributions": {
"qahsp": "Quant-Aware Hidden STE Penalty regularizer at lambda=0.3 (MSE between hidden states and STE-quantized-to-int6 versions). Novel to this submission. See companion Sub C (PR #2011) for cross-base ablation.",
"post_quant_ttt_integration": "Same legal score-first line as PR #1413; TTT_EPOCHS=1 to fit in the eval budget after sliding-window eval.",
"eval_val_ttt_bug_fix": "Original code referenced compiled_forward (defined only in the pre-quant TTT path); patched to use eager base_model(x, y) call."
},
"compliance_notes": "Post-quant TTT runs after the legal pre-quantization post-EMA grade pass. Eval ops total ~700-720s including TTT (sliding-window 115s + TTT 260-290s + pre/quantized eval ~30s); slightly over the 600s soft rule discussed in PR #1958 -- flagged for organizer review.",
"credits": "PR #1855 (Kevin Clark et al.) - architecture; PR #1413 (dexhunter) - legal score-first TTT line; PR #1493 (bigbag) - stride-64 sliding eval; PR #1394 (clarkkev), PR #287 (jfprincz), PR #1412 (Robby955), PR #549 (abaybektursun)",
"bytes_total": 15958541,
"bytes_artifact": 15932327,
"bytes_train_gpt_self_extracting": 22215,
"bytes_readme": 2618,
"bytes_submission_json_self": null,
"cap_margin_bytes": 41459
}

@@ -0,0 +1,162 @@
W0430 21:08:46.158000 3806398 torch/distributed/run.py:803]
W0430 21:08:46.158000 3806398 torch/distributed/run.py:803] *****************************************
W0430 21:08:46.158000 3806398 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0430 21:08:46.158000 3806398 torch/distributed/run.py:803] *****************************************
Hyperparameters:
adam_eps: 1e-08
adam_wd: 0.02
beta1: 0.9
beta2: 0.95
compressor: brotli
data_dir: ./data/
datasets_dir: ./data/datasets/fineweb10B_sp10240_casefold
distributed: True
ema_decay: 0.9965
embed_bits: 7
embed_clip_sigmas: 20.0
embed_lr: 0.6
embed_wd: 0.085
embedding_dim: 512
enable_looping_at: 0.35
etlb_clip: 3.0
etlb_enabled: False
etlb_lr: 0.05
etlb_steps: 5
eval_seq_len: 2048
eval_stride: 64
gptq_calibration_batches: 64
gptq_reserve_seconds: 12.0
grad_accum_steps: 1
grad_clip_norm: 0.3
head_lr: 0.008
is_main_process: True
iterations: 20000
ln_scale: True
local_rank: 0
logfile: logs/092aaf1c-e18e-4528-bb81-32dd0766130f.txt
logit_softcap: 30.0
loop_end: 5
loop_start: 3
matrix_bits: 6
matrix_clip_sigmas: 12.85
matrix_lr: 0.022
max_wallclock_seconds: 600.0
min_lr: 0.0
mlp_mult: 4.0
model_dim: 512
model_path: final_model.pt
muon_backend_steps: 5
muon_beta2: 0.95
muon_momentum: 0.99
muon_momentum_warmup_start: 0.92
muon_momentum_warmup_steps: 1500
muon_row_normalize: True
muon_wd: 0.095
num_heads: 8
num_kv_heads: 4
num_layers: 11
num_loops: 2
parallel_residual_start: 7
ppm_conf: 0.9
ppm_enabled: False
ppm_lhi: 0.9
ppm_llo: 0.05
ppm_order: 5
prequant_ttt_batch_seqs: 32
prequant_ttt_chunk_tokens: 32768
prequant_ttt_enabled: False
prequant_ttt_epochs: 21
prequant_ttt_freeze_blocks: 2
prequant_ttt_grad_clip: 1.0
prequant_ttt_lr: 0.0005
prequant_ttt_wd: 0.0
qk_gain_init: 5.0
quantized_model_path: final_model.int6.ptz
rank: 0
rope_base: 10000.0
rope_dims: 16
rope_train_seq_len: 2048
run_id: 092aaf1c-e18e-4528-bb81-32dd0766130f
scalar_lr: 0.02
seed: 1337
skip_gates_enabled: True
sliding_window_enabled: True
tie_embeddings: True
tied_embed_init_std: 0.005
tied_embed_lr: 0.03
tokenizer_path: ./data/tokenizers/fineweb_10240_casefold_bpe.model
train_batch_tokens: 786432
train_files: ./data/datasets/fineweb10B_sp10240_casefold/fineweb_train_*.bin
train_log_every: 500
train_seq_len: 2048
ttt_chunk_tokens: 32768
ttt_enabled: True
ttt_epochs: 1
ttt_lr: 0.005
ttt_momentum: 0.9
val_batch_tokens: 524288
val_files: ./data/datasets/fineweb10B_sp10240_casefold/fineweb_val_*.bin
val_loss_every: 4000
vocab_size: 10240
warmdown_frac: 0.72
warmup_steps: 20
world_size: 8
xsa_last_n: 11
train_shards: 101
val_tokens: 49999872
model_params:36993112
gptq:reserving 12s, effective=588000ms
warmup_step: 1/20
warmup_step: 2/20
warmup_step: 3/20
warmup_step: 4/20
warmup_step: 5/20
warmup_step: 6/20
warmup_step: 10/20
warmup_step: 20/20
loop_warmup:enabled encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10]
loop_warmup_step: 1/20
loop_warmup_step: 2/20
loop_warmup_step: 3/20
loop_warmup_step: 4/20
loop_warmup_step: 5/20
loop_warmup_step: 6/20
loop_warmup_step: 10/20
loop_warmup_step: 20/20
0/20000 val_loss: 9.2310 val_bpb: 3.3640
1/20000 train_loss: 9.2315 train_time: 0.0m tok/s: 7930234
2/20000 train_loss: 12.2249 train_time: 0.0m tok/s: 7851607
3/20000 train_loss: 10.7457 train_time: 0.0m tok/s: 7698980
4/20000 train_loss: 9.2979 train_time: 0.0m tok/s: 7571286
5/20000 train_loss: 8.6158 train_time: 0.0m tok/s: 7539839
500/20000 train_loss: 3.4683 train_time: 0.9m tok/s: 7664692
1000/20000 train_loss: 3.3543 train_time: 1.7m tok/s: 7675617
1500/20000 train_loss: 3.3481 train_time: 2.6m tok/s: 7676392
2000/20000 train_loss: 3.2917 train_time: 3.4m tok/s: 7677555
layer_loop:enabled step:2010 frac:0.350 encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10]
2500/20000 train_loss: 3.0796 train_time: 4.7m tok/s: 7033965
3000/20000 train_loss: 3.1011 train_time: 5.9m tok/s: 6648897
3500/20000 train_loss: 3.0149 train_time: 7.2m tok/s: 6398929
4000/20000 train_loss: 2.9019 train_time: 8.5m tok/s: 6201193
4000/20000 val_loss: 3.0139 val_bpb: 1.0983
4500/20000 train_loss: 2.9929 train_time: 9.7m tok/s: 6074841
4537/20000 val_loss: 2.9536 val_bpb: 1.0764
stopping_early: wallclock_cap train_time: 588142ms step: 4537/20000
peak memory allocated: 39441 MiB reserved: 39550 MiB
ema:applying EMA weights
pre-quantization post-ema val_loss:2.95050684 val_bpb:1.07521787 eval_time:8884ms
Serialized model: 137528185 bytes
Code size: 17708 bytes (lzma compressed; raw 77814 bytes)
Saved compressed code: train_gpt.py.lzma
GPTQ:collecting Hessians from calibration data...
GPTQ:collected 67 Hessians in 12.8s
Quantized weights:
gptq (int6): blocks.attn.c_k.weight, blocks.attn.c_q.weight, blocks.attn.c_v.weight, blocks.attn.proj.weight, blocks.mlp.fc.weight, blocks.mlp.proj.weight
gptq (int7): tok_emb.weight
passthrough (float16): blocks.attn.q_gain, blocks.attn_scale, blocks.mlp_scale, blocks.resid_mix, skip_gates, skip_weights
Serialized model quantized+brotli: 15931088 bytes
Total submission size quantized+brotli: 15948796 bytes
quantized val_loss:2.99046535 val_bpb:1.08977947 eval_time:11026ms
quantized_sliding_window val_loss:2.94677885 val_bpb:1.07385932 eval_time:115225ms
ttt:start chunks=1526 ttt_lr=0.005 ttt_epochs=1
quantized_ttt_sliding_window val_loss:2.94167940 val_bpb:1.07200099 eval_time:260518ms