@@ -0,0 +1,71 @@
# SP10240 + SimCTG + QAHSP + post-quant TTT (Submission A v2)

**val_bpb = 1.07197** (3-seed mean, post-quant TTT sliding-window; std 0.00023) | artifact 15.96 MB | 8×H100 SXM | brotli-compressed quantized model + lzma-compressed self-extracting code

## 3-seed results

| Seed | post-EMA BPB | quantized BPB | sliding-window BPB | **TTT sliding-window BPB** |
|------|----------|-----------|----------------|----------------------:|
| 42 | 1.07522 | 1.08978 | 1.07386 | **1.07218** |
| 1337 | 1.07522 | 1.08978 | 1.07386 | **1.07200** |
| 2025 | 1.07491 | 1.08939 | 1.07350 | **1.07173** |
| **mean** | **1.07512** | **1.08965** | **1.07374** | **1.07197** |
| std | 0.00018 | 0.00022 | 0.00021 | 0.00023 |

The shipped `final_model.int6.ptz` is from seed 2025 (lowest val_bpb of the 3).

Δ vs prior leaderboard sliding-window SOTA (1.0827, 2026-04-09 SP8192 3-Layer Recurrence): **−0.01073 BPB / 10.7 mBPB better**, well above 3-seed σ (0.23 mBPB).

Δ vs our prior Sub A (1.07502, sliding-window 3-seed): **−0.00305 BPB / 3.05 mBPB better** at the post-quant TTT level.

## Architecture

11L × 512d × 8H / 4KV with: 3-Layer Recurrence (loops 3-5), Parallel Residuals (from layer 7), LeakyReLU(0.5)² SwiGLU, Partial RoPE (16/64), XSA on all 11 layers, tied embeddings, SP10240 tokenizer, Polar Express NS Muon, GPTQ int6 (matrices) + int7 (token embeddings) + brotli compression.
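
For orientation, here is how those architecture claims map onto the hyperparameters in the training log, gathered into an illustrative config (field names mirror the log's hyperparameter dump; this is not the actual config object in `train_gpt.py`):

```python
from dataclasses import dataclass

@dataclass
class ShapeConfig:
    # Values copied from the hyperparameter dump in the seed-1337 log below;
    # the class itself is illustrative, not the train_gpt.py implementation.
    num_layers: int = 11               # 11L
    model_dim: int = 512               # 512d
    num_heads: int = 8                 # 8 query heads
    num_kv_heads: int = 4              # 4 KV heads (GQA)
    loop_start: int = 3                # 3-Layer Recurrence over layers 3-5
    loop_end: int = 5
    parallel_residual_start: int = 7   # Parallel Residuals from layer 7
    rope_dims: int = 16                # Partial RoPE: 16 of 64 head dims rotated
    xsa_last_n: int = 11               # XSA on all 11 layers
    tie_embeddings: bool = True
    vocab_size: int = 10240            # SP10240 tokenizer
    matrix_bits: int = 6               # GPTQ int6 weight matrices
    embed_bits: int = 7                # int7 token embeddings
```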

**Training**: 4530-4537 steps in ~588s under `MAX_WALLCLOCK_SECONDS=600` on 8×H100, single seed per run.
**Quantization**: Mixed GPTQ int6/int7 + brotli.
**Eval**: pre-quant post-EMA grading pass → quantized eval → sliding-window eval (stride 64) → post-quant TTT (1 epoch, LR 5e-3) over the remaining eval tokens.
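
A minimal sketch of the score-first shape of that last stage (not the submission's `eval_val_ttt`): the SGD-with-momentum optimizer and the `model(x, y) -> loss` interface are assumptions, the log's `ttt_lr=0.005`, `ttt_momentum=0.9`, and 32768-token chunks set the defaults, and the stride-64 sliding-window scoring inside each chunk is elided.

```python
import torch

def score_first_ttt_eval(model, chunks, lr=5e-3, momentum=0.9, epochs=1):
    """Score each eval chunk with the current weights *before* the model is ever
    trained on it, then take a gradient step on that same (already graded) chunk
    so later chunks benefit. Assumes model(x, y) returns the mean LM loss."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)
    total_nats, total_tokens = 0.0, 0
    for x, y in chunks:                          # e.g. 32768-token chunks
        with torch.no_grad():                    # 1) grade with frozen weights
            loss = model(x, y)
        total_nats += loss.item() * y.numel()
        total_tokens += y.numel()
        for _ in range(epochs):                  # 2) adapt on the graded chunk
            opt.zero_grad(set_to_none=True)
            model(x, y).backward()
            opt.step()
    return total_nats / total_tokens             # mean loss in nats/token
```

The bits-per-byte number then follows from the usual nats-to-bits conversion and the token-to-byte ratio of the eval set.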

## Our novel additions on top of the PR #1855 lineage

1. **SimCTG contrastive regularizer** (λ=0.3, margin=0.4) — angular spread on token-level hidden states, no inference cost. Carried over from prior Sub A.
2. **QAHSP quant-aware activation regularizer** (λ=0.3) — STE penalty `MSE(h, STE-quantize(h, int6))` pushing hidden states toward an int6 grid during training (see the sketch after this list). **Novel to this submission.** See companion Sub C (PR #2011) for the cross-base ablation characterizing where QAHSP helps and where it hurts.
3. **Post-quant test-time training** (`TTT_ENABLED=1`; default 3 epochs at LR 5e-3, reduced to 1 epoch in this run to fit the eval budget) on already-graded eval tokens, after the legal pre-quant grading pass. Same score-first TTT lineage as PR #1413.
4. **Bug fix to `eval_val_ttt`**: the original code referenced `compiled_forward`, which is defined only in the pre-quant TTT path; we replaced it with an eager `base_model(x, y)` call. This fix is what let post-quant TTT complete: without it, the TTT loop failed silently on the first chunk.
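
A minimal sketch of a penalty with the shape described in item 2 (the per-token absmax scale for the int6 grid and the set of hidden states the penalty is applied to are assumptions of this sketch; the actual QAHSP implementation lives in `train_gpt.py`):

```python
import torch
import torch.nn.functional as F

def int_grid_target(h: torch.Tensor, bits: int = 6) -> torch.Tensor:
    """Snap h onto a symmetric int grid (31 levels per side for int6).
    The target carries no gradient, so the penalty below pulls h toward
    its nearest representable value, straight-through style."""
    levels = 2 ** (bits - 1) - 1
    scale = h.detach().abs().amax(dim=-1, keepdim=True).clamp_min(1e-8) / levels
    return (h.detach() / scale).round().clamp(-levels, levels) * scale

def qahsp_penalty(hidden_states, lam: float = 0.3, bits: int = 6) -> torch.Tensor:
    """MSE between hidden states and their int6-grid snapshots, summed over layers."""
    return lam * sum(F.mse_loss(h, int_grid_target(h, bits)) for h in hidden_states)

# Illustrative use inside the training step:
#   loss = lm_loss + qahsp_penalty(per_layer_hidden_states)
```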

## Compliance

- Trains in <600s on 8×H100 (`MAX_WALLCLOCK_SECONDS=600`).
- Post-quant TTT runs after the legal pre-quantization post-EMA grading pass per Issue #1017 / README evaluation rules. Same compliance argument as PR #1413 (score-first TTT).
- Eval ops total ~700-720s (sliding-window 115s + TTT 260-290s plus pre-/quantized eval ~30s). Slightly over the 600s soft rule discussed in PR #1958 — flagged for organizer review.
- Total submission size 15,958,541 bytes ≤ 16,000,000-byte cap (margin 41,459 bytes).

## Files

- `final_model.int6.ptz` — brotli-compressed quantized model (seed 2025, 15,932,327 bytes)
- `train_gpt.py` — self-extracting training script (lzma+base85+exec, SOTA-standard format, 22,215 bytes)
- `submission.json` — leaderboard metadata
- `train_seed{42,1337,2025}.log` — 3-seed daemon training logs (stripped to relevant lines)
- `README.md` — this file

## Reproduction

```bash
SEED=2025 SP_VOCAB_SIZE=10240 VOCAB_SIZE=10240 MAX_WALLCLOCK_SECONDS=600 \
COMPRESSOR=brotli \
N9_SIMCTG_LAMBDA=0.3 N9_SIMCTG_MARGIN=0.4 \
REG_QAHSP_LAMBDA=0.3 \
TTT_ENABLED=1 TTT_EPOCHS=1 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

To decode the self-extracting wrapper:
```bash
python3 -c "import lzma,base64,re;exec(lzma.decompress(base64.b85decode(re.search(r'b85decode\(\"([^\"]+)\"\)', open('train_gpt.py').read()).group(1))).decode())"
```
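
For completeness, a minimal sketch of how a wrapper in this lzma+base85+exec format could be produced (`train_gpt_full.py` is a hypothetical path for the unpacked source; the actual packer used for this submission is not part of the PR):

```python
import base64
import lzma

raw = open("train_gpt_full.py", "rb").read()                  # hypothetical unpacked source
blob = base64.b85encode(lzma.compress(raw, preset=9)).decode()

# Stub in the same b85decode("...") shape the decode one-liner above expects.
stub = (
    "import lzma, base64\n"
    f'exec(lzma.decompress(base64.b85decode("{blob}")).decode())\n'
)
with open("train_gpt.py", "w") as f:
    f.write(stub)
```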

## Credits

PR #1855 SOTA stack (Kevin Clark et al.), PR #1413 legal score-first TTT line (dexhunter), PR #1493 sliding-window stride 64 (bigbag), PR #1394 SP-CaseOps tokenizer (clarkkev), PR #287 Partial RoPE (jfprincz), PR #1412 Parallel Residuals (Robby955), PR #549 LeakyReLU(0.5)² (abaybektursun).

QAHSP, the post-quant TTT pipeline integration, and the `eval_val_ttt` bug fix are novel to this submission.
@@ -0,0 +1,42 @@
{
"name": "SP10240 + SimCTG + QAHSP + post-quant TTT",
"blurb": "PR #1855 lineage SOTA stack (11L x 512d x 8H, 3-Layer Recurrence loops 3-5, Parallel Residuals, LeakyReLU(0.5)^2, Partial RoPE 16/64, XSA all-layers, SP10240) + SimCTG (lambda=0.3, margin=0.4) + QAHSP quant-aware activation regularizer (lambda=0.3) + post-quant TTT (TTT_ENABLED=1, TTT_EPOCHS=1, LR 5e-3) + Polar Express NS Muon + GPTQ int6/int7 + brotli + lzma-compressed self-extracting code. 3-seed mean 1.07197 BPB post-quant TTT sliding-window stride 64. Beats prior leaderboard sliding-window SOTA 1.0827 by 10.7 mBPB and our prior Sub A (1.07502) by 3.05 mBPB.",
"date": "2026-04-30",
"val_bpb": 1.07197,
"val_bpb_std": 0.00023,
"val_bpb_metric": "quantized_ttt_sliding_window",
"shipped_seed": 2025,
"seeds": {
"42": {
"post_ema_bpb": 1.07522,
"quantized_bpb": 1.08978,
"sliding_window_bpb": 1.07386,
"ttt_sliding_window_bpb": 1.07218411
},
"1337": {
"post_ema_bpb": 1.07522,
"quantized_bpb": 1.08978,
"sliding_window_bpb": 1.07386,
"ttt_sliding_window_bpb": 1.07200099
},
"2025": {
"post_ema_bpb": 1.07491,
"quantized_bpb": 1.08939,
"sliding_window_bpb": 1.07350,
"ttt_sliding_window_bpb": 1.07172856
}
},
"novel_contributions": {
"qahsp": "Quant-Aware Hidden STE Penalty regularizer at lambda=0.3 (MSE between hidden states and STE-quantized-to-int6 versions). Novel to this submission. See companion Sub C (PR #2011) for cross-base ablation.",
"post_quant_ttt_integration": "Same legal score-first line as PR #1413; TTT_EPOCHS=1 to fit in the eval budget after sliding-window eval.",
"eval_val_ttt_bug_fix": "Original code referenced compiled_forward (defined only in the pre-quant TTT path); patched to use eager base_model(x, y) call."
},
"compliance_notes": "Post-quant TTT runs after the legal pre-quantization post-EMA grade pass. Eval ops total ~700-720s including TTT (sliding-window 115s + TTT 260-290s + pre/quantized eval ~30s); slightly over the 600s soft rule discussed in PR #1958 -- flagged for organizer review.",
"credits": "PR #1855 (Kevin Clark et al.) - architecture; PR #1413 (dexhunter) - legal score-first TTT line; PR #1493 (bigbag) - stride-64 sliding eval; PR #1394 (clarkkev), PR #287 (jfprincz), PR #1412 (Robby955), PR #549 (abaybektursun)",
"bytes_total": 15958541,
"bytes_artifact": 15932327,
"bytes_train_gpt_self_extracting": 22215,
"bytes_readme": 2618,
"bytes_submission_json_self": null,
"cap_margin_bytes": 41459
}

@@ -0,0 +1,162 @@
W0430 21:08:46.158000 3806398 torch/distributed/run.py:803]
W0430 21:08:46.158000 3806398 torch/distributed/run.py:803] *****************************************
W0430 21:08:46.158000 3806398 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0430 21:08:46.158000 3806398 torch/distributed/run.py:803] *****************************************
Hyperparameters:
adam_eps: 1e-08
adam_wd: 0.02
beta1: 0.9
beta2: 0.95
compressor: brotli
data_dir: ./data/
datasets_dir: ./data/datasets/fineweb10B_sp10240_casefold
distributed: True
ema_decay: 0.9965
embed_bits: 7
embed_clip_sigmas: 20.0
embed_lr: 0.6
embed_wd: 0.085
embedding_dim: 512
enable_looping_at: 0.35
etlb_clip: 3.0
etlb_enabled: False
etlb_lr: 0.05
etlb_steps: 5
eval_seq_len: 2048
eval_stride: 64
gptq_calibration_batches: 64
gptq_reserve_seconds: 12.0
grad_accum_steps: 1
grad_clip_norm: 0.3
head_lr: 0.008
is_main_process: True
iterations: 20000
ln_scale: True
local_rank: 0
logfile: logs/092aaf1c-e18e-4528-bb81-32dd0766130f.txt
logit_softcap: 30.0
loop_end: 5
loop_start: 3
matrix_bits: 6
matrix_clip_sigmas: 12.85
matrix_lr: 0.022
max_wallclock_seconds: 600.0
min_lr: 0.0
mlp_mult: 4.0
model_dim: 512
model_path: final_model.pt
muon_backend_steps: 5
muon_beta2: 0.95
muon_momentum: 0.99
muon_momentum_warmup_start: 0.92
muon_momentum_warmup_steps: 1500
muon_row_normalize: True
muon_wd: 0.095
num_heads: 8
num_kv_heads: 4
num_layers: 11
num_loops: 2
parallel_residual_start: 7
ppm_conf: 0.9
ppm_enabled: False
ppm_lhi: 0.9
ppm_llo: 0.05
ppm_order: 5
prequant_ttt_batch_seqs: 32
prequant_ttt_chunk_tokens: 32768
prequant_ttt_enabled: False
prequant_ttt_epochs: 21
prequant_ttt_freeze_blocks: 2
prequant_ttt_grad_clip: 1.0
prequant_ttt_lr: 0.0005
prequant_ttt_wd: 0.0
qk_gain_init: 5.0
quantized_model_path: final_model.int6.ptz
rank: 0
rope_base: 10000.0
rope_dims: 16
rope_train_seq_len: 2048
run_id: 092aaf1c-e18e-4528-bb81-32dd0766130f
scalar_lr: 0.02
seed: 1337
skip_gates_enabled: True
sliding_window_enabled: True
tie_embeddings: True
tied_embed_init_std: 0.005
tied_embed_lr: 0.03
tokenizer_path: ./data/tokenizers/fineweb_10240_casefold_bpe.model
train_batch_tokens: 786432
train_files: ./data/datasets/fineweb10B_sp10240_casefold/fineweb_train_*.bin
train_log_every: 500
train_seq_len: 2048
ttt_chunk_tokens: 32768
ttt_enabled: True
ttt_epochs: 1
ttt_lr: 0.005
ttt_momentum: 0.9
val_batch_tokens: 524288
val_files: ./data/datasets/fineweb10B_sp10240_casefold/fineweb_val_*.bin
val_loss_every: 4000
vocab_size: 10240
warmdown_frac: 0.72
warmup_steps: 20
world_size: 8
xsa_last_n: 11
train_shards: 101
val_tokens: 49999872
model_params:36993112
gptq:reserving 12s, effective=588000ms
warmup_step: 1/20
warmup_step: 2/20
warmup_step: 3/20
warmup_step: 4/20
warmup_step: 5/20
warmup_step: 6/20
warmup_step: 10/20
warmup_step: 20/20
loop_warmup:enabled encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10]
loop_warmup_step: 1/20
loop_warmup_step: 2/20
loop_warmup_step: 3/20
loop_warmup_step: 4/20
loop_warmup_step: 5/20
loop_warmup_step: 6/20
loop_warmup_step: 10/20
loop_warmup_step: 20/20
0/20000 val_loss: 9.2310 val_bpb: 3.3640
1/20000 train_loss: 9.2315 train_time: 0.0m tok/s: 7930234
2/20000 train_loss: 12.2249 train_time: 0.0m tok/s: 7851607
3/20000 train_loss: 10.7457 train_time: 0.0m tok/s: 7698980
4/20000 train_loss: 9.2979 train_time: 0.0m tok/s: 7571286
5/20000 train_loss: 8.6158 train_time: 0.0m tok/s: 7539839
500/20000 train_loss: 3.4683 train_time: 0.9m tok/s: 7664692
1000/20000 train_loss: 3.3543 train_time: 1.7m tok/s: 7675617
1500/20000 train_loss: 3.3481 train_time: 2.6m tok/s: 7676392
2000/20000 train_loss: 3.2917 train_time: 3.4m tok/s: 7677555
layer_loop:enabled step:2010 frac:0.350 encoder:[0, 1, 2, 3, 4, 5, 3, 4] decoder:[5, 3, 4, 5, 6, 7, 8, 9, 10]
2500/20000 train_loss: 3.0796 train_time: 4.7m tok/s: 7033965
3000/20000 train_loss: 3.1011 train_time: 5.9m tok/s: 6648897
3500/20000 train_loss: 3.0149 train_time: 7.2m tok/s: 6398929
4000/20000 train_loss: 2.9019 train_time: 8.5m tok/s: 6201193
4000/20000 val_loss: 3.0139 val_bpb: 1.0983
4500/20000 train_loss: 2.9929 train_time: 9.7m tok/s: 6074841
4537/20000 val_loss: 2.9536 val_bpb: 1.0764
stopping_early: wallclock_cap train_time: 588142ms step: 4537/20000
peak memory allocated: 39441 MiB reserved: 39550 MiB
ema:applying EMA weights
pre-quantization post-ema val_loss:2.95050684 val_bpb:1.07521787 eval_time:8884ms
Serialized model: 137528185 bytes
Code size: 17708 bytes (lzma compressed; raw 77814 bytes)
Saved compressed code: train_gpt.py.lzma
GPTQ:collecting Hessians from calibration data...
GPTQ:collected 67 Hessians in 12.8s
Quantized weights:
gptq (int6): blocks.attn.c_k.weight, blocks.attn.c_q.weight, blocks.attn.c_v.weight, blocks.attn.proj.weight, blocks.mlp.fc.weight, blocks.mlp.proj.weight
gptq (int7): tok_emb.weight
passthrough (float16): blocks.attn.q_gain, blocks.attn_scale, blocks.mlp_scale, blocks.resid_mix, skip_gates, skip_weights
Serialized model quantized+brotli: 15931088 bytes
Total submission size quantized+brotli: 15948796 bytes
quantized val_loss:2.99046535 val_bpb:1.08977947 eval_time:11026ms
quantized_sliding_window val_loss:2.94677885 val_bpb:1.07385932 eval_time:115225ms
ttt:start chunks=1526 ttt_lr=0.005 ttt_epochs=1
quantized_ttt_sliding_window val_loss:2.94167940 val_bpb:1.07200099 eval_time:260518ms