
Record: SP8192 CaseOps + TTT + GPTQ + LRZIP — val_bpb 1.05993 (3-seed mean) #1934

Open
liujshi wants to merge 5 commits into openai:main from liujshi:record-lrzip

Conversation


@liujshi liujshi commented Apr 29, 2026

Summary

11L 512d 8H/4KV transformer trained on 8×H100 SXM, with three hyperparameter overrides (tightened quant clips + embed weight decay). Key ingredients:

  • Architecture: U-Net skips, parallel residuals (start layer 8), partial RoPE (16 dims, base 10000), depth recurrence (loop layers 3–5, NUM_LOOPS=2), sparse attention head-output gate, SmearGate (window 12)
  • Training: Polar-Express Newton-Schulz Muon optimizer, fused softcapped CE Triton kernel
  • Tokenizer: CaseOps bijective case transform (SP8192)
  • Quantization: GPTQ int6 + int7 embed, LQER asymmetric INT2/INT4 rank-4 quant correction (top-3 tensors, group 64)
  • Compression: per-group lrzip + brotli pipeline (COMPRESSOR=pergroup, from PR #1855)
  • Eval: phased TTT (3 phases, score-first, prefix 2000 docs)
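Depth recurrence is the least standard ingredient above; a minimal sketch of the idea, assuming a plain PyTorch block list (module names and structure are illustrative, not this repo's actual code):

```python
import torch.nn as nn

class LoopedStack(nn.Module):
    """Applies blocks [loop_start, loop_end] num_loops times per forward
    pass, reusing their weights -- extra depth at no parameter cost."""
    def __init__(self, blocks, loop_start=3, loop_end=5, num_loops=2):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)
        self.loop_start, self.loop_end = loop_start, loop_end
        self.num_loops = num_loops

    def forward(self, x):
        i = 0
        while i < len(self.blocks):
            if i == self.loop_start:
                for _ in range(self.num_loops):  # NUM_LOOPS=2 in this run
                    for blk in self.blocks[self.loop_start:self.loop_end + 1]:
                        x = blk(x)
                i = self.loop_end + 1
            else:
                x = self.blocks[i](x)
                i += 1
        return x
```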

3-seed mean: 1.05993 BPB (std 0.00059) / 2.31951 nats (std 0.00106) on 8×H100 SXM, all artifacts under the 16 MB cap.

| seed | post-TTT val_bpb | val_loss (nats) | artifact bytes | eval_time |
|------|------------------|-----------------|----------------|-----------|
| 42   | 1.05932556       | 2.31819699      | 15,979,215     | 513.3 s   |
| 314  | 1.05993748       | 2.31953610      | 15,981,858     | 525.7 s   |
| 999  | 1.06051274       | 2.32079499      | 15,982,243     | 514.1 s   |
| mean | 1.05993          | 2.31951         | 15,981,105     | 517.7 s   |

vs current leaderboard (1.0810 BPB): −0.02107 BPB / −0.04609 nats.

Changes from PR #1797 base

This submission takes PR #1797 (@dexhunter) as its direct base (which itself extends PR #1787 by @nprime06) and applies three targeted changes:

  1. Per-group lrzip compression (COMPRESSOR=pergroup): replaces PR #1797's default brotli-only compressor with the per-group lrzip + brotli pipeline from PR #1855 (@codemath3000), saving ~280 KB of artifact size.
  2. Tightened quant clips: ATTN_CLIP_SIGMAS 13.0 → 12.0, EMBED_CLIP_SIGMAS 15.0 → 12.0, MLP_CLIP_SIGMAS stays at 12.0. A tighter clip shrinks the range GPTQ must cover, improving quantization fidelity by ~0.001 BPB (see the sketch after this list).
  3. Lower embed weight decay: EMBED_WD 0.085 → 0.06, improving post-quant generalization.
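
A hedged sketch of the clip rule (the helper name is an assumption; only the sigma values come from this submission):

```python
import torch

def clip_to_sigmas(w: torch.Tensor, k: float) -> torch.Tensor:
    """Clamp weights to +/- k standard deviations. A tighter k shrinks
    the range the quantizer must cover, so each integer step is finer."""
    s = w.std()
    return w.clamp(-k * s, k * s)

attn_w = clip_to_sigmas(torch.randn(512, 512), 12.0)   # ATTN_CLIP_SIGMAS: 13.0 -> 12.0
emb_w  = clip_to_sigmas(torch.randn(8192, 512), 12.0)  # EMBED_CLIP_SIGMAS: 15.0 -> 12.0
```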

All other hyperparameters are unchanged from PR #1797 defaults (beta2=0.95, warmdown_frac=0.75, ttt_lora_rank=96, ttt_beta2=0.999, ttt_weight_decay=1.0, sparse_attn_gate_scale=1.0, phased_ttt_prefix_docs=2000).

Per-group compression pipeline

PR #1797's base only exposes lzma / brotli compressors. This submission adds a per-group serializer (COMPRESSOR=pergroup) from PR #1855:

  1. Buckets the int6 GPTQ tensors by role (qo_bank, kv_bank, mlp_up_bank, mlp_down_bank, etc.) so similarly-distributed weights compress together.
  2. For "hot" 2D groups (_tok_emb, attn.c_q, mlp.fc), runs an L1 nearest-neighbour similarity sort on rows before transposing — adjacent rows in the serialized stream are now numerically close, giving the entropy coder longer runs of small deltas. Permutation indices are stored as uint16 and brotli-compressed.
  3. Compresses each group blob with lrzip -z -L 9 (ZPAQ context-mixing back-end). lrzip's long-range deduplication catches cross-tensor repetition that brotli's 24-bit window misses.
  4. Falls back to brotli for the remainder (state-dict scaffolding, scales, LQER factors, gate tensors) and the code wrapper.

The lrzip binary must be present on the system (apt-get install lrzip). The script shells out via subprocess.run.
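
A minimal end-to-end sketch of that flow, assuming numpy int arrays for the GPTQ tensors and the `brotli` Python package; the function names, the greedy form of the row sort, and the temp-file plumbing are illustrative, not PR #1855's actual code:

```python
import os
import subprocess
import tempfile

import brotli          # pip install brotli
import numpy as np

HOT_GROUPS = ("_tok_emb", "attn.c_q", "mlp.fc")  # row-sorted before transpose

def l1_row_sort(mat: np.ndarray) -> np.ndarray:
    """Greedy nearest-neighbour ordering of rows under L1 distance, so
    adjacent rows in the serialized stream are numerically close."""
    remaining = list(range(1, len(mat)))
    order = [0]
    while remaining:
        last = mat[order[-1]].astype(np.int32)
        dists = [np.abs(mat[r].astype(np.int32) - last).sum() for r in remaining]
        order.append(remaining.pop(int(np.argmin(dists))))
    return np.asarray(order, dtype=np.uint16)    # stored as uint16 + brotli

def lrzip_compress(raw: bytes) -> bytes:
    """Shell out to `lrzip -z -L 9` (ZPAQ back-end) via temp files."""
    with tempfile.TemporaryDirectory() as d:
        src, dst = os.path.join(d, "blob"), os.path.join(d, "blob.lrz")
        with open(src, "wb") as f:
            f.write(raw)
        subprocess.run(["lrzip", "-z", "-L", "9", "-f", "-o", dst, src],
                       check=True)
        with open(dst, "rb") as f:
            return f.read()

def compress_group(name: str, tensors: list[np.ndarray]):
    """Serialize one role bucket; returns (lrzip blob, brotli'd perms)."""
    payload, perms = [], []
    for t in tensors:
        if name in HOT_GROUPS and t.ndim == 2:
            perm = l1_row_sort(t)
            perms.append(brotli.compress(perm.tobytes()))
            t = t[perm].T                        # transpose after row sort
        payload.append(np.ascontiguousarray(t).tobytes())
    return lrzip_compress(b"".join(payload)), perms
```

Everything outside the hot buckets (state-dict scaffolding, scales, LQER factors, gate tensors) would go through plain brotli, per step 4.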

Hyperparameter overrides (vs PR #1797)

| hparam            | this submission | PR #1797 default |
|-------------------|-----------------|------------------|
| ATTN_CLIP_SIGMAS  | 12.0            | 13.0             |
| EMBED_CLIP_SIGMAS | 12.0            | 15.0             |
| EMBED_WD          | 0.06            | 0.085            |

Full hyperparameter snapshot (from seed 42 log)

| hparam | value |
|--------|-------|
| model_dim | 512 |
| num_layers | 11 |
| num_heads / num_kv_heads | 8 / 4 |
| mlp_mult | 4.0 |
| loop_start / loop_end / num_loops | 3 / 5 / 2 |
| parallel_start_layer | 8 |
| rope_base / rope_dims | 10000 / 16 |
| logit_softcap | 30.0 |
| matrix_bits / embed_bits | 6 / 7 |
| matrix_clip_sigmas | 12.85 |
| attn_clip_sigmas | 12.0 |
| mlp_clip_sigmas | 12.0 |
| embed_clip_sigmas | 12.0 |
| matrix_lr / min_lr | 0.026 / 0.1 |
| embed_wd / muon_wd | 0.06 / 0.095 |
| ema_decay | 0.9965 |
| beta1 / beta2 | 0.9 / 0.95 |
| warmdown_frac | 0.75 |
| grad_clip_norm | 0.3 |
| qk_gain_init | 5.0 |
| gate_window (SmearGate) | 12 |
| sparse_attn_gate_scale | 1.0 |
| lqer_rank / lqer_top_k / lqer_factor_bits | 4 / 3 / 4 |
| lqer_asym_group | 64 |
| ttt_lora_rank | 96 |
| ttt_beta2 / ttt_weight_decay | 0.999 / 1.0 |
| ttt_chunk_size / ttt_batch_size | 48 / 64 |
| phased_ttt_num_phases / phased_ttt_prefix_docs | 3 / 2000 |
| compressor | pergroup |
| gptq_calibration_batches / gptq_reserve_seconds | 16 / 0.5 |
| eval_seq_len / eval_stride | 2048 / 64 |
| train_seq_len / train_batch_tokens | 2048 / 786432 |
| vocab_size | 8192 |
| model_params | 35,945,671 |

Lineage

See README.md in this folder for full architecture details, rule compliance, and credits.

Test plan

  • Trains within 600s wallclock on 8×H100 80GB SXM (4974–4984 steps achieved, ~120.3 ms/step mean)
  • All 3 artifacts under 16 MB cap (max 15,982,243 B; min 15,979,215 B; ~18–21 KB headroom)
  • TTT eval completes within 600s eval cap (max 525.7 s)
  • 3-seed mean 1.05993 BPB reproduced; per-seed numbers verified in attached logs
  • All hyperparameters verified against seed 42 log dump

@liujshi liujshi marked this pull request as ready for review April 29, 2026 16:24
alertcat added a commit to alertcat/parameter-golf that referenced this pull request Apr 29, 2026
…review

Seed 42 v1: FORCE_STOP_STEP=4920 + GPTQ_RESERVE=0.5 -> wallclock 602.048s (borderline)
Seed 42 v2: GPTQ_RESERVE=4.0, no FORCE_STOP_STEP -> wallclock 596.102s (strict <600s)

v2 results:
  seed 42:   val_bpb 1.058675 (was 1.058336 in v1, +0.000339 due to 12 fewer steps)
  seed 0:    val_bpb 1.059394 (unchanged)
  seed 1234: val_bpb 1.060243 (unchanged)
  MEAN:      1.059434 (was 1.059324 in v1, +0.000110)
  STD:       0.000642 (was 0.000780 in v1, TIGHTER)

All 3 seeds now strict <600s wallclock (596.045-596.102s).
All 3 seeds use IDENTICAL config (GPTQ_RESERVE=4.0, no FSS).

Comparisons:
  vs PR openai#1908 frontier (1.06081):  -0.00138 (Welch t=2.18, p=0.045)
  vs PR openai#1855 official openai#1 (1.06108): -0.00165
  vs PR openai#1934 liujshi (1.05993):    -0.00050 (Welch t=0.85, p=0.22, edge of p<0.25)
  vs win threshold (1.06021):       -0.00078
  vs MERGED SOTA bigbag (1.0810):   -0.02157

Compliance: all 3 seeds train+eval strict <600s, artifact <16MB,
3-phase TTT score-first, lossless CaseOps tokenizer, lrzip pergroup.

Files updated:
  - V21_README.md: revised results table + revisions note
  - submission.json: v2 numbers + revisions field
  - train_seed42.log: replaced with strict <600s redo log
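
The Welch t-values quoted in this commit can be sanity-checked with scipy; a sketch using the per-seed BPBs from the two tables above (the commit's exact inputs and sidedness conventions may differ, so treat this as illustrative):

```python
from scipy.stats import ttest_ind

this_pr     = [1.05932556, 1.05993748, 1.06051274]  # PR #1934 seeds 42 / 314 / 999
alertcat_v2 = [1.058675, 1.059394, 1.060243]        # referenced commit seeds 42 / 0 / 1234

# Welch's t-test: two samples, unequal variances assumed
t, p_two_sided = ttest_ind(alertcat_v2, this_pr, equal_var=False)
print(f"Welch t = {t:.2f}, one-sided p = {p_two_sided / 2:.3f}")
```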
