
Record: SP8192 CaseOps + TTT + GPTQ + LRZIP — val_bpb 1.05993 (3-seed mean) #1934

Open
liujshi wants to merge 5 commits into openai:main from liujshi:record-lrzip

Conversation


@liujshi liujshi commented Apr 29, 2026

Summary

11L 512d 8H/4KV transformer trained on 8×H100 SXM, with three hyperparameter overrides (tightened quant clips + embed weight decay). Key ingredients:

  • Architecture: U-Net skips, parallel residuals (start layer 8), partial RoPE (16 dims, base 10000), depth recurrence (loop layers 3–5, NUM_LOOPS=2), sparse attention head-output gate, SmearGate (window 12)
  • Training: Polar-Express Newton-Schulz Muon optimizer, fused softcapped CE Triton kernel
  • Tokenizer: CaseOps bijective case transform (SP8192)
  • Quantization: GPTQ int6 + int7 embed, LQER asymmetric INT2/INT4 rank-4 quant correction (top-3 tensors, group 64)
  • Compression: per-group lrzip + brotli pipeline (COMPRESSOR=pergroup, from PR #1855)
  • Eval: phased TTT (3 phases, score-first, prefix 2000 docs)
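Depth recurrence is the least standard ingredient above; a minimal sketch of the idea, assuming a plain PyTorch block list (module names and structure are illustrative, not this repo's actual code):

```python
import torch.nn as nn

class LoopedStack(nn.Module):
    """Applies blocks [loop_start, loop_end] num_loops times per forward
    pass, reusing their weights -- extra depth at no parameter cost."""
    def __init__(self, blocks, loop_start=3, loop_end=5, num_loops=2):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)
        self.loop_start, self.loop_end = loop_start, loop_end
        self.num_loops = num_loops

    def forward(self, x):
        i = 0
        while i < len(self.blocks):
            if i == self.loop_start:
                for _ in range(self.num_loops):  # NUM_LOOPS=2 in this run
                    for blk in self.blocks[self.loop_start:self.loop_end + 1]:
                        x = blk(x)
                i = self.loop_end + 1
            else:
                x = self.blocks[i](x)
                i += 1
        return x
```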

3-seed mean: 1.05993 BPB (std 0.00059) / 2.31951 nats (std 0.00106) on 8×H100 SXM, all artifacts under the 16 MB cap.

| seed | post-TTT val_bpb | val_loss (nats) | artifact bytes | eval_time |
|------|------------------|-----------------|----------------|-----------|
| 42   | 1.05932556       | 2.31819699      | 15,979,215     | 513.3 s   |
| 314  | 1.05993748       | 2.31953610      | 15,981,858     | 525.7 s   |
| 999  | 1.06051274       | 2.32079499      | 15,982,243     | 514.1 s   |
| mean | 1.05993          | 2.31951         | 15,981,105     | 517.7 s   |

vs current leaderboard (1.0810 BPB): −0.02107 BPB / −0.04609 nats.

Changes from PR #1797 base

This submission takes PR #1797 (@dexhunter) as its direct base (which itself extends PR #1787 by @nprime06) and applies three targeted changes:

  1. Per-group lrzip compression (COMPRESSOR=pergroup): replaces PR #1797's default brotli-only compressor with the per-group lrzip + brotli pipeline from PR #1855 (@codemath3000), saving ~280 KB of artifact size.
  2. Tightened quant clips: ATTN_CLIP_SIGMAS 13.0 → 12.0, EMBED_CLIP_SIGMAS 15.0 → 12.0, MLP_CLIP_SIGMAS stays at 12.0. A tighter clip shrinks the range GPTQ must cover, improving quantization fidelity by ~0.001 BPB (see the sketch after this list).
  3. Lower embed weight decay: EMBED_WD 0.085 → 0.06, improving post-quant generalization.
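
A hedged sketch of the clip rule (the helper name is an assumption; only the sigma values come from this submission):

```python
import torch

def clip_to_sigmas(w: torch.Tensor, k: float) -> torch.Tensor:
    """Clamp weights to +/- k standard deviations. A tighter k shrinks
    the range the quantizer must cover, so each integer step is finer."""
    s = w.std()
    return w.clamp(-k * s, k * s)

attn_w = clip_to_sigmas(torch.randn(512, 512), 12.0)   # ATTN_CLIP_SIGMAS: 13.0 -> 12.0
emb_w  = clip_to_sigmas(torch.randn(8192, 512), 12.0)  # EMBED_CLIP_SIGMAS: 15.0 -> 12.0
```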

All other hyperparameters are unchanged from PR #1797 defaults (beta2=0.95, warmdown_frac=0.75, ttt_lora_rank=96, ttt_beta2=0.999, ttt_weight_decay=1.0, sparse_attn_gate_scale=1.0, phased_ttt_prefix_docs=2000).

Per-group compression pipeline

PR #1797's base only exposes lzma / brotli compressors. This submission adds a per-group serializer (COMPRESSOR=pergroup) from PR #1855:

  1. Buckets the int6 GPTQ tensors by role (qo_bank, kv_bank, mlp_up_bank, mlp_down_bank, etc.) so similarly-distributed weights compress together.
  2. For "hot" 2D groups (_tok_emb, attn.c_q, mlp.fc), runs an L1 nearest-neighbour similarity sort on rows before transposing — adjacent rows in the serialized stream are now numerically close, giving the entropy coder longer runs of small deltas. Permutation indices are stored as uint16 and brotli-compressed.
  3. Compresses each group blob with lrzip -z -L 9 (ZPAQ context-mixing back-end). lrzip's long-range deduplication catches cross-tensor repetition that brotli's 24-bit window misses.
  4. Falls back to brotli for the remainder (state-dict scaffolding, scales, LQER factors, gate tensors) and the code wrapper.

The lrzip binary must be present on the system (apt-get install lrzip). The script shells out via subprocess.run.
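
A minimal end-to-end sketch of that flow, assuming numpy int arrays for the GPTQ tensors and the `brotli` Python package; the function names, the greedy form of the row sort, and the temp-file plumbing are illustrative, not PR #1855's actual code:

```python
import os
import subprocess
import tempfile

import brotli          # pip install brotli
import numpy as np

HOT_GROUPS = ("_tok_emb", "attn.c_q", "mlp.fc")  # row-sorted before transpose

def l1_row_sort(mat: np.ndarray) -> np.ndarray:
    """Greedy nearest-neighbour ordering of rows under L1 distance, so
    adjacent rows in the serialized stream are numerically close."""
    remaining = list(range(1, len(mat)))
    order = [0]
    while remaining:
        last = mat[order[-1]].astype(np.int32)
        dists = [np.abs(mat[r].astype(np.int32) - last).sum() for r in remaining]
        order.append(remaining.pop(int(np.argmin(dists))))
    return np.asarray(order, dtype=np.uint16)    # stored as uint16 + brotli

def lrzip_compress(raw: bytes) -> bytes:
    """Shell out to `lrzip -z -L 9` (ZPAQ back-end) via temp files."""
    with tempfile.TemporaryDirectory() as d:
        src, dst = os.path.join(d, "blob"), os.path.join(d, "blob.lrz")
        with open(src, "wb") as f:
            f.write(raw)
        subprocess.run(["lrzip", "-z", "-L", "9", "-f", "-o", dst, src],
                       check=True)
        with open(dst, "rb") as f:
            return f.read()

def compress_group(name: str, tensors: list[np.ndarray]):
    """Serialize one role bucket; returns (lrzip blob, brotli'd perms)."""
    payload, perms = [], []
    for t in tensors:
        if name in HOT_GROUPS and t.ndim == 2:
            perm = l1_row_sort(t)
            perms.append(brotli.compress(perm.tobytes()))
            t = t[perm].T                        # transpose after row sort
        payload.append(np.ascontiguousarray(t).tobytes())
    return lrzip_compress(b"".join(payload)), perms
```

Everything outside the hot buckets (state-dict scaffolding, scales, LQER factors, gate tensors) would go through plain brotli, per step 4.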

Hyperparameter overrides (vs PR #1797)

| hparam            | this submission | PR #1797 default |
|-------------------|-----------------|------------------|
| ATTN_CLIP_SIGMAS  | 12.0            | 13.0             |
| EMBED_CLIP_SIGMAS | 12.0            | 15.0             |
| EMBED_WD          | 0.06            | 0.085            |

Full hyperparameter snapshot (from seed 42 log)

| hparam | value |
|--------|-------|
| model_dim | 512 |
| num_layers | 11 |
| num_heads / num_kv_heads | 8 / 4 |
| mlp_mult | 4.0 |
| loop_start / loop_end / num_loops | 3 / 5 / 2 |
| parallel_start_layer | 8 |
| rope_base / rope_dims | 10000 / 16 |
| logit_softcap | 30.0 |
| matrix_bits / embed_bits | 6 / 7 |
| matrix_clip_sigmas | 12.85 |
| attn_clip_sigmas | 12.0 |
| mlp_clip_sigmas | 12.0 |
| embed_clip_sigmas | 12.0 |
| matrix_lr / min_lr | 0.026 / 0.1 |
| embed_wd / muon_wd | 0.06 / 0.095 |
| ema_decay | 0.9965 |
| beta1 / beta2 | 0.9 / 0.95 |
| warmdown_frac | 0.75 |
| grad_clip_norm | 0.3 |
| qk_gain_init | 5.0 |
| gate_window (SmearGate) | 12 |
| sparse_attn_gate_scale | 1.0 |
| lqer_rank / lqer_top_k / lqer_factor_bits | 4 / 3 / 4 |
| lqer_asym_group | 64 |
| ttt_lora_rank | 96 |
| ttt_beta2 / ttt_weight_decay | 0.999 / 1.0 |
| ttt_chunk_size / ttt_batch_size | 48 / 64 |
| phased_ttt_num_phases / phased_ttt_prefix_docs | 3 / 2000 |
| compressor | pergroup |
| gptq_calibration_batches / gptq_reserve_seconds | 16 / 0.5 |
| eval_seq_len / eval_stride | 2048 / 64 |
| train_seq_len / train_batch_tokens | 2048 / 786432 |
| vocab_size | 8192 |
| model_params | 35,945,671 |

Lineage

See README.md in this folder for full architecture details, rule compliance, and credits.

Test plan

  • Trains within 600s wallclock on 8×H100 80GB SXM (4974–4984 steps achieved, ~120.3 ms/step mean)
  • All 3 artifacts under 16 MB cap (max 15,982,243 B; min 15,979,215 B; ~18–21 KB headroom)
  • TTT eval completes within 600s eval cap (max 525.7 s)
  • 3-seed mean 1.05993 BPB reproduced; per-seed numbers verified in attached logs
  • All hyperparameters verified against seed 42 log dump

@liujshi liujshi marked this pull request as ready for review April 29, 2026 16:24
alertcat added a commit to alertcat/parameter-golf that referenced this pull request Apr 29, 2026
…review

Seed 42 v1: FORCE_STOP_STEP=4920 + GPTQ_RESERVE=0.5 -> wallclock 602.048s (borderline)
Seed 42 v2: GPTQ_RESERVE=4.0, no FORCE_STOP_STEP -> wallclock 596.102s (strict <600s)

v2 results:
  seed 42:   val_bpb 1.058675 (was 1.058336 in v1, +0.000339 due to 12 fewer steps)
  seed 0:    val_bpb 1.059394 (unchanged)
  seed 1234: val_bpb 1.060243 (unchanged)
  MEAN:      1.059434 (was 1.059324 in v1, +0.000110)
  STD:       0.000642 (was 0.000780 in v1, TIGHTER)

All 3 seeds now strict <600s wallclock (596.045-596.102s).
All 3 seeds use IDENTICAL config (GPTQ_RESERVE=4.0, no FSS).

Comparisons:
  vs PR openai#1908 frontier (1.06081):  -0.00138 (Welch t=2.18, p=0.045)
  vs PR openai#1855 official openai#1 (1.06108): -0.00165
  vs PR openai#1934 liujshi (1.05993):    -0.00050 (Welch t=0.85, p=0.22, edge of p<0.25)
  vs win threshold (1.06021):       -0.00078
  vs MERGED SOTA bigbag (1.0810):   -0.02157

Compliance: all 3 seeds train+eval strict <600s, artifact <16MB,
3-phase TTT score-first, lossless CaseOps tokenizer, lrzip pergroup.

Files updated:
  - V21_README.md: revised results table + revisions note
  - submission.json: v2 numbers + revisions field
  - train_seed42.log: replaced with strict <600s redo log
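
The Welch t-values quoted in this commit can be sanity-checked with scipy; a sketch using the per-seed BPBs from the two tables above (the commit's exact inputs and sidedness conventions may differ, so treat this as illustrative):

```python
from scipy.stats import ttest_ind

this_pr     = [1.05932556, 1.05993748, 1.06051274]  # PR #1934 seeds 42 / 314 / 999
alertcat_v2 = [1.058675, 1.059394, 1.060243]        # referenced commit seeds 42 / 0 / 1234

# Welch's t-test: two samples, unequal variances assumed
t, p_two_sided = ttest_ind(alertcat_v2, this_pr, equal_var=False)
print(f"Welch t = {t:.2f}, one-sided p = {p_two_sided / 2:.3f}")
```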
