
Non-record: Quinary quantization + SP16384 + per-group lrzip + TTT - bpb 1.1384 #2086

Open
deniskurlov wants to merge 1 commit into openai:main from deniskurlov:add-quinary-non-record-submission

Conversation

@deniskurlov

Quinary {-2,-1,0,+1,+2} weights (5-state, base-5 packed) + 10L (5 Encoder + 5 Decoder) 576d U-Net + Muon + 4× relu² MLP + Tied Embed (380→576) + Poly5 Softcap + YaRN 2048 + SP16384 BPE + FP8 QAT + 5-bit Scale Quant + Layout-Aware Per-Stream Archive + Score-First TTT (3 epochs, fp16-calibration-only)

bpb 1.1384 ± 0.0009 std (3-seed TTT mean) | 15.72 MB total artifact max (all 3 seeds FIT) | 8×H100 SXM, 7,800 steps in 599s + ~3.6 min TTT-eval

Results (3 seeds, 8×H100 SXM)

| Seed       | TTT BPB         | Roundtrip BPB   | Total bytes |
|------------|-----------------|-----------------|-------------|
| 42         | 1.1381          | 1.1626          | 15,714,938  |
| 1337       | 1.1394          | 1.1633          | 15,721,124  |
| 7          | 1.1378          | 1.1622          | 15,724,839  |
| Mean ± std | 1.1384 ± 0.0009 | 1.1627 ± 0.0006 |             |

Non-record submission to track_non_record_16mb. Direct quinary {-2,-1,0,+1,+2} fork of @CiprianFlorin-Ifrim's 2026-03-24 ternary record (PR #640, 1.1570 sliding BPB), exploring whether one tick up the discrete-weight axis (3 → 5 levels per parameter) buys more than it costs at fixed compute and 16 MB budget.
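
For a sense of what one tick up costs in raw storage (back-of-envelope arithmetic, not figures from either PR): ternary carries log2 3 ≈ 1.585 bits per weight, quinary log2 5 ≈ 2.322, and base-5 packing gets close to that bound:

```python
import math

print(math.log2(3))  # ternary: ~1.585 bits/weight
print(math.log2(5))  # quinary: ~2.322 bits/weight

# Byte-aligned base-5 packing: 5**3 = 125 <= 256, so 3 quinary weights
# fit in one byte -> 8/3 ~ 2.667 bits/weight. Packing longer runs
# approaches the entropy bound, e.g. 55 weights fit exactly in 16 bytes
# since 5**55 < 2**128:
print(128 / 55)  # ~2.327 bits/weight
```

So quinary spends roughly 0.7–1.1 extra bits per weight depending on packing, which the extra expressiveness has to claw back in BPB at a fixed 16 MB budget.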

Architecture (config)

| Component     | Setting                                           |
|---------------|---------------------------------------------------|
| Layers        | 10 (5 encoder + 5 decoder, symmetric U-Net)       |
| Model dim     | 576                                               |
| Heads         | 6 query / 3 KV (GQA), head_dim=96                 |
| MLP           | 4× expansion, hidden=2304, relu² activation       |
| Embed         | tied, 16384 vocab, 380→576 bottleneck             |
| RoPE          | YaRN, base=5000, max_len=2048                     |
| Softcap       | poly5, cap=10                                     |
| Quinary       | group size 192, per-group absmean                 |
| Optimizer     | Muon (matrix params), Adam (scalars + tied embed) |
| Batch / seq   | 524,288 tok / 1024                                |
| Wallclock cap | 599 s                                             |
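
The quinary row is the core trick; a minimal sketch of how I'd implement "group size 192, per-group absmean" plus base-5 packing (function names and details are mine, not lifted from train_gpt.py):

```python
import torch
import torch.nn.functional as F

def quinary_quantize(w: torch.Tensor, group_size: int = 192):
    """Quantize to {-2,-1,0,+1,+2} with one absmean scale per group.
    Sketch only; assumes w.numel() is divisible by group_size."""
    g = w.reshape(-1, group_size)
    scale = g.abs().mean(dim=1, keepdim=True).clamp_min(1e-8)  # per-group absmean
    q = (g / scale).round().clamp_(-2, 2).to(torch.int8)       # 5 levels
    return q, scale  # dequant is simply q.float() * scale

def pack_base5(q: torch.Tensor) -> torch.Tensor:
    """Pack quinary digits 3-per-byte: 5**3 = 125 <= 256 (sketch)."""
    d = (q.flatten() + 2).to(torch.uint8)        # shift {-2..+2} -> {0..4}
    d = F.pad(d, (0, -d.numel() % 3))            # pad length to a multiple of 3
    d = d.reshape(-1, 3)
    return d[:, 0] + 5 * d[:, 1] + 25 * d[:, 2]  # one byte per triple, max 124
```

(The 5-bit scale quant and FP8 QAT pieces from the title sit on top of this; they're out of scope for the sketch.)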

Package contents

records/track_non_record_16mb/2026-04-30_Quinary_53M_10L_576d_SP16384_TTT/:

  • README.md — full Diff-from-ternary-record table, BPB-denominator audit methodology, validation accounting
  • submission.json — metadata + per-seed seed_results with verified eval_tokens / eval_bytes
  • setup.sh, run.sh, requirements.txt, train_gpt.py — reproducer (a bare torchrun reproduces the canonical config; Hyperparameters defaults match run.sh line-for-line)
  • verify_bpb.py — standalone reviewer-runnable BPB-LUT-vs-SentencePiece-decoder check (256-int32 header loader matching train_gpt.py, exact eval-slice check, BOS-delimited docwise decoder, tokenizer SHA-256, UNK guard); the core idea is sketched just below this list
  • fineweb_16384_bpe.model + .vocab — bundled tokenizer (sha256 abaec140336…ac432a); also published at deniskurlov/parameter-golf-fineweb-sp16384
  • quinary_seed{42,1337,7}.txt — three full per-seed training/TTT logs
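
The core of that audit, stripped of the header/BOS/UNK plumbing (a toy restatement assuming the standard sentencepiece Python API; the bundled verify_bpb.py is the authoritative version):

```python
import numpy as np
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="fineweb_16384_bpe.model")

# Byte-length LUT: UTF-8 bytes each token id contributes when decoded,
# mapping the SentencePiece word-boundary marker to one space byte.
# (Special ids like BOS/UNK need the guards the real script has.)
LUT = np.array(
    [len(sp.id_to_piece(i).replace("\u2581", " ").encode("utf-8"))
     for i in range(sp.vocab_size())],
    dtype=np.int64,
)

def audit(ids: np.ndarray) -> None:
    lut_bytes = int(LUT[ids].sum())
    dec_bytes = len(sp.decode(ids.tolist()).encode("utf-8"))
    assert lut_bytes == dec_bytes, (lut_bytes, dec_bytes)  # PR reports delta=+0

# The byte count this pins down is the BPB denominator:
#   bpb = total_nll_nats / (math.log(2) * eval_bytes)
```

One caveat the real script handles and this toy does not: SentencePiece drops the leading boundary marker when decoding a sequence, so naive LUT sums and decode lengths can disagree at document starts (hence the BOS-delimited docwise decoder).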

Reproducer

cd records/track_non_record_16mb/2026-04-30_Quinary_53M_10L_576d_SP16384_TTT
bash setup.sh          # apt lrzip + pip + FA3 + ~23 GB HF download
SEED=42 bash run.sh    # ~10 min train + ~3.6 min TTT eval
python3 verify_bpb.py  # ~1 min, BPB-LUT denominator audit

Compliance

  • bytes_total ≤ 16,000,000 (max-seed = 15,724,839 — seed=7; margin 275,161)
  • Training ≤ 10 min per seed (~599 s wallclock each, × 3 seeds)
  • Evaluation ≤ 10 min (TTT 213–216 s + roundtrip ~80 s)
  • Score-first TTT on the fp16 calibration parameters only (~42k = 0.08% of the model); quinary weights frozen
  • BPB byte-count LUT verified against SentencePiece decoder on the exact scored slice (delta=+0)
  • Runs end-to-end from bash setup.sh && bash run.sh on a fresh 8×H100 SXM pod (reverified on a fully nuked and re-cloned pod, 2026-05-01)
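
On the score-first TTT point, a minimal sketch of the shape of such a loop (hypothetical names, single pass for brevity where the PR runs 3 epochs; train_gpt.py is authoritative). Each chunk is scored before any gradient touches it, and only the calibration subset is trainable:

```python
import torch
import torch.nn.functional as F

def score_first_ttt(model, chunks, calib_params, lr=1e-3):
    """Score each eval chunk, then adapt on it. Sketch, not the repo's code."""
    for p in model.parameters():
        p.requires_grad_(False)          # quinary weights stay frozen
    for p in calib_params:
        p.requires_grad_(True)           # ~42k fp16 calibration params only
    opt = torch.optim.Adam(calib_params, lr=lr)

    total_nll = 0.0
    for x, y in chunks:                  # (input ids, next-token targets)
        with torch.no_grad():            # 1) score first: no leakage into the score
            total_nll += F.cross_entropy(model(x).flatten(0, 1), y.flatten(),
                                         reduction="sum").item()
        opt.zero_grad(set_to_none=True)  # 2) then take one step on the same chunk
        F.cross_entropy(model(x).flatten(0, 1), y.flatten()).backward()
        opt.step()
    return total_nll                     # nats; / (ln 2 * eval_bytes) gives BPB
```

The score-then-step ordering means a chunk's loss is always computed before that chunk updates the model within a pass; the PR's 3-epoch variant and its accounting are documented in the bundled README.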

Acknowledgements

…1.1384 BPB, 3-seed)

  3-seed mean TTT BPB 1.1384 ± 0.0009 std on 8×H100 SXM (10 min training) in
  records/track_non_record_16mb/2026-04-30_Quinary_53M_10L_576d_SP16384_TTT/.

  Direct quinary {-2,-1,0,+1,+2} fork of @CiprianFlorin-Ifrim's 2026-03-24
  ternary record (PR openai#640): inherits the U-Net topology + Muon + factored
  tied embedding + FP8 QAT + YaRN + FlashAttention-3, swaps ternary -> quinary
  (-0.021 roundtrip BPB), SP8192 -> SP16384 tokenizer, single-blob LZMA ->
  layout-aware per-stream archive (structurally based on @codemath3000's
  PR openai#1855), and stride-16 sliding eval -> score-first TTT on the 42,364
  fp16 calibration parameters (-0.024 BPB). 15.72 MB max-seed total,
  275 KB margin under the 16 MB cap.

  BPB denominator audit closed end-to-end: bundled verify_bpb.py shows the
  LUT byte count matches SentencePiece decoder bytes exactly on the scored
  slice (delta=+0), and that count (151,078,879) is bit-identical to the
  runtime eval_bytes printed by train_gpt.py for every seed.

  See the records folder README for full diff vs the ternary record + the
  verifier output.
@CiprianFlorin-Ifrim
Contributor

Great work, nice to see someone expand beyond the XNOR, binary, and ternary stuff.
Btw, there's a V2 of the Ternary that is slightly better: #920. Take a look, maybe you'll find something else that helps you.

@deniskurlov
Author

> Great work, nice to see someone expand beyond the XNOR, binary, and ternary stuff.
> Btw, there's a V2 of the Ternary that is slightly better: #920. Take a look, maybe you'll find something else that helps you.

Thanks! Just looked at #920 — two main diffs from V1: a larger EMBED_DIM (which I'd already pushed further, since I'm on a bigger vocab), and an fp16→bf16 swap on the per-group dequant scales. As far as I understand, the bf16 motivation in the PR is specifically about the 1/(1-zero_frac) shrinkage correction amplifying fp16 rounding — which I don't use in quinary (it's clumsy with two non-zero magnitudes). So unclear to me whether it ports cleanly, but I'll ablate it on top of my stack and report back either way.

@CiprianFlorin-Ifrim
Contributor

@deniskurlov yep, those 2 are the main ones; the warmdown improvement might not matter to you as a 3rd, and the quant-scale calculation change also might not matter, since with 5 values the scale has a lower effect.

