
Non-record: Quinary quantization + SP16384 + per-group lrzip + TTT - bpb 1.1384 #2086

Open
deniskurlov wants to merge 1 commit into openai:main from deniskurlov:add-quinary-non-record-submission

Conversation

@deniskurlov

Quinary {-2,-1,0,+1,+2} weights (5-state, base-5 packed) + 10L (5 Encoder + 5 Decoder) 576d U-Net + Muon + 4× relu² MLP + Tied Embed (380→576) + Poly5 Softcap + YaRN 2048 + SP16384 BPE + FP8 QAT + 5-bit Scale Quant + Layout-Aware Per-Stream Archive + Score-First TTT (3 epochs, fp16-calibration-only)

bpb 1.1384 ± 0.0009 std (3-seed TTT mean) | 15.72 MB total artifact max (all 3 seeds FIT) | 8×H100 SXM, 7,800 steps in 599s + ~3.6 min TTT-eval

Results (3 seeds, 8×H100 SXM)

| Seed       | TTT BPB         | Roundtrip BPB   | Total bytes |
|------------|-----------------|-----------------|-------------|
| 42         | 1.1381          | 1.1626          | 15,714,938  |
| 1337       | 1.1394          | 1.1633          | 15,721,124  |
| 7          | 1.1378          | 1.1622          | 15,724,839  |
| Mean ± std | 1.1384 ± 0.0009 | 1.1627 ± 0.0006 |             |

Non-record submission to track_non_record_16mb. Direct quinary {-2,-1,0,+1,+2} fork of @CiprianFlorin-Ifrim's 2026-03-24 ternary record (PR #640, 1.1570 sliding BPB), exploring whether one tick up the discrete-weight axis (3 → 5 levels per parameter) buys more than it costs at fixed compute and 16 MB budget.
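
For a sense of what one tick up costs in raw storage (back-of-envelope arithmetic, not figures from either PR): ternary carries log2 3 ≈ 1.585 bits per weight, quinary log2 5 ≈ 2.322, and base-5 packing gets close to that bound:

```python
import math

print(math.log2(3))  # ternary: ~1.585 bits/weight
print(math.log2(5))  # quinary: ~2.322 bits/weight

# Byte-aligned base-5 packing: 5**3 = 125 <= 256, so 3 quinary weights
# fit in one byte -> 8/3 ~ 2.667 bits/weight. Packing longer runs
# approaches the entropy bound, e.g. 55 weights fit exactly in 16 bytes
# since 5**55 < 2**128:
print(128 / 55)  # ~2.327 bits/weight
```

So quinary spends roughly 0.7–1.1 extra bits per weight depending on packing, which the extra expressiveness has to claw back in BPB at a fixed 16 MB budget.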

Architecture (config)

| Component     | Setting                                           |
|---------------|---------------------------------------------------|
| Layers        | 10 (5 encoder + 5 decoder, symmetric U-Net)       |
| Model dim     | 576                                               |
| Heads         | 6 query / 3 KV (GQA), head_dim=96                 |
| MLP           | 4× expansion, hidden=2304, relu² activation       |
| Embed         | tied, 16384 vocab, 380→576 bottleneck             |
| RoPE          | YaRN, base=5000, max_len=2048                     |
| Softcap       | poly5, cap=10                                     |
| Quinary       | group size 192, per-group absmean                 |
| Optimizer     | Muon (matrix params), Adam (scalars + tied embed) |
| Batch / seq   | 524,288 tok / 1024                                |
| Wallclock cap | 599 s                                             |
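
The quinary row is the core trick; a minimal sketch of how I'd implement "group size 192, per-group absmean" plus base-5 packing (function names and details are mine, not lifted from train_gpt.py):

```python
import torch
import torch.nn.functional as F

def quinary_quantize(w: torch.Tensor, group_size: int = 192):
    """Quantize to {-2,-1,0,+1,+2} with one absmean scale per group.
    Sketch only; assumes w.numel() is divisible by group_size."""
    g = w.reshape(-1, group_size)
    scale = g.abs().mean(dim=1, keepdim=True).clamp_min(1e-8)  # per-group absmean
    q = (g / scale).round().clamp_(-2, 2).to(torch.int8)       # 5 levels
    return q, scale  # dequant is simply q.float() * scale

def pack_base5(q: torch.Tensor) -> torch.Tensor:
    """Pack quinary digits 3-per-byte: 5**3 = 125 <= 256 (sketch)."""
    d = (q.flatten() + 2).to(torch.uint8)        # shift {-2..+2} -> {0..4}
    d = F.pad(d, (0, -d.numel() % 3))            # pad length to a multiple of 3
    d = d.reshape(-1, 3)
    return d[:, 0] + 5 * d[:, 1] + 25 * d[:, 2]  # one byte per triple, max 124
```

(The 5-bit scale quant and FP8 QAT pieces from the title sit on top of this; they're out of scope for the sketch.)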

Package contents

records/track_non_record_16mb/2026-04-30_Quinary_53M_10L_576d_SP16384_TTT/:

  • README.md — full Diff-from-ternary-record table, BPB-denominator audit methodology, validation accounting
  • submission.json — metadata + per-seed seed_results with verified eval_tokens / eval_bytes
  • setup.sh, run.sh, requirements.txt, train_gpt.py — reproducer (a bare torchrun reproduces the canonical config; Hyperparameters defaults match run.sh line-for-line)
  • verify_bpb.py — standalone reviewer-runnable BPB-LUT-vs-SentencePiece-decoder check (256-int32 header loader matching train_gpt.py, exact eval-slice check, BOS-delimited docwise decoder, tokenizer SHA-256, UNK guard); the core idea is sketched just below this list
  • fineweb_16384_bpe.model + .vocab — bundled tokenizer (sha256 abaec140336…ac432a); also published at deniskurlov/parameter-golf-fineweb-sp16384
  • quinary_seed{42,1337,7}.txt — three full per-seed training/TTT logs
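
The core of that audit, stripped of the header/BOS/UNK plumbing (a toy restatement assuming the standard sentencepiece Python API; the bundled verify_bpb.py is the authoritative version):

```python
import numpy as np
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="fineweb_16384_bpe.model")

# Byte-length LUT: UTF-8 bytes each token id contributes when decoded,
# mapping the SentencePiece word-boundary marker to one space byte.
# (Special ids like BOS/UNK need the guards the real script has.)
LUT = np.array(
    [len(sp.id_to_piece(i).replace("\u2581", " ").encode("utf-8"))
     for i in range(sp.vocab_size())],
    dtype=np.int64,
)

def audit(ids: np.ndarray) -> None:
    lut_bytes = int(LUT[ids].sum())
    dec_bytes = len(sp.decode(ids.tolist()).encode("utf-8"))
    assert lut_bytes == dec_bytes, (lut_bytes, dec_bytes)  # PR reports delta=+0

# The byte count this pins down is the BPB denominator:
#   bpb = total_nll_nats / (math.log(2) * eval_bytes)
```

One caveat the real script handles and this toy does not: SentencePiece drops the leading boundary marker when decoding a sequence, so naive LUT sums and decode lengths can disagree at document starts (hence the BOS-delimited docwise decoder).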

Reproducer

cd records/track_non_record_16mb/2026-04-30_Quinary_53M_10L_576d_SP16384_TTT
bash setup.sh          # apt lrzip + pip + FA3 + ~23 GB HF download
SEED=42 bash run.sh    # ~10 min train + ~3.6 min TTT eval
python3 verify_bpb.py  # ~1 min, BPB-LUT denominator audit

Compliance

  • bytes_total ≤ 16,000,000 (max-seed = 15,724,839 — seed=7; margin 275,161)
  • Training ≤ 10 min per seed (~599 s wallclock each, × 3 seeds)
  • Evaluation ≤ 10 min (TTT 213–216 s + roundtrip ~80 s)
  • Score-first TTT on the fp16 calibration parameters only (~42k = 0.08% of the model); quinary weights frozen
  • BPB byte-count LUT verified against SentencePiece decoder on the exact scored slice (delta=+0)
  • Runs end-to-end from bash setup.sh && bash run.sh on a fresh 8×H100 SXM pod (reverified on a fully nuked and re-cloned pod, 2026-05-01)
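
On the score-first TTT point, a minimal sketch of the shape of such a loop (hypothetical names, single pass for brevity where the PR runs 3 epochs; train_gpt.py is authoritative). Each chunk is scored before any gradient touches it, and only the calibration subset is trainable:

```python
import torch
import torch.nn.functional as F

def score_first_ttt(model, chunks, calib_params, lr=1e-3):
    """Score each eval chunk, then adapt on it. Sketch, not the repo's code."""
    for p in model.parameters():
        p.requires_grad_(False)          # quinary weights stay frozen
    for p in calib_params:
        p.requires_grad_(True)           # ~42k fp16 calibration params only
    opt = torch.optim.Adam(calib_params, lr=lr)

    total_nll = 0.0
    for x, y in chunks:                  # (input ids, next-token targets)
        with torch.no_grad():            # 1) score first: no leakage into the score
            total_nll += F.cross_entropy(model(x).flatten(0, 1), y.flatten(),
                                         reduction="sum").item()
        opt.zero_grad(set_to_none=True)  # 2) then take one step on the same chunk
        F.cross_entropy(model(x).flatten(0, 1), y.flatten()).backward()
        opt.step()
    return total_nll                     # nats; / (ln 2 * eval_bytes) gives BPB
```

The score-then-step ordering means a chunk's loss is always computed before that chunk updates the model within a pass; the PR's 3-epoch variant and its accounting are documented in the bundled README.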

Acknowledgements

…1.1384 BPB, 3-seed)

  3-seed mean TTT BPB 1.1384 ± 0.0009 std on 8×H100 SXM (10 min training) in
  records/track_non_record_16mb/2026-04-30_Quinary_53M_10L_576d_SP16384_TTT/.

  Direct quinary {-2,-1,0,+1,+2} fork of @CiprianFlorin-Ifrim's 2026-03-24
  ternary record (PR openai#640): inherits the U-Net topology + Muon + factored
  tied embedding + FP8 QAT + YaRN + FlashAttention-3, swaps ternary -> quinary
  (-0.021 roundtrip BPB), SP8192 -> SP16384 tokenizer, single-blob LZMA ->
  layout-aware per-stream archive (structurally based on @codemath3000's
  PR openai#1855), and stride-16 sliding eval -> score-first TTT on the 42,364
  fp16 calibration parameters (-0.024 BPB). 15.72 MB max-seed total,
  275 KB margin under the 16 MB cap.

  BPB denominator audit closed end-to-end: bundled verify_bpb.py shows the
  LUT byte count matches SentencePiece decoder bytes exactly on the scored
  slice (delta=+0), and that count (151,078,879) is bit-identical to the
  runtime eval_bytes printed by train_gpt.py for every seed.

  See the records folder README for full diff vs the ternary record + the
  verifier output.
@CiprianFlorin-Ifrim
Contributor

Great work, nice to see someone expand beyond the XNOR, binary, and ternary stuff.
Btw, there's a V2 of the Ternary that is slightly better: #920. Take a look, maybe you'll find something else that helps you.

@deniskurlov
Author

> Great work, nice to see someone expand beyond the XNOR, binary, and ternary stuff.
> Btw, there's a V2 of the Ternary that is slightly better: #920. Take a look, maybe you'll find something else that helps you.

Thanks! Just looked at #920 — two main diffs from V1: a larger EMBED_DIM (which I'd already pushed further, since I'm on a bigger vocab), and an fp16→bf16 swap on the per-group dequant scales. As far as I understand, the bf16 motivation in the PR is specifically about the 1/(1-zero_frac) shrinkage correction amplifying fp16 rounding — which I don't use in quinary (it's clumsy with two non-zero magnitudes). So unclear to me whether it ports cleanly, but I'll ablate it on top of my stack and report back either way.

@CiprianFlorin-Ifrim
Contributor

@deniskurlov yep, those 2 are the main ones; the warmdown improvement might not matter to you as a 3rd, and the quant-scale calculation change also might not matter, since with 5 values the scale has a lower effect.

