Non-record: Quinary quantization + SP16384 + per-group lrzip + TTT - bpb 1.1384 #2086
deniskurlov wants to merge 1 commit into openai:main
Conversation
3-seed mean TTT BPB 1.1384 ± 0.0009 std on 8×H100 SXM (10 min training) in
records/track_non_record_16mb/2026-04-30_Quinary_53M_10L_576d_SP16384_TTT/.
Direct quinary {-2,-1,0,+1,+2} fork of @CiprianFlorin-Ifrim's 2026-03-24
ternary record (PR openai#640): inherits the U-Net topology + Muon + factored
tied embedding + FP8 QAT + YaRN + FlashAttention-3, swaps ternary -> quinary
(-0.021 BPB roundtrip), SP8192 -> SP16384 tokenizer, single-blob LZMA ->
layout-aware per-stream archive (structurally based on @codemath3000's
PR openai#1855), and stride-16 sliding eval -> score-first TTT on the 42,364
fp16 calibration parameters (-0.024 BPB). 15.72 MB max-seed total,
275 KB margin under the 16 MB cap.
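For anyone skimming, here is a minimal sketch of what base-5 packing of quinary weights can look like (purely illustrative; the submission's actual serialization may differ): map {-2,...,+2} to {0,...,4} and store three base-5 digits per byte, since 5^3 = 125 <= 256, i.e. about 2.67 bits per weight before the archive stage.

```python
import numpy as np

def pack_quinary(q: np.ndarray) -> np.ndarray:
    """Pack quinary weights {-2,-1,0,+1,+2} as three base-5 digits per byte."""
    digits = (q.astype(np.int64) + 2).ravel()            # {-2..+2} -> {0..4}
    digits = np.concatenate([digits, np.zeros((-len(digits)) % 3, dtype=np.int64)])
    d = digits.reshape(-1, 3)
    return (d[:, 0] + 5 * d[:, 1] + 25 * d[:, 2]).astype(np.uint8)

def unpack_quinary(packed: np.ndarray, n: int) -> np.ndarray:
    """Invert pack_quinary, dropping the padding digits."""
    b = packed.astype(np.int64)
    digits = np.stack([b % 5, (b // 5) % 5, (b // 25) % 5], axis=1).ravel()
    return (digits[:n] - 2).astype(np.int8)               # back to {-2..+2}

q = np.random.randint(-2, 3, size=1000).astype(np.int8)
assert np.array_equal(unpack_quinary(pack_quinary(q), q.size), q)
```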
BPB denominator audit closed end-to-end: bundled verify_bpb.py shows the
LUT byte count matches SentencePiece decoder bytes exactly on the scored
slice (delta=+0), and that count (151,078,879) is bit-identical to the
runtime eval_bytes printed by train_gpt.py for every seed.
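The shape of that check is small enough to sketch here (illustrative names only; the bundled verify_bpb.py, with its 256-int32 header loader, BOS-delimited docwise decode, tokenizer SHA-256 check and UNK guard, is the authority):

```python
import math
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="fineweb_16384_bpe.model")

# Per-token byte-length LUT built from the piece strings ("▁" -> space);
# byte-fallback and special pieces need dedicated handling in practice.
lut = [len(sp.id_to_piece(i).replace("\u2581", " ").encode("utf-8"))
       for i in range(sp.get_piece_size())]

def bpb(nll_nats_total: float, eval_token_ids: list[int]) -> float:
    lut_bytes = sum(lut[t] for t in eval_token_ids)
    dec_bytes = len(sp.decode(eval_token_ids).encode("utf-8"))
    assert lut_bytes == dec_bytes, f"denominator mismatch: {lut_bytes} vs {dec_bytes}"
    # bits per byte = total cross-entropy in nats / (ln 2 * UTF-8 byte count)
    return nll_nats_total / (math.log(2) * lut_bytes)
```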
See the records folder README for full diff vs the ternary record + the
verifier output.
Great work, nice to see someone expanding beyond the xnor, binary and ternary stuff.
Thanks! Just looked at #920 — two main diffs from V1: a larger EMBED_DIM (which I'd already pushed further, since I'm on a bigger vocab), and an fp16→bf16 swap on the per-group dequant scales. As far as I understand, the bf16 motivation in the PR is specifically about the 1/(1-zero_frac) shrinkage correction amplifying fp16 rounding — which I don't use in quinary (it's clumsy with two non-zero magnitudes). So unclear to me whether it ports cleanly, but I'll ablate it on top of my stack and report back either way.
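For anyone not following the ternary PR, here is how I read the shrinkage correction, as a minimal sketch (an interpretation, not code from either PR): a per-group scale averaged over all positions shrinks as more weights are zeroed, and dividing by (1 - zero_frac) recovers the mean magnitude over the surviving weights.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=4096).astype(np.float32)

# Ternary quantization with roughly 60% of weights zeroed.
thresh = np.quantile(np.abs(w), 0.6)
q = np.sign(w) * (np.abs(w) > thresh)
zero_frac = float((q == 0).mean())

scale_all = np.abs(w * (q != 0)).mean()          # averaged over all positions: shrinks as zero_frac grows
scale_corrected = scale_all / (1.0 - zero_frac)  # the 1/(1 - zero_frac) shrinkage correction
scale_nonzero = np.abs(w[q != 0]).mean()         # mean magnitude over surviving weights

print(scale_corrected, scale_nonzero)            # agree up to float rounding
```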
@deniskurlov yep, those two are the main ones; the warmdown improvement might not matter to you as a third, and the quant-scale calculation change also might not matter, since with 5 values the scale has a smaller effect.
Quinary {-2,-1,0,+1,+2} weights (5-state, base-5 packed) + 10L (5 Encoder + 5 Decoder) 576d U-Net + Muon + 4× relu² MLP + Tied Embed (380→576) + Poly5 Softcap + YaRN 2048 + SP16384 BPE + FP8 QAT + 5-bit Scale Quant + Layout-Aware Per-Stream Archive + Score-First TTT (3 epochs, fp16-calibration-only)
bpb 1.1384 ± 0.0009 std (3-seed TTT mean) | 15.72 MB total artifact max (all 3 seeds FIT) | 8×H100 SXM, 7,800 steps in 599s + ~3.6 min TTT-eval
Results (3 seeds, 8×H100 SXM)
Non-record submission to track_non_record_16mb. Direct quinary {-2,-1,0,+1,+2} fork of @CiprianFlorin-Ifrim's 2026-03-24 ternary record (PR #640, 1.1570 sliding BPB), exploring whether one tick up the discrete-weight axis (3 → 5 levels per parameter) buys more than it costs at fixed compute and a 16 MB budget.

Architecture (config)
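A hedged reconstruction of the headline config, assembled only from the summary line above (field names are illustrative; the Hyperparameters defaults in train_gpt.py are the authoritative config):

```python
from dataclasses import dataclass

@dataclass
class HeadlineConfig:                # illustrative names, not train_gpt.py's Hyperparameters
    n_layers: int = 10               # 5 encoder + 5 decoder blocks (U-Net)
    d_model: int = 576
    vocab_size: int = 16384          # SP16384 BPE tokenizer
    mlp_expansion: int = 4           # 4x relu^2 MLP
    embed_factor_rank: int = 380     # factored tied embedding, 380 -> 576
    context_len: int = 2048          # YaRN 2048
    weight_levels: int = 5           # quinary {-2,-1,0,+1,+2}
    train_steps: int = 7800
    ttt_epochs: int = 3              # score-first TTT on fp16 calibration params only
```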
Package contents
records/track_non_record_16mb/2026-04-30_Quinary_53M_10L_576d_SP16384_TTT/:

- README.md — full Diff-from-ternary-record table, BPB-denominator audit methodology, validation accounting
- submission.json — metadata + per-seed seed_results with verified eval_tokens/eval_bytes
- setup.sh, run.sh, requirements.txt, train_gpt.py — reproducer (a bare torchrun reproduces the canonical config; Hyperparameters defaults match run.sh line-for-line)
- verify_bpb.py — standalone, reviewer-runnable BPB-LUT-vs-SentencePiece-decoder check (256-int32 header loader matching train_gpt.py, exact eval-slice check, BOS-delimited docwise decoder, tokenizer SHA-256, UNK guard)
- fineweb_16384_bpe.model + .vocab — bundled tokenizer (sha256 abaec140336…ac432a); also published at deniskurlov/parameter-golf-fineweb-sp16384
- quinary_seed{42,1337,7}.txt — three full per-seed training/TTT logs

Reproducer
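The canonical run should amount to a bare multi-GPU launch, presumably along the lines of `torchrun --standalone --nproc_per_node=8 train_gpt.py` (illustrative; run.sh is the authoritative invocation, and the Hyperparameters defaults are stated to match it line-for-line).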
Compliance
Acknowledgements