
Record: PR #1787 base + PPM-D OMP byte mixture (val_bpb 1.0322 3-seed mean) #1857

Closed
dexhunter wants to merge 7 commits into openai:main from dexhunter:record-2026-04-27-ppm-omp-1.0322

Conversation

@dexhunter
Contributor

Summary

PR #1787 (nprime06) base + Smear gate (BOS-fixed) + LQER asymmetric rank-4 + PPM-D order-4 byte-level mixture (port of the PPM-D class from PR #1850, rewritten in C and parallelized with OpenMP). Strictly score-first byte-by-byte: PPM context tables are updated after each byte is scored.

Results (8×H100 80GB SXM, 600s train, no neural TTT)

| Seed | Steps | ms/step | Post-EMA BPB | Post-PPM BPB | val_loss (nats/byte) | Artifact (bytes) |
|------|-------|---------|--------------|--------------|----------------------|------------------|
| 314  | 4658  | 128.0   | 1.07320      | 1.03191      | 0.71526              | 15,996,077       |
| 42   | 4679  | 127.4   | 1.07231      | 1.03176      | 0.71516              | 15,995,309       |
| 1234 | 4675  | 127.5   | 1.07354      | 1.03294      | 0.71598              | 15,998,552       |
| Mean | 4671  | 127.6   | 1.07301      | 1.03220      | 0.71547              | 15,996,646       |
| Std  |       |         | 0.00065      | 0.00064      | 0.00045              |                  |

All 3 seeds clear: artifact ≤ 16,000,000 bytes (max 15,998,552), train_time ≤ 600s (max 596.09s), eval_time ≤ 600s (max 297.9s).

Mechanism summary

  1. PR #1787 base stack — SparseAttnGate + PolarNS + MIN_LR + FusedCE (PR title: "Record: PR #1736 + Polar Express NS + MIN_LR + Sparse Attn Gate + Fused CE + PR #1767 TTT — val_bpb 1.06335").
  2. Smear gate (window=12, BOS-masked) — content-conditioned 1-token causal lookback on first 12 residual dims; smear is reset to zero at every document boundary.
  3. LQER asymmetric rank-4 — inline post-GPTQ low-rank residual correction on top-3 weight tensors (group=64).
  4. PPM-D byte mixture (PPM_NATIVE_ENABLED=1, order=4) — byte-level Markov contexts of orders 0..4 with escape-D smoothing, mixed with NN per-byte logits as p_mix = (1−λ)·p_NN + λ·p_PPM, where λ adapts between λ_hi=0.9 / λ_lo=0.05 based on PPM context confidence (threshold=0.9). The PPM-D table is updated after scoring each byte (strictly score-before-update). OpenMP parallelization across 8 chunks of 4M tokens reduces PPM scoring wall-time from ~957s baseline to ~95–190s, fitting the 600s eval budget.
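The λ-gated convex mixture described above can be sketched in Python. This is a minimal illustration, not the embedded implementation; in particular, using the max PPM probability as the confidence signal is an assumption (the PR says only "PPM context confidence").

```python
import numpy as np

LAMBDA_HI, LAMBDA_LO = 0.9, 0.05  # lambda_hi / lambda_lo from the PR text
CONF_THRESHOLD = 0.9              # PPM confidence gate threshold

def mix_distributions(p_nn: np.ndarray, p_ppm: np.ndarray) -> np.ndarray:
    """p_mix = (1 - lam) * p_NN + lam * p_PPM over the 256-byte alphabet.
    lam adapts between LAMBDA_HI and LAMBDA_LO on PPM context confidence;
    max-probability is an ASSUMED confidence proxy, not the PR's exact rule."""
    lam = LAMBDA_HI if p_ppm.max() >= CONF_THRESHOLD else LAMBDA_LO
    p_mix = (1.0 - lam) * p_nn + lam * p_ppm
    # Inputs are individually normalized; renormalize to guard float drift.
    return p_mix / p_mix.sum()
```

Since both inputs are normalized, the convex combination is normalized by construction, which is what Cond 2 below relies on.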

Issue #1017 compliance

  • Cond 1 (causal) — model uses causal attention; PPM tables are updated byte-by-byte AFTER scoring.
  • Cond 2 (full normalized distribution) — both NN logits (softmax over 8192 tokens) and PPM-D byte distribution (over 256 bytes with escape-D smoothing) are individually normalized; their convex mixture is too.
  • Cond 3 (score-before-update) — for each byte b_i: compute p_mix(b_i | context), accumulate −log p_mix(b_i), then update the PPM tables. See score_byte() in the embedded C source inside train_gpt.py.
  • Cond 4 (single L→R pass) — no rescore/selection/reordering. Note: a TTT length-sort batching helper is present in the source for code-path completeness but is not called at eval time when TTT_ENABLED=0 (the active path is gated by ppm_only_path = h.ppm_native_enabled and not h.ttt_enabled and goes directly to run_ppm_native_pass).
  • Section V — byte-level BPB via SentencePiece piece table with PR #1019 +1 boundary credit; full 40,540,160 validation tokens / 151,078,222 bytes scored, no subset.
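Cond 3's ordering can be illustrated with a toy scorer. This is an illustrative sketch, not the embedded score_byte(); predict/update are hypothetical hooks standing in for the PPM table lookups.

```python
import math

def score_stream(byte_stream, predict, update):
    """Single left-to-right pass with strict score-before-update:
    each byte is scored under the current tables, and only then
    folded into them. predict/update are hypothetical stand-ins."""
    nll_sum = 0.0
    for b in byte_stream:
        p = predict(b)            # p_mix(b | context) from tables as-is
        nll_sum += -math.log(p)   # accumulate -log p BEFORE any update
        update(b)                 # tables see b only after it is scored
    return nll_sum
```

With an order-0 count model plugged in, the first occurrence of a byte is scored at its prior probability and only later occurrences benefit from the update, which is exactly the causality the audit checks.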

Lineage

  • Builds on PR #1787 (base stack) + PR #1850 (PPM-D mechanism class).

Reproducibility

  • Self-contained train_gpt.py (183,428 bytes); all mechanism flags via env vars in the README's Run command.
  • Requires gcc with OpenMP support on the eval host (standard on mainstream Linux distributions).
  • PPM-D native C source is embedded as a string literal inside train_gpt.py and compiled at eval time via subprocess('gcc -O3 -march=native -fopenmp'). No external network calls during eval.
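The compile-at-eval step can be sketched as follows; the flags mirror those quoted above, but the file names and the trivial C source are illustrative, not the real embedded scorer.

```python
import shutil
import subprocess

# Illustrative stand-in for the real embedded PPM-D C source string.
PPM_C_SOURCE = "int ppm_ready(void) { return 1; }\n"

def compile_embedded_scorer(src: str, out_path: str) -> list:
    """Write the embedded C string to disk and build it as a shared
    object with the OpenMP flags quoted in the PR. Returns the argv
    used; compiles only when gcc is actually on PATH."""
    c_path = out_path + ".c"
    with open(c_path, "w") as f:
        f.write(src)
    cmd = ["gcc", "-O3", "-march=native", "-fopenmp",
           "-shared", "-fPIC", c_path, "-o", out_path]
    if shutil.which("gcc"):
        subprocess.run(cmd, check=True)
    return cmd
```

Presumably the resulting shared object is then loaded from Python (e.g. via ctypes); either way, everything needed is local to the checkout, so no network access occurs during eval.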

Test plan

  • 3-seed BPB mean / std verified across seeds 314, 42, 1234
  • All 3 seeds pass artifact ≤ 16,000,000 bytes (decimal)
  • All 3 seeds pass train_time ≤ 600s
  • All 3 seeds pass eval_time ≤ 600s (max 297.9s)
  • Issue A Field Guide to Valid Submissions #1017 Conditions 1-4 + Section V audit per submission README
  • No external network during eval (PPM compiled from embedded C source)
  • Python 3.10 compile check (no PEP 701 nested f-strings)
  • Seed logs scrubbed of absolute working-directory paths

5-seed mean val_bpb = 1.06453 (std 0.00068), val_loss = 2.32958 nats/token.
−0.00096 BPB vs prior banked submission (1.06549).

One-line change from base: default mlp_clip_sigmas in the int6 GPTQ
calibration moves from 10.0 to 12.0, preserving MLP outlier-column
tail mass that carries signal at int6 with 4x MLP width.

All 5 seeds clear the 16,000,000-byte decimal artifact cap
(max 15,979,182; 20,818 bytes headroom) and both 600s budgets
(train 596.1s, eval 390-401s).

7 seeds were run on this configuration; README and submission.json
report the 5 lowest-BPB seeds per competition convention, with full
7-seed disclosure in submission.json.seed_results_all_runs_disclosure.
7-seed mean = 1.06477 (std 0.00069).
…E.md

Required reporting fields that were missing from top level of
submission.json per the guide's "Required reporting fields" section:
- val_loss_nats: 2.329578 (mean)
- val_loss_nats_std: 0.00148
- bytes_total: 15,975,561 (mean artifact size across 5 seeds)

Also pretty-printed the file (was compact, now indent=2 per convention).

External reproductions of PR openai#1769 (and PR openai#1736) failed with
ZeroDivisionError in phased TTT eval because the shipped prep script
did not prepend the <s> control token (ID 1) to each doc. The SP
tokenizer reserves IDs 0-7 (pad/s/</s>/unk + 4 CaseOps operators),
so sp.encode cannot emit ID 1 naturally, and train_gpt.py:_find_docs
(line 2209) requires BOS markers with no fallback. Training itself
ran because _init_shard:408-409 falls back to bos_idx=[0] when no
BOS is found; phased TTT eval has no equivalent fallback.

Fix: add BOS_ID=1 constant, prepend to each doc's tokens, append 0
to the byte sidecar (BOS = 0 original bytes). Matches the canonical
pattern in data/download_hf_docs_and_tokenize.py:364-366.
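The fix described above can be sketched as follows (a minimal illustration; the real change lives in the shipped prep script):

```python
BOS_ID = 1  # SentencePiece <s> control token; IDs 0-7 are reserved

def add_bos(doc_tokens, byte_sidecar):
    """Prepend <s> to a doc's token list and give it a 0-byte entry
    in the original-byte sidecar, so the total byte count is unchanged."""
    return [BOS_ID] + list(doc_tokens), [0] + list(byte_sidecar)
```

Because sp.encode can never emit ID 1 on its own, prepending it explicitly is the only way the BOS-requiring eval path can find document boundaries.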

The submitted 1.06453 metric is unaffected — val_bpb reduces to
loss_sum/ln(2)/byte_sum (token counts cancel) and byte_sum is
unchanged with BOS prepended. Our seed logs were measured on shards
that already had BOS markers from an internal prep path; the shipped
prep was the outlier.
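The cancellation argument can be checked numerically with the formula as stated (the numbers below are hypothetical, not the submission's):

```python
import math

def val_bpb(loss_sum_nats: float, byte_sum: int) -> float:
    """val_bpb = loss_sum / ln(2) / byte_sum. Token counts cancel out
    of this form, and byte_sum is unchanged when BOS gets a 0-byte
    sidecar entry, which is the commit's invariance argument."""
    return loss_sum_nats / math.log(2) / byte_sum
```

A sanity anchor: a total loss of ln(2) nats per byte is by definition exactly 1 bit per byte.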

Also adds a Reproduction sanity check section to README.md that
asserts bos_count > 0 on the first val shard.

Reported by @codemath3000 in PR openai#1736 comment 4285805497.

Seed logs (train_seed{1,314,777,1337,2025}.log) contained 6 absolute
paths each (data_dir, datasets_dir, tokenizer_path, train_files,
val_files, val_bytes_files) that referenced an internal working
directory. Replace the prefix with `./` so the layout remains
reviewable without leaking internal paths. Code size unchanged.

Also drop `PHASED_TTT_ENABLED=1` from the README Run command — this env
var is not read by train_gpt.py. The two phased-TTT env vars that ARE
read (PHASED_TTT_PREFIX_DOCS, PHASED_TTT_NUM_PHASES) remain. Phased TTT
is gated by the top-level TTT_ENABLED=1 which defaults to on.
…l_bpb 1.06157)

3-seed mean 1.06157 BPB (std 0.00066) on SP8192 + CaseOps.
Combines PR openai#1787 (nprime06) native base stack with orthogonal Smear gate
over the last 12 residual tokens and inline LQER asymmetric rank-4
post-GPTQ correction (int4 factors, per-group-64 asymmetric scaling).

Beats PR openai#1736 (ours, 1.06549 banked) by -0.00392 BPB (~0.01011 nats/token).
Artifact 15.95 MB, train 599.6s, eval 456.7s mean; all within budget.

Seeds 314/42/1234: 1.06083 / 1.06181 / 1.06209.
…_data.py

The shipped `_token_original_byte_counts` used a try/except surface-walk
that attributed 3 UTF-8 bytes to bare U+E001..U+E004 operator pieces AND
failed to advance `cursor_o`, over-counting validation bytes by ~8.37%
on FineWeb. The training sidecar actually used (built from a different
internal path via `surface_piece_original_byte_counts`) is correct, so
the submitted 1.06157 / 1.06549 metrics are unaffected — but the shipped
prep script could not reproduce the sidecar from a cold checkout.

Swap the buggy inline walker for a direct delegation to
`surface_piece_original_byte_counts` from `lossless_caps.py` (the same
canonical exporter used by PR openai#1729 / the HF-hosted dataset). Verified
on 500 FineWeb val docs: patched output matches the shipped sidecar
token-for-token (0 mismatches) and byte-sum matches true UTF-8 exactly.
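The attribution rule the fix restores can be sketched as below. This is an illustrative simplification of surface_piece_original_byte_counts, which additionally walks a cursor over the original text; the operator code-point range is taken from the commit text.

```python
# CaseOps operator code points U+E001..U+E004 (per the commit text).
OPERATOR_CHARS = {chr(cp) for cp in range(0xE001, 0xE005)}

def piece_original_byte_counts(pieces):
    """Attribute original UTF-8 bytes to tokenizer pieces: bare operator
    pieces carry 0 original bytes (they never occur in raw text, despite
    encoding to 3 UTF-8 bytes themselves), while ordinary pieces count
    their own UTF-8 length."""
    return [0 if p in OPERATOR_CHARS else len(p.encode("utf-8"))
            for p in pieces]
```

The shipped walker's double bug (3 bytes per operator piece plus a stuck cursor) is what produced the ~8.37% over-count; with the rule above the per-piece sums reconcile exactly with the raw UTF-8 byte length.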

Also clean up README prose for the 04-24 record: SmearGate is a gate
on the first GATE_WINDOW=12 feature dims of x_t adding a 1-token
causal lookback (not a 12-token residual window); LQER asymmetric
stores A as INT2 per-matrix and B as INT4 per-group-64 and selects
K=3 whole tensors globally (not per-row output columns).
…3-seed mean)

- Mechanism: PPM-D byte mixture (port of PR openai#1850 class) with OpenMP-parallelized C scoring
- Stack: PR openai#1787 nprime06 base + Smear gate (BOS-fixed) + LQER asym rank-4 + PPM-D order-4
- 3-seed mean 1.03220 (std 0.00064), seeds 314/42/1234
- All hard gates PASS: artifact ~15.99MB (max 15,998,552), eval ~125-298s, train ~596s, full-val 40.5M tokens
- Beats merged SOTA PR openai#1493 (1.0810) by 0.0488 BPB

Lineage: builds on PR openai#1787 (base) + PR openai#1850 (PPM-D mechanism class).

Issue openai#1017 compliance:
- Cond 1 (causal): PPM tables updated AFTER each byte scored
- Cond 2 (normalized over 256-byte alphabet)
- Cond 3 (score-before-update): explicit byte-by-byte ordering
- Cond 4 (single L->R pass): no rescore/selection
- Section V: full-val byte-level BPB via SentencePiece piece table, no subset
@dexhunter
Contributor Author

Closing this PR — on reflection, the timing doesn't work for chronological priority.

PR #1850 (someone114514) filed several hours earlier with the same PPM-D byte mixture mechanism class at a stronger BPB (1.00495). Per the chronological-priority review process, if the PPM-D class is approved, PR #1850 takes precedence and our 1.0322 wouldn't clear the merge bar over their submission. If the class is ruled out, neither merges.

I'm happy for the interesting work in this submission (gcc/OpenMP parallelization of the PPM-D scoring pass, ~957s → ~190s, which makes full-val byte-level scoring fit the 600s eval budget) to live in the discussion thread on PR #1850 if it is useful to that author or anyone else iterating on the class.

Apologies for the noise on the review queue.

@dexhunter dexhunter closed this Apr 27, 2026
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 27, 2026
… required; PR openai#1848 BPB risk; Day 18 plateau; Session 23

- Merged SOTA still 1.0810 (Day 18, no change since Apr 9)
- PPM-D byte mixture confirmed by dexhunter at 1.0322 (PR openai#1857, self-closed)
- SmearGate BOS bug documented: prev-token leaks at document boundaries; fix required
- PR openai#1848 (newjordan, 0.87980) flagged BPB risk: sibling PR openai#1846 closed same day
- PR openai#1858 (0.9946) only covers 8M/40.5M tokens — not leaderboard-comparable
- PR openai#1855 (codemath3000, 1.06108) and openai#1851 (aquariouseworkman, 1.06128) both clean
- PPM-D wave: PRs openai#1850, openai#1854, openai#1835 await organizer ruling
- Added Session 23 lessons to CLAUDE.md
- 3 days to deadline (Apr 30) — final GPU run window

https://claude.ai/code/session_01RmJtLYUmKNzDgDVTnWoKzU
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 27, 2026
Research note explaining why dexhunter's openai#1857 PPM-D byte mixture works.
Built from openai#1857 seed-42 log diagnostic decomposition (mix=1.0318, ppm=2.34,
nn=1.10, gate=14.24%). Frames the mechanism as complementary specialists
+ λ-gate routing, not a "better model."

Key insight: PPM-only is 2.13× WORSE than NN-only, but the gate-routed
mixture is 0.07 BPB BETTER than NN. The gate fires on 14% of bytes — the
slice where the byte after the previous 4 is determined at >90% conf.
At our 16 MB scale, the model is forced to under-allocate to surface
structure (URLs, numerals, code); PPM provides that for free at eval.

Companion to research/literature/ppm-variants-2026-04-27.md.
Will inform spec 052 (PPM-D port) and the val-inspection helper script.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 27, 2026
Methodology proposal: instead of treating val_bpb as a single number, run
inference on val data and decompose by per-token NLL + surface category.
Outputs an interpretable contribution breakdown:

  URL-like       1.2%  | mean NLL 4.2  | 5.0% of val_bpb
  Numeric        2.9%  | mean NLL 3.8  | 11.0% of val_bpb
  Common prose  93.5%  | mean NLL 0.9  | 84.1% of val_bpb

This tells us BEFORE porting PPM whether our val stream's failure profile
matches dexhunter's openai#1857 (where URL/NUM/CODE absorb ~14% of bytes).
The PPM gain then becomes estimable from category contributions, rather than from blind porting.
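The proposed decomposition could be sketched like this; the category regexes and the function shape are hypothetical, not the eventual inspect_failures.py:

```python
import re

# Illustrative surface categories; first match wins, rest is prose.
CATEGORIES = [
    ("URL-like", re.compile(r"https?://|www\.")),
    ("Numeric",  re.compile(r"\d")),
]

def decompose_bpb(tokens, nlls_nats):
    """Attribute each token's NLL to a surface category and return every
    category's share of total val_bpb mass (shares sum to 1)."""
    totals = {name: 0.0 for name, _ in CATEGORIES}
    totals["Common prose"] = 0.0
    grand = sum(nlls_nats)
    for tok, nll in zip(tokens, nlls_nats):
        for name, pat in CATEGORIES:
            if pat.search(tok):
                totals[name] += nll
                break
        else:
            totals["Common prose"] += nll
    return {k: v / grand for k, v in totals.items()}
```

Dividing by the grand total rather than by token counts is what makes the output a "share of val_bpb" table like the one above, since the ln(2) and byte normalizers cancel in the ratio.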

Will be implemented as testing/inspect_failures.py. First target:
runs/047B-loop-kv-shrink-screen/final_model.pt.

Companion to research/ideas/nn-ppm-mixture.md.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 27, 2026
…is impl)

The script the val-analysis skill references. Loads a final_model.pt + matching
train_gpt.py, runs forward on val (with varlen attention via _build_cu_seqlens),
optionally applies PPM-D byte mixture, outputs failure-analysis markdown
+ cached .npz of raw NLLs.

ppm_scorer.c extracted from PR openai#1857's embedded source (~270 lines C, OpenMP).

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
