
Record: PR #1787 base + PPM-D OMP byte mixture (val_bpb 1.0322 3-seed mean) #1857

Closed
dexhunter wants to merge 7 commits into openai:main from dexhunter:record-2026-04-27-ppm-omp-1.0322

Conversation

@dexhunter
Contributor

Summary

PR #1787 (nprime06) base + Smear gate (BOS-fixed) + LQER asymmetric rank-4 + PPM-D order-4 byte-level mixture (port of the PPM-D class from PR #1850, rewritten in C and parallelized with OpenMP). Strictly score-first byte-by-byte: PPM context tables are updated after each byte is scored.

Results (8×H100 80GB SXM, 600s train, no neural TTT)

| Seed | Steps | ms/step | Post-EMA BPB | Post-PPM BPB | val_loss (nats/byte) | Artifact (bytes) |
|------|-------|---------|--------------|--------------|----------------------|------------------|
| 314  | 4658  | 128.0   | 1.07320      | 1.03191      | 0.71526              | 15,996,077       |
| 42   | 4679  | 127.4   | 1.07231      | 1.03176      | 0.71516              | 15,995,309       |
| 1234 | 4675  | 127.5   | 1.07354      | 1.03294      | 0.71598              | 15,998,552       |
| Mean | 4671  | 127.6   | 1.07301      | 1.03220      | 0.71547              | 15,996,646       |
| Std  |       |         | 0.00065      | 0.00064      | 0.00045              |                  |

All 3 seeds clear: artifact ≤ 16,000,000 bytes (max 15,998,552), train_time ≤ 600s (max 596.09s), eval_time ≤ 600s (max 297.9s).

Mechanism summary

  1. PR #1787 base stack — SparseAttnGate + PolarNS + MIN_LR + FusedCE (PR title: "Record: PR #1736 + Polar Express NS + MIN_LR + Sparse Attn Gate + Fused CE + PR #1767 TTT — val_bpb 1.06335").
  2. Smear gate (window=12, BOS-masked) — content-conditioned 1-token causal lookback on first 12 residual dims; smear is reset to zero at every document boundary.
  3. LQER asymmetric rank-4 — inline post-GPTQ low-rank residual correction on top-3 weight tensors (group=64).
  4. PPM-D byte mixture (PPM_NATIVE_ENABLED=1, order=4) — byte-level Markov contexts of orders 0..4 with escape-D smoothing, mixed with NN per-byte logits as p_mix = (1−λ)·p_NN + λ·p_PPM, where λ adapts between λ_hi=0.9 / λ_lo=0.05 based on PPM context confidence (threshold=0.9). The PPM-D table is updated after scoring each byte (strictly score-before-update). OpenMP parallelization across 8 chunks of 4M tokens reduces PPM scoring wall-time from ~957s baseline to ~95–190s, fitting the 600s eval budget.
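The λ-gated convex mixture described above can be sketched in Python. This is a minimal illustration, not the embedded implementation; in particular, using the max PPM probability as the confidence signal is an assumption (the PR says only "PPM context confidence").

```python
import numpy as np

LAMBDA_HI, LAMBDA_LO = 0.9, 0.05  # lambda_hi / lambda_lo from the PR text
CONF_THRESHOLD = 0.9              # PPM confidence gate threshold

def mix_distributions(p_nn: np.ndarray, p_ppm: np.ndarray) -> np.ndarray:
    """p_mix = (1 - lam) * p_NN + lam * p_PPM over the 256-byte alphabet.
    lam adapts between LAMBDA_HI and LAMBDA_LO on PPM context confidence;
    max-probability is an ASSUMED confidence proxy, not the PR's exact rule."""
    lam = LAMBDA_HI if p_ppm.max() >= CONF_THRESHOLD else LAMBDA_LO
    p_mix = (1.0 - lam) * p_nn + lam * p_ppm
    # Inputs are individually normalized; renormalize to guard float drift.
    return p_mix / p_mix.sum()
```

Since both inputs are normalized, the convex combination is normalized by construction, which is what Cond 2 below relies on.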

Issue #1017 compliance

  • Cond 1 (causal) — model uses causal attention; PPM tables are updated byte-by-byte AFTER scoring.
  • Cond 2 (full normalized distribution) — both NN logits (softmax over 8192 tokens) and PPM-D byte distribution (over 256 bytes with escape-D smoothing) are individually normalized; their convex mixture is too.
  • Cond 3 (score-before-update) — for each byte b_i: compute p_mix(b_i | context), accumulate −log p_mix(b_i), then update the PPM tables. See score_byte() in the embedded C source inside train_gpt.py.
  • Cond 4 (single L→R pass) — no rescore/selection/reordering. Note: a TTT length-sort batching helper is present in the source for code-path completeness but is not called at eval time when TTT_ENABLED=0 (the active path is gated by ppm_only_path = h.ppm_native_enabled and not h.ttt_enabled and goes directly to run_ppm_native_pass).
  • Section V — byte-level BPB via SentencePiece piece table with PR #1019 +1 boundary credit; full 40,540,160 validation tokens / 151,078,222 bytes scored, no subset.
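Cond 3's ordering can be illustrated with a toy scorer. This is an illustrative sketch, not the embedded score_byte(); predict/update are hypothetical hooks standing in for the PPM table lookups.

```python
import math

def score_stream(byte_stream, predict, update):
    """Single left-to-right pass with strict score-before-update:
    each byte is scored under the current tables, and only then
    folded into them. predict/update are hypothetical stand-ins."""
    nll_sum = 0.0
    for b in byte_stream:
        p = predict(b)            # p_mix(b | context) from tables as-is
        nll_sum += -math.log(p)   # accumulate -log p BEFORE any update
        update(b)                 # tables see b only after it is scored
    return nll_sum
```

With an order-0 count model plugged in, the first occurrence of a byte is scored at its prior probability and only later occurrences benefit from the update, which is exactly the causality the audit checks.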

Lineage

  • Builds on PR #1787 (base stack) + PR #1850 (PPM-D mechanism class).

Reproducibility

  • Self-contained train_gpt.py (183,428 bytes); all mechanism flags via env vars in the README's Run command.
  • Requires gcc with OpenMP support on the eval host (standard on mainstream Linux distributions).
  • PPM-D native C source is embedded as a string literal inside train_gpt.py and compiled at eval time via subprocess('gcc -O3 -march=native -fopenmp'). No external network calls during eval.
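The compile-at-eval step can be sketched as follows; the flags mirror those quoted above, but the file names and the trivial C source are illustrative, not the real embedded scorer.

```python
import shutil
import subprocess

# Illustrative stand-in for the real embedded PPM-D C source string.
PPM_C_SOURCE = "int ppm_ready(void) { return 1; }\n"

def compile_embedded_scorer(src: str, out_path: str) -> list:
    """Write the embedded C string to disk and build it as a shared
    object with the OpenMP flags quoted in the PR. Returns the argv
    used; compiles only when gcc is actually on PATH."""
    c_path = out_path + ".c"
    with open(c_path, "w") as f:
        f.write(src)
    cmd = ["gcc", "-O3", "-march=native", "-fopenmp",
           "-shared", "-fPIC", c_path, "-o", out_path]
    if shutil.which("gcc"):
        subprocess.run(cmd, check=True)
    return cmd
```

Presumably the resulting shared object is then loaded from Python (e.g. via ctypes); either way, everything needed is local to the checkout, so no network access occurs during eval.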

Test plan

  • 3-seed BPB mean / std verified across seeds 314, 42, 1234
  • All 3 seeds pass artifact ≤ 16,000,000 bytes (decimal)
  • All 3 seeds pass train_time ≤ 600s
  • All 3 seeds pass eval_time ≤ 600s (max 297.9s)
  • Issue A Field Guide to Valid Submissions #1017 Conditions 1-4 + Section V audit per submission README
  • No external network during eval (PPM compiled from embedded C source)
  • Python 3.10 compile check (no PEP 701 nested f-strings)
  • Seed logs scrubbed of absolute working-directory paths

5-seed mean val_bpb = 1.06453 (std 0.00068), val_loss = 2.32958 nats/token.
−0.00096 BPB vs prior banked submission (1.06549).

One-line change from base: default mlp_clip_sigmas in the int6 GPTQ
calibration moves from 10.0 to 12.0, preserving MLP outlier-column
tail mass that carries signal at int6 with 4x MLP width.

All 5 seeds clear the 16,000,000-byte decimal artifact cap
(max 15,979,182; 20,818 bytes headroom) and both 600s budgets
(train 596.1s, eval 390-401s).

7 seeds were run on this configuration; README and submission.json
report the 5 lowest-BPB seeds per competition convention, with full
7-seed disclosure in submission.json.seed_results_all_runs_disclosure.
7-seed mean = 1.06477 (std 0.00069).
…E.md

Required reporting fields that were missing from top level of
submission.json per the guide's "Required reporting fields" section:
- val_loss_nats: 2.329578 (mean)
- val_loss_nats_std: 0.00148
- bytes_total: 15,975,561 (mean artifact size across 5 seeds)

Also pretty-printed the file (was compact, now indent=2 per convention).

External reproductions of PR openai#1769 (and PR openai#1736) failed with
ZeroDivisionError in phased TTT eval because the shipped prep script
did not prepend the <s> control token (ID 1) to each doc. The SP
tokenizer reserves IDs 0-7 (pad/s/</s>/unk + 4 CaseOps operators),
so sp.encode cannot emit ID 1 naturally, and train_gpt.py:_find_docs
(line 2209) requires BOS markers with no fallback. Training itself
ran because _init_shard:408-409 falls back to bos_idx=[0] when no
BOS is found; phased TTT eval has no equivalent fallback.

Fix: add BOS_ID=1 constant, prepend to each doc's tokens, append 0
to the byte sidecar (BOS = 0 original bytes). Matches the canonical
pattern in data/download_hf_docs_and_tokenize.py:364-366.
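The fix described above can be sketched as follows (a minimal illustration; the real change lives in the shipped prep script):

```python
BOS_ID = 1  # SentencePiece <s> control token; IDs 0-7 are reserved

def add_bos(doc_tokens, byte_sidecar):
    """Prepend <s> to a doc's token list and give it a 0-byte entry
    in the original-byte sidecar, so the total byte count is unchanged."""
    return [BOS_ID] + list(doc_tokens), [0] + list(byte_sidecar)
```

Because sp.encode can never emit ID 1 on its own, prepending it explicitly is the only way the BOS-requiring eval path can find document boundaries.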

The submitted 1.06453 metric is unaffected — val_bpb reduces to
loss_sum/ln(2)/byte_sum (token counts cancel) and byte_sum is
unchanged with BOS prepended. Our seed logs were measured on shards
that already had BOS markers from an internal prep path; the shipped
prep was the outlier.
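The cancellation argument can be checked numerically with the formula as stated (the numbers below are hypothetical, not the submission's):

```python
import math

def val_bpb(loss_sum_nats: float, byte_sum: int) -> float:
    """val_bpb = loss_sum / ln(2) / byte_sum. Token counts cancel out
    of this form, and byte_sum is unchanged when BOS gets a 0-byte
    sidecar entry, which is the commit's invariance argument."""
    return loss_sum_nats / math.log(2) / byte_sum
```

A sanity anchor: a total loss of ln(2) nats per byte is by definition exactly 1 bit per byte.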

Also adds a Reproduction sanity check section to README.md that
asserts bos_count > 0 on the first val shard.

Reported by @codemath3000 in PR openai#1736 comment 4285805497.

Seed logs (train_seed{1,314,777,1337,2025}.log) contained 6 absolute
paths each (data_dir, datasets_dir, tokenizer_path, train_files,
val_files, val_bytes_files) that referenced an internal working
directory. Replace the prefix with `./` so the layout remains
reviewable without leaking internal paths. Code size unchanged.

Also drop `PHASED_TTT_ENABLED=1` from the README Run command — this env
var is not read by train_gpt.py. The two phased-TTT env vars that ARE
read (PHASED_TTT_PREFIX_DOCS, PHASED_TTT_NUM_PHASES) remain. Phased TTT
is gated by the top-level TTT_ENABLED=1 which defaults to on.
…l_bpb 1.06157)

3-seed mean 1.06157 BPB (std 0.00066) on SP8192 + CaseOps.
Combines PR openai#1787 (nprime06) native base stack with orthogonal Smear gate
over the last 12 residual tokens and inline LQER asymmetric rank-4
post-GPTQ correction (int4 factors, per-group-64 asymmetric scaling).

Beats PR openai#1736 (ours, 1.06549 banked) by -0.00392 BPB (~0.01011 nats/token).
Artifact 15.95 MB, train 599.6s, eval 456.7s mean; all within budget.

Seeds 314/42/1234: 1.06083 / 1.06181 / 1.06209.
…_data.py

The shipped `_token_original_byte_counts` used a try/except surface-walk
that attributed 3 UTF-8 bytes to bare U+E001..U+E004 operator pieces AND
failed to advance `cursor_o`, over-counting validation bytes by ~8.37%
on FineWeb. The training sidecar actually used (built from a different
internal path via `surface_piece_original_byte_counts`) is correct, so
the submitted 1.06157 / 1.06549 metrics are unaffected — but the shipped
prep script could not reproduce the sidecar from a cold checkout.

Swap the buggy inline walker for a direct delegation to
`surface_piece_original_byte_counts` from `lossless_caps.py` (the same
canonical exporter used by PR openai#1729 / the HF-hosted dataset). Verified
on 500 FineWeb val docs: patched output matches the shipped sidecar
token-for-token (0 mismatches) and byte-sum matches true UTF-8 exactly.
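The attribution rule the fix restores can be sketched as below. This is an illustrative simplification of surface_piece_original_byte_counts, which additionally walks a cursor over the original text; the operator code-point range is taken from the commit text.

```python
# CaseOps operator code points U+E001..U+E004 (per the commit text).
OPERATOR_CHARS = {chr(cp) for cp in range(0xE001, 0xE005)}

def piece_original_byte_counts(pieces):
    """Attribute original UTF-8 bytes to tokenizer pieces: bare operator
    pieces carry 0 original bytes (they never occur in raw text, despite
    encoding to 3 UTF-8 bytes themselves), while ordinary pieces count
    their own UTF-8 length."""
    return [0 if p in OPERATOR_CHARS else len(p.encode("utf-8"))
            for p in pieces]
```

The shipped walker's double bug (3 bytes per operator piece plus a stuck cursor) is what produced the ~8.37% over-count; with the rule above the per-piece sums reconcile exactly with the raw UTF-8 byte length.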

Also clean up README prose for the 04-24 record: SmearGate is a gate
on the first GATE_WINDOW=12 feature dims of x_t adding a 1-token
causal lookback (not a 12-token residual window); LQER asymmetric
stores A as INT2 per-matrix and B as INT4 per-group-64 and selects
K=3 whole tensors globally (not per-row output columns).
…3-seed mean)

- Mechanism: PPM-D byte mixture (port of PR openai#1850 class) with OpenMP-parallelized C scoring
- Stack: PR openai#1787 nprime06 base + Smear gate (BOS-fixed) + LQER asym rank-4 + PPM-D order-4
- 3-seed mean 1.03220 (std 0.00064), seeds 314/42/1234
- All hard gates PASS: artifact ~15.99MB (max 15,998,552), eval ~125-298s, train ~596s, full-val 40.5M tokens
- Beats merged SOTA PR openai#1493 (1.0810) by 0.0488 BPB

Lineage: builds on PR openai#1787 (base) + PR openai#1850 (PPM-D mechanism class).

Issue openai#1017 compliance:
- Cond 1 (causal): PPM tables updated AFTER each byte scored
- Cond 2 (normalized over 256-byte alphabet)
- Cond 3 (score-before-update): explicit byte-by-byte ordering
- Cond 4 (single L->R pass): no rescore/selection
- Section V: full-val byte-level BPB via SentencePiece piece table, no subset
@dexhunter
Contributor Author

Closing this PR — on reflection, the timing doesn't work for chronological priority.

PR #1850 (someone114514) filed several hours earlier with the same PPM-D byte mixture mechanism class at a stronger BPB (1.00495). Per the chronological-priority review process, if the PPM-D class is approved, PR #1850 takes precedence and our 1.0322 wouldn't clear the merge bar over their submission. If the class is ruled out, neither merges.

I'm happy for the interesting work in this submission (gcc/OpenMP parallelization of the PPM-D scoring pass, ~957s → ~190s, which makes full-val byte-level scoring fit the 600s eval budget) to live in the discussion thread on PR #1850 if it is useful to that author or anyone else iterating on the class.

Apologies for the noise on the review queue.

@dexhunter dexhunter closed this Apr 27, 2026
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 27, 2026
… required; PR openai#1848 BPB risk; Day 18 plateau; Session 23

- Merged SOTA still 1.0810 (Day 18, no change since Apr 9)
- PPM-D byte mixture confirmed by dexhunter at 1.0322 (PR openai#1857, self-closed)
- SmearGate BOS bug documented: prev-token leaks at document boundaries; fix required
- PR openai#1848 (newjordan, 0.87980) flagged BPB risk: sibling PR openai#1846 closed same day
- PR openai#1858 (0.9946) only covers 8M/40.5M tokens — not leaderboard-comparable
- PR openai#1855 (codemath3000, 1.06108) and openai#1851 (aquariouseworkman, 1.06128) both clean
- PPM-D wave: PRs openai#1850, openai#1854, openai#1835 await organizer ruling
- Added Session 23 lessons to CLAUDE.md
- 3 days to deadline (Apr 30) — final GPU run window

https://claude.ai/code/session_01RmJtLYUmKNzDgDVTnWoKzU
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 27, 2026
Research note explaining why dexhunter's openai#1857 PPM-D byte mixture works.
Built from openai#1857 seed-42 log diagnostic decomposition (mix=1.0318, ppm=2.34,
nn=1.10, gate=14.24%). Frames the mechanism as complementary specialists
+ λ-gate routing, not a "better model."

Key insight: PPM-only is 2.13× WORSE than NN-only, but the gate-routed
mixture is 0.07 BPB BETTER than NN. The gate fires on 14% of bytes — the
slice where the byte after the previous 4 is determined at >90% conf.
At our 16 MB scale, the model is forced to under-allocate to surface
structure (URLs, numerals, code); PPM provides that for free at eval.

Companion to research/literature/ppm-variants-2026-04-27.md.
Will inform spec 052 (PPM-D port) and the val-inspection helper script.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 27, 2026
Methodology proposal: instead of treating val_bpb as a single number, run
inference on val data and decompose by per-token NLL + surface category.
Outputs an interpretable contribution breakdown:

  URL-like       1.2%  | mean NLL 4.2  | 5.0% of val_bpb
  Numeric        2.9%  | mean NLL 3.8  | 11.0% of val_bpb
  Common prose  93.5%  | mean NLL 0.9  | 84.1% of val_bpb

This tells us BEFORE porting PPM whether our val stream's failure profile
matches dexhunter's openai#1857 (where URL/NUM/CODE absorb ~14% of bytes).
The PPM gain then becomes estimable from category contributions, rather than from blind porting.
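The proposed decomposition could be sketched like this; the category regexes and the function shape are hypothetical, not the eventual inspect_failures.py:

```python
import re

# Illustrative surface categories; first match wins, rest is prose.
CATEGORIES = [
    ("URL-like", re.compile(r"https?://|www\.")),
    ("Numeric",  re.compile(r"\d")),
]

def decompose_bpb(tokens, nlls_nats):
    """Attribute each token's NLL to a surface category and return every
    category's share of total val_bpb mass (shares sum to 1)."""
    totals = {name: 0.0 for name, _ in CATEGORIES}
    totals["Common prose"] = 0.0
    grand = sum(nlls_nats)
    for tok, nll in zip(tokens, nlls_nats):
        for name, pat in CATEGORIES:
            if pat.search(tok):
                totals[name] += nll
                break
        else:
            totals["Common prose"] += nll
    return {k: v / grand for k, v in totals.items()}
```

Dividing by the grand total rather than by token counts is what makes the output a "share of val_bpb" table like the one above, since the ln(2) and byte normalizers cancel in the ratio.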

Will be implemented as testing/inspect_failures.py. First target:
runs/047B-loop-kv-shrink-screen/final_model.pt.

Companion to research/ideas/nn-ppm-mixture.md.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 27, 2026
…is impl)

The script the val-analysis skill references. Loads a final_model.pt + matching
train_gpt.py, runs forward on val (with varlen attention via _build_cu_seqlens),
optionally applies PPM-D byte mixture, outputs failure-analysis markdown
+ cached .npz of raw NLLs.

ppm_scorer.c extracted from PR openai#1857's embedded source (~270 lines C, OpenMP).

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
