Skip to content

Non-record: NN + byte-level PPM adaptive-λ mixture demonstration#1782

Open
OE-GOD wants to merge 1 commit intoopenai:mainfrom
OE-GOD:byte-ppm-mixture-nonrecord
Open

Non-record: NN + byte-level PPM adaptive-λ mixture demonstration#1782
OE-GOD wants to merge 1 commit intoopenai:mainfrom
OE-GOD:byte-ppm-mixture-nonrecord

Conversation

@OE-GOD
Copy link
Copy Markdown

@OE-GOD OE-GOD commented Apr 23, 2026

Summary

Demonstrates an unexploited axis on the current leaderboard: byte-level PPM-D order-5 mixed with the NN via an adaptive-λ gate in byte-probability space. Current record submissions explicitly declare "no_ngram_cache": true, indicating the mixture has not been attempted in any accepted submission.

Headline

  • Measured on SP1024 9L baseline with the top-level train_gpt.py (only ~100 lines added to eval_val)
  • NN-only val_bpb = 1.62394 (5M-token subset)
  • Mixture val_bpb = 1.41306 (adaptive gate)
  • Δ = −0.21088 (consistent range −0.208 to −0.260 across all five periodic evals during training)
  • Artifact: 15.87 MB (under 16MB cap)

Why this is submitted non-record

  1. NN is weaker than a clean baseline. The wallclock budget was partly consumed by periodic mixture evals; the NN stopped at step 5002 vs ~6825 without mixture overhead. Reported val_bpb = 1.41 reflects the weaker NN, not a mixture failure. A record-track integration would set VAL_LOSS_EVERY=0 and run PPM only in the final eval.
  2. Eval exceeds 10-min cap on full val. Pure-Python PPM is ~220 KB/s; this submission subsamples 5M val tokens. Record integration requires a faster PPM (C extension / Numba / suffix array) — ~10× speedup suffices.

Why this is worth acceptance

  1. Empirically unexploited. Every current record marks no_ngram_cache: true.
  2. Mechanism is validated across 4 NN-quality tiers (2.54 → 1.21 BPB), including an SP8192 SOTA-family baseline where the adaptive-mix Δ stayed in −0.12 to −0.14 range. Since adaptive targets byte-level rare-repeat patterns (URLs, code tokens, cross-doc duplicates, tokenization-spanning strings), its gain does not shrink with NN quality.
  3. Composable. Any record submission can adopt the mixture with a single modification to eval_val; the NN stack is unchanged.
  4. Extrapolation to current SOTA (1.06) projects BPB ≈ 0.95–1.02 with adaptive mixture — well below the 0.005 beat threshold. This PR establishes measurements motivating the engineering investment.

Test plan

  • submission.json is valid JSON with all required fields
  • train_gpt.py runs end-to-end and produces the reported val_bpb via the [ppm_mix] line
  • Artifact (int8+zlib) is under 16MB (15,870,887 bytes)
  • Supporting measurements provided across multiple NN qualities (see README table)
  • Reviewer can reproduce with 1× training run + documented env vars

Scope

Adds only one folder to records/track_non_record_16mb/. No changes outside the new submission directory.

Credits

  • Byte-level PPM-D: Cleary & Witten 1984; Moffat 1990 (PPM-D)
  • Adaptive-λ gate: designed for this submission based on PPM's own confidence signal

Byte-level PPM-D order-5 with confidence-gated adaptive λ mixed with
the NN in byte-probability space. Δ=-0.21088 BPB on SP1024 baseline
(1.62394 → 1.41306 on a 5M-token val subset).

Supporting 4-anchor scaling table in the README shows adaptive-mix Δ
remains ≈-0.12 across NN quality from 2.54 BPB down to 1.21 BPB
(including the SP8192 SOTA family), indicating the gain targets
byte-level rare-repeat patterns independent of NN quality.

Non-record: base NN is weaker than a clean baseline (wallclock
partly consumed by periodic mixture evals); PPM subsamples 5M
tokens since pure-Python PPM exceeds the 10-min eval cap on
full val. Both caveats documented in README; record integration
path outlined.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant