Legality clarification: byte-level PPM-D mixture submissions (#1835 / #1850 / #1854 cluster) under Issue #1017 C2 #1872

@andrewbaggio1

hi @cocohearts, there are a few recent record submissions (#1835, #1850, #1854, #1858, #1862, #1833, #1871, and #1865) reporting val_bpb in the 0.85 to 1.00 range using byte-level PPM-D mixtures.

what the submissions do

according to #1835 and ports in #1854 / #1858:

  1. NN side: standard sliding-window token scoring (unchanged from base
    record). Each token's NLL is then "bit-conservingly spread" across
    its bytes: an n-byte token with token probability p assigns
    probability p^(1/n) to each of its constituent byte positions.
  2. PPM side: classical byte-level PPM-D order-5 with Cleary-Witten
    escape, state accumulated from already-scored bytes only.
  3. Mix in probability space: p_mix = λ · p_NN + (1−λ) · p_PPM,
    binary-λ gate on PPM's local confidence. Score
    −log p_mix(realized_byte_t). Counts updated AFTER scoring.
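
To make steps 1 and 3 concrete, here is a minimal sketch of the per-byte spread and the gated mixture. It is not taken from any of the linked submissions: the function names, the `ppm_confident` flag, and the two λ values (0.1 and 0.9) are hypothetical placeholders for whatever gate the submissions actually use.

```python
import math

def spread_token_prob(p_token: float, n_bytes: int) -> float:
    """Per-byte probability p^(1/n): n_bytes copies multiply back to p_token."""
    return p_token ** (1.0 / n_bytes)

def mix_byte_prob(p_nn: float, p_ppm: float, ppm_confident: bool,
                  lam_low: float = 0.1, lam_high: float = 0.9) -> float:
    """Binary-lambda gate: lean on PPM when it is confident, else on the NN.

    lam_low and lam_high are made-up values for illustration only.
    """
    lam = lam_low if ppm_confident else lam_high
    return lam * p_nn + (1.0 - lam) * p_ppm

def byte_nll_bits(p: float) -> float:
    """Score of the realized byte in bits."""
    return -math.log2(p)

# A 3-byte token with token probability 1/8 gives each byte probability 1/2.
p_nn_byte = spread_token_prob(0.125, 3)
p_mix = mix_byte_prob(p_nn_byte, p_ppm=0.8, ppm_confident=True)
bits = byte_nll_bits(p_mix)
```

The gate here is computed before the realized byte is looked at, which is what the mixture needs in order to stay a distribution.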

I think C1, C3, C4 are clean

  • C1 (causality): PPM context at byte t uses bytes <t only. ✅
  • C3 (score-before-update): byte counts at t reflect bytes <t;
    count for byte t is incremented only after −log p_mix(t) is
    recorded. ✅
  • C4 (single pass): one left-to-right traversal of val bytes, no
    rescoring, no oracle selection across passes. ✅
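
The three properties above can be read off a single loop. The sketch below uses a plain fixed-order count model with add-one smoothing as a stand-in, not PPM-D (no escape mechanism, no order fallback); `score_stream` and its `order` parameter are hypothetical names. It only illustrates where the context read, the scoring, and the count update sit relative to each other.

```python
import math
from collections import defaultdict

def score_stream(data: bytes, order: int = 2) -> float:
    """Single left-to-right pass (C4) returning total bits for the stream."""
    # counts[context][byte] -> number of times byte followed context so far
    counts = defaultdict(lambda: defaultdict(int))
    total_bits = 0.0
    for t, b in enumerate(data):
        ctx = data[max(0, t - order):t]        # bytes < t only (C1)
        seen = counts[ctx]
        total = sum(seen.values())
        p = (seen[b] + 1) / (total + 256)      # add-one smoothing over 256 bytes
        total_bits += -math.log2(p)            # score is recorded first (C3)
        seen[b] += 1                           # count for byte t updated afterwards
    return total_bits
```

Moving the `seen[b] += 1` line above the scoring line would be exactly the kind of update-before-score leak that C3 rules out.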

C2 is questionable

Issue #1017 III defines C2 as a normalized distribution over "the
official fixed token alphabet Σ". The submissions construct a
distribution over the byte alphabet (256 symbols) at each byte
position, not over the SP8192 token vocab.

Two possible readings:

(a) Σ = the SP8192 token vocab. Then the mixture isn't a
token-level distribution at all. The NN side's per-byte value is a
scalar functional of the neural token distribution, which the
common-violations table in #1017 VI flags directly:

Entropy expert in context mixer (scalar functional of neural dist,
not a distribution over Σ) — Condition 2 violation

Under this reading, every submission in the PPM-D cluster fails C2.

(b) Σ = the byte alphabet. Then p_NN (via bit-conserving spread)
and p_PPM (classical PPM-D) are both normalized over 256 symbols,
their convex combination is also normalized, and C2 is satisfied.
Under this reading the cluster is legal.
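
For reference, the normalization step in reading (b) is one line. It takes the premise of (b) as given (both components are full 256-way distributions) and additionally assumes λ is fixed from the context before byte t is observed:

```latex
\sum_{b=0}^{255} p_{\text{mix}}(b)
  = \lambda \sum_{b=0}^{255} p_{\text{NN}}(b)
  + (1-\lambda) \sum_{b=0}^{255} p_{\text{PPM}}(b)
  = \lambda + (1-\lambda) = 1
```

If λ were allowed to depend on the realized byte itself, this identity would no longer hold in general.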

The contest's BPB formula in #1017 V (val_bpb = (total_cross_entropy_nats / log 2) × (token_count / byte_count)) is
written assuming token-level scoring converted via the LUT. A
byte-level scoring path is a different evaluation procedure even when
the byte stream and total bits are identical.
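
A small check of the "total bits are identical" point for the NN side alone, before any mixing. The token list is made up for illustration, and bits-per-byte is computed here simply as total bits divided by byte count, which is my simplification and not the LUT-based procedure from #1017 V.

```python
import math

def token_level_bits(tokens):
    """Total bits when each token is scored once with probability p."""
    return sum(-math.log2(p) for p, _ in tokens)

def byte_spread_bits(tokens):
    """Total bits when each of a token's n bytes is scored with p^(1/n)."""
    total = 0.0
    for p, n in tokens:
        p_byte = p ** (1.0 / n)
        total += n * -math.log2(p_byte)
    return total

# (token probability, token length in bytes); hypothetical values
tokens = [(0.5, 1), (0.125, 3), (0.25, 4)]
byte_count = sum(n for _, n in tokens)

bpb_token = token_level_bits(tokens) / byte_count
bpb_byte = byte_spread_bits(tokens) / byte_count
```

The two numbers agree for the unmixed NN path; once p_PPM is mixed in per byte, the byte-level total is no longer a function of the token-level NLLs alone, which is why the evaluation procedure matters for the ruling.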

whichever way the ruling goes, having it on the record
will save a lot of agent-cycles between now and April 30.

(also, i'm not affiliated with any of the submissions linked above; i'm just opening
this so the queue clears.)
