# Legality clarification: byte-level PPM-D mixture submissions (#1835 / #1850 / #1854 cluster) under Issue #1017 C2

hi @cocohearts, there are several recent record submissions (#1835, #1850, #1854, #1858, #1862, #1833, #1871, and #1865) reporting val_bpb in the 0.85 to 1.00 range using byte-level PPM-D mixtures.

## what the submissions do

according to #1835 and the ports in #1854 / #1858:

1. NN side: standard sliding-window token scoring (unchanged from base
   record). Each token's NLL is then "bit-conservingly spread" across
   its bytes: an `n`-byte token with token probability `p` assigns
   probability `p^(1/n)` to each of its `n` byte positions, so the
   per-byte factors multiply back to `p`.
2. PPM side: classical byte-level PPM-D order-5 with Cleary-Witten
   escape, state accumulated from already-scored bytes only.
3. Mix in probability space: `p_mix = λ · p_NN + (1−λ) · p_PPM`,
   where `λ` takes one of two values, gated on PPM's local confidence.
   Score `−log p_mix(realized_byte_t)`. Counts are updated AFTER scoring.
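
To make steps 1 and 3 concrete, here is a minimal sketch, not the code from any of the submissions. The function names, the two `λ` values, and the direction of the gate (less NN weight when PPM is confident) are my assumptions for illustration.

```python
import math

def spread_token_prob(p_token: float, n_bytes: int) -> float:
    """Per-byte probability p^(1/n); the n factors multiply back to p_token."""
    return p_token ** (1.0 / n_bytes)

def mix_byte_prob(p_nn: float, p_ppm: float, ppm_confident: bool,
                  lam_low: float = 0.1, lam_high: float = 0.9) -> float:
    """Two-valued lambda gate (hypothetical values), then a convex mix."""
    lam = lam_low if ppm_confident else lam_high
    return lam * p_nn + (1.0 - lam) * p_ppm

# A 3-byte token with token probability 0.125 gets 0.125^(1/3) per byte.
p_byte = spread_token_prob(0.125, 3)
bits_token = -math.log2(0.125)          # bits charged at the token level
bits_bytes = 3 * -math.log2(p_byte)     # same bits, spread over 3 bytes
assert math.isclose(bits_token, bits_bytes)
```

The assertion is the "bit-conserving" property: with the NN alone (`λ = 1`), the spread changes where the bits are booked, not how many there are.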

## I think C1, C3, C4 are clean

- **C1 (causality):** PPM context at byte `t` uses bytes `<t` only. ✅
- **C3 (score-before-update):** byte counts at `t` reflect bytes `<t`;
  count for byte `t` is incremented only after `−log p_mix(t)` is
  recorded. ✅
- **C4 (single pass):** one left-to-right traversal of val bytes, no
  rescoring, no oracle selection across passes. ✅
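
The three properties above can be sketched with a toy adaptive byte model. This is NOT PPM-D: it is a fixed-order count model with add-one smoothing and no escape mechanism, used only as a stand-in to show the ordering of context lookup, scoring, and update. `score_stream` and its `order` parameter are hypothetical names.

```python
import math
from collections import defaultdict

def score_stream(data: bytes, order: int = 2) -> float:
    """Total bits for one left-to-right pass over `data`."""
    counts = defaultdict(lambda: defaultdict(int))
    total_bits = 0.0
    for t, byte in enumerate(data):          # single pass, no rescoring (C4)
        ctx = data[max(0, t - order):t]      # context uses bytes < t only (C1)
        seen = counts[ctx]
        total = sum(seen.values())
        p = (seen[byte] + 1) / (total + 256) # add-one smoothing over 256 symbols
        total_bits += -math.log2(p)          # record the score first (C3)
        seen[byte] += 1                      # only then count byte t
    return total_bits
```

Swapping the last two statements of the loop body would let byte `t` contribute to its own probability, which is the C3 violation the submissions appear to avoid.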

## C2 is questionable

Issue #1017 III defines C2 as a normalized distribution over "the
official fixed token alphabet Σ". The submissions construct a
distribution over the **byte** alphabet (256 symbols) at each byte
position, not over the SP8192 token vocab.

Two possible readings:

**(a) Σ = the SP8192 token vocab.** Then the mixture isn't a
token-level distribution at all. The NN side's per-byte value is a
scalar functional of the neural token distribution, which the
common-violations table in #1017 VI flags directly:
> Entropy expert in context mixer (scalar functional of neural dist,
> not a distribution over Σ) — Condition 2 violation

Under this reading, every submission in the PPM-D cluster fails C2.

**(b) Σ = the byte alphabet.** Then `p_NN` (via bit-conserving spread)
and `p_PPM` (classical PPM-D) are both normalized over 256 symbols,
their convex combination is also normalized, and C2 is satisfied.
Under this reading the cluster is legal.
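
The normalization step in reading (b) is one line of algebra. It assumes that both components sum to 1 over the 256 byte values, and that `λ` is fixed by the context before the realized byte is seen:

$$
\sum_{b=0}^{255} p_{\text{mix}}(b)
= \lambda \sum_{b=0}^{255} p_{\text{NN}}(b) + (1-\lambda) \sum_{b=0}^{255} p_{\text{PPM}}(b)
= \lambda + (1-\lambda) = 1
$$

The description of the spread only gives `p_NN` at the realized byte, so its normalization over all 256 values is taken as given here rather than shown.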

The contest's BPB formula in #1017 V (`val_bpb =
(total_cross_entropy_nats / log 2) × (token_count / byte_count)`) is
written assuming token-level scoring converted via the LUT. A
byte-level scoring path is a different evaluation procedure even when
the byte stream and total bits are identical.
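
A toy comparison of the two evaluation paths, as a sketch under stated assumptions. I read the cross-entropy term as a per-token mean in nats, since the `token_count / byte_count` factor only yields bits per byte under that reading; that interpretation is mine, not #1017's. The function names and the two-token example are hypothetical, and the byte path uses the NN spread alone (no PPM mixing).

```python
import math

def val_bpb_token_path(mean_ce_nats: float, token_count: int, byte_count: int) -> float:
    """BPB formula as quoted from #1017 V, with a per-token mean (assumed)."""
    return (mean_ce_nats / math.log(2)) * (token_count / byte_count)

def val_bpb_byte_path(per_byte_bits: list) -> float:
    """Average of per-byte scores, as a byte-level scorer would report."""
    return sum(per_byte_bits) / len(per_byte_bits)

# Toy stream: (token probability, token length in bytes).
tokens = [(0.25, 2), (0.5, 1)]
byte_count = sum(n for _, n in tokens)

mean_nats = sum(-math.log(p) for p, _ in tokens) / len(tokens)
token_bpb = val_bpb_token_path(mean_nats, len(tokens), byte_count)

# Spread each token's probability as p^(1/n) over its n bytes.
byte_bits = [-math.log2(p ** (1.0 / n)) for p, n in tokens for _ in range(n)]
byte_bpb = val_bpb_byte_path(byte_bits)

assert math.isclose(token_bpb, byte_bpb)  # same total bits, different procedure
```

The two numbers agree here only because nothing is mixed in at the byte level; once `p_PPM` enters, the byte path produces a score that the token-level formula has no way to express.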

whichever way the ruling goes, having it on the record
will save a lot of agent-cycles between now and April 30.

(also, i'm not affiliated with any of the submissions linked above; i'm just opening
this so the queue clears.)

