
Record: PR #1854 neural stack — budget-compliant 1.06777 (3-seed mean) #1883

Open

robbiebusinessacc wants to merge 2 commits into openai:main from robbiebusinessacc:submission/multibin-lambda

Conversation

@robbiebusinessacc

Summary

3-seed validated reproduction of PR #1854's neural stack with PHASED_TTT_PREFIX_DOCS reduced from 2000 → 1500 to fit cleanly under the 600s evaluation budget. val_bpb 1.06777 (3-seed mean, std 0.00106) on 8×H100 SXM.

vs merged-leaderboard SOTA PR #1493 (@bigbag, 1.0810): −0.01323 BPB at ~13σ statistical significance, p ≪ 0.0001 against the 0.005-nat threshold.

| Seed | val_bpb | Total bytes | Eval time |
|------|---------|-------------|-----------|
| 42   | 1.06686 | 15,952,086  | 374.6s    |
| 1337 | 1.06893 | 15,949,941  | 371.0s    |
| 314  | 1.06752 | 15,951,195  | 327.7s    |
| Mean | 1.06777 | 15,951,074  | 357.8s    |
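As a sanity check, the headline statistics can be recomputed from the per-seed numbers above. A minimal sketch, assuming the ~13σ figure is the BPB delta vs PR #1493 divided by the per-seed sample std (the leaderboard harness may use a different test):

```python
# Recompute the 3-seed mean/std and the margin vs merged SOTA PR #1493 (1.0810).
# Assumption: "~13 sigma" is delta / per-seed sample std; the exact significance
# test used by the leaderboard harness may differ.
import statistics

per_seed_bpb = {42: 1.06686, 1337: 1.06893, 314: 1.06752}
vals = list(per_seed_bpb.values())

mean_bpb = statistics.mean(vals)    # ~1.06777
std_bpb = statistics.stdev(vals)    # ~0.00106 (n-1 sample std)
delta = 1.0810 - mean_bpb           # ~0.01323 BPB below PR #1493
print(f"mean={mean_bpb:.5f} std={std_bpb:.5f} delta={delta:.5f} sigma~{delta / std_bpb:.1f}")
```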

Compliance

  • All 3 artifacts under 16,000,000 bytes (max 15,952,086, margin 47,914)
  • All 3 eval times under 600s (max 374.6s, margin 225.4s)
  • Training cap-bound at 600s, all 3 seeds
  • Headline val_bpb is the standard token-level NLL → byte path; no byte-PPM mixture is claimed. The exploratory multibin-λ refinement of the PR #1835 mixer ("Record: SP8192 + PPM-D byte mixture — 1.00136 BPB (3-seed mean)") is included in train_gpt.py for reproducibility, but its mix_bpb is not the reported number; see the README section "Note on byte-PPM mixture" for the reasoning.
  • Score-first phased TTT only (lineage of PR #1413, "Record: SP8192 + QK-Gain 5 + Legal Score-First TTT — val_bpb 1.08279 (3-seed mean)"). No pre-quant TTT, no SLOT, no n-gram cache, no logit bias.
  • CaseOps tokenizer byte counting via the fineweb_val_bytes_*.bin sidecar that recovers original UTF-8 byte counts; the inflated piece.encode() path is explicitly bypassed (train_gpt.py:387-389, 2618-2626), and a sketch of the idea follows this list. Full audit and Eppie/mhuen-style normalization proof in the README.
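For readers who want the byte-counting idea from the last bullet in code form, here is a minimal sketch. It assumes the sidecar is a flat array of per-document original UTF-8 byte counts; the actual layout and the real counting code live in train_gpt.py (lines 387-389, 2618-2626).

```python
# Hedged sketch of CaseOps-safe byte counting: sum original UTF-8 byte counts
# recorded in the sidecar at data-prep time, instead of re-encoding tokenizer
# pieces. The flat-uint32 layout is an assumption for illustration only.
import numpy as np

def total_val_bytes(sidecar_path: str) -> int:
    byte_counts = np.fromfile(sidecar_path, dtype=np.uint32)  # one count per val document (assumed)
    return int(byte_counts.sum())

def inflated_piece_bytes(pieces: list[str]) -> int:
    # The bypassed path: summing bytes of tokenizer piece strings over-counts
    # whenever CaseOps markers or lossless-caps expansions add characters.
    return sum(len(p.encode("utf-8")) for p in pieces)
```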

What's new vs PR #1854

PR #1854's reported eval wallclock is ~700s (per its own log: ttt_phased 516s + ppm_mix 116s + diagnostics 67s), which exceeds the 600s budget. This submission reproduces the same neural stack with PHASED_TTT_PREFIX_DOCS=1500 and lands at the same post-TTT val_bpb (~1.067) cleanly under 600s. Closed PRs in #677 cite an over-budget eval as grounds for rejection (e.g. PR #503), so a budget-compliant 1.067 is a more defensible record candidate.

Files

records/track_10min_16mb/2026-04-28_PR1854_BudgetCompliant_1.0678/

  • README.md — full methodology, normalization proof, lineage and credits
  • submission.json — metadata
  • train_gpt.py — neural stack (mixer-independent headline path)
  • lossless_caps.py, prepare_caseops_data.py, tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model — verbatim from PR #1854 ("Record: PR #1797 base + PPM-D byte mixture — val_bpb 0.90236 (3-seed mean)")
  • train_seed{42,1337,314}.log — per-seed train+eval logs
  • final_model.int6.ptz — quantized model artifact

Test plan

  • Trains within 600s wallclock on 8×H100 80GB SXM (cap-bound, all 3 seeds)
  • All 3 artifacts under 16 MB cap
  • Eval completes within 600s wallclock cap (max 374.6s)
  • 3-seed mean reproduced; per-seed numbers verified in attached logs
  • Full-vocab softmax normalization (standard F.cross_entropy over V=8192) — README §"Normalization proof"
  • Byte denominator equals original UTF-8 bytes via sidecar — README §"Normalization proof" (2); the NLL → byte conversion is sketched below
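A minimal sketch of the metric path those last two checks refer to: full-vocab cross-entropy in nats, converted to bits per byte with the original UTF-8 byte count as the denominator. Names and shapes are illustrative, not the actual train_gpt.py code.

```python
# val_bpb = (sum of token NLL in nats) / (ln(2) * original UTF-8 bytes).
# F.cross_entropy normalizes over the full V=8192 vocab by construction,
# which is the "full-vocab softmax normalization" checked above.
import math
import torch
import torch.nn.functional as F

def val_bpb(logits: torch.Tensor, targets: torch.Tensor, total_utf8_bytes: int) -> float:
    # logits: (N, 8192) full-vocab scores; targets: (N,) next-token ids
    nll_nats = F.cross_entropy(logits, targets, reduction="sum")
    return nll_nats.item() / (math.log(2) * total_utf8_bytes)
```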

robbiebusinessacc and others added 2 commits April 28, 2026 00:59
…tion — val_bpb 1.06777 (3-seed mean)

3-seed validated reproduction of PR openai#1854's neural stack with PHASED_TTT_PREFIX_DOCS=1500 to fit the 600s eval budget. Beats merged SOTA PR openai#1493 (bigbag, 1.0810) by 0.01323 BPB at ~13σ statistical significance.

Reported val_bpb is the standard token-level NLL → byte conversion (no byte-PPM mixture claimed). The exploratory multibin-λ refinement of PR openai#1835's mixer is included in train_gpt.py for completeness but its mix_bpb is not the headline claim, due to an open community question on byte-spread normalization vs Kraft compliance.
