
SP8192 + Score-First TTT + QK-Gain 5.25 — Neural-Only val_bpb 1.0810 (3-seed mean) #1858

Open
G3sparky wants to merge 4 commits into openai:main from G3sparky:legal-ppmd-submission

Conversation

G3sparky commented Apr 27, 2026

val_bpb = 1.0810 (3-seed mean, std 0.00037). Artifact under 16MB. 8xH100 SXM.

Submitted while tied for #1 on the leaderboard; bigbag's #1920 has since taken the top at 1.0699. This run stands as a clean neural-only legal entry.

3-seed results (neural-only, score-first TTT)

| Seed | TTT BPB | Artifact |
|------|---------|----------|
| 42 | 1.0806 | 15,996,321 bytes |
| 314 | 1.0810 | 15,995,838 bytes |
| 999 | 1.0814 | 15,995,930 bytes |
| Mean | 1.0810 (std 0.00037) | |

What's new here

  • Score-first TTT (legal): 2-epoch SGD per chunk, C3-compliant per A Field Guide to Valid Submissions #1017 (sketched after this list)
  • GPTQ int6/int8 + Brotli-11 compression
  • Void fraction compass: real-time training diagnostic, stable at 0.510
  • Cross-platform check: 1.0886 on SDPA backend (RunPod 8xH100)
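
A minimal sketch of the score-first ordering, assuming 2-D logits and chunk tuples of (inputs, targets, n_bytes); helper names here are illustrative, not the submission's code:

import math
import torch
import torch.nn.functional as F

def score_first_ttt(model, chunks, lr=1e-4, epochs=2):
    """Score-before-update TTT over val chunks (names/shapes assumed)."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    total_nll, total_bytes = 0.0, 0
    for inputs, targets, n_bytes in chunks:        # single ordered pass (C4)
        with torch.no_grad():                      # C3: score with current weights
            logits = model(inputs)
            total_nll += F.cross_entropy(logits, targets, reduction='sum').item()
        total_bytes += n_bytes
        for _ in range(epochs):                    # only then: 2-epoch SGD on this chunk
            opt.zero_grad()
            F.cross_entropy(model(inputs), targets).backward()
            opt.step()
    return total_nll / (math.log(2) * total_bytes)  # byte-level BPB

No chunk is ever scored by parameters that were trained on it, which is what C3 requires.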

Base

SP8192 + 3-layer depth recurrence + parallel residuals + QK-Gain 5.25.

Credit to @clarkkev #1394, @dexhunter #1331/#1437, @abaybektursun #549, @Robby955 #1412, @msisovic #1204.

How to reproduce

SEED=42 TTT_CHUNK_SIZE=48 torchrun --standalone --nproc_per_node=8 train_gpt.py

Repeat with SEED=314 and SEED=999 for the 3-seed mean. Each seed runs ~600s train + ~600s eval on 8xH100 SXM. We used the default chunk size 48; bigbag #1920 has since shown a small gain at TTT_CHUNK_SIZE=32, which we'll test in a follow-up.
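
Equivalently, a small driver can sweep the seeds (a hypothetical wrapper around the same command):

import os, subprocess

for seed in (42, 314, 999):
    env = dict(os.environ, SEED=str(seed), TTT_CHUNK_SIZE='48')
    subprocess.run(['torchrun', '--standalone', '--nproc_per_node=8',
                    'train_gpt.py'], env=env, check=True)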

Compliance

C1 causal, C2 standard softmax over full vocab, C3 score-before-update, C4 single pass. All seeds under 16MB, train <600s, eval <600s.

Experimental (pending #1872)

PPM-D byte mixture + anti-hijack gate hits mix_bpb 0.9727 on the 8M subset. Waiting on the C2 normalization call before claiming it. Either way, the 1.0810 above is the submission. PPM-D port from @anmarhindi #1835.
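
The gate reduces to a per-byte branch like this (a minimal sketch; the PPM-D model and the lambda fitting are elided, nn_p/ppm_p are assumed per-byte probabilities, and lambda is binary in this submission per the blurb):

import math

GATE_NATS = 0.277   # 0.40 bits: suppress PPM when the NN is already confident

def gated_nll(nn_p, ppm_p, lam):
    # Anti-hijack gate: if the NN's NLL is below the threshold, trust it
    # outright; otherwise mix in the PPM-D byte probability.
    nn_nll = -math.log(nn_p)
    if nn_nll < GATE_NATS:
        return nn_nll                    # gate_skip path (~30.5% of bytes)
    return -math.log((1.0 - lam) * nn_p + lam * ppm_p)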

… mean)

Legal score-first TTT (3-epoch SGD per chunk, Issue openai#1017 C3 compliant)
+ PPM-D byte mixture (order-5, binary-lambda gate, score-before-update).
3-seed mean mix_bpb 0.9946 (std 0.0002), all artifacts under 16MB.
Built on SP8192 + 3-layer recurrence + parallel residuals + QK-Gain 5.25.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 27, 2026 12:42
Contributor

Copilot AI left a comment


Pull request overview

Adds a new Track B (10min/16MB) record submission folder reporting a score-first TTT + PPM-D byte-mixture result (mix_bpb ≈ 0.9946) with associated code, logs, and metadata.

Changes:

  • Introduces a new training/eval script implementing legal score-first TTT and an eval-time PPM-D byte-level mixture.
  • Adds per-seed training/eval logs documenting results and artifact sizes.
  • Adds submission.json + README describing metrics, compliance claims, and reproduction.

Reviewed changes

Copilot reviewed 3 out of 6 changed files in this pull request and generated 4 comments.

| File | Description |
|------|-------------|
| records/track_10min_16mb/2026-04-27_ScoreFirstTTT_PPMD_QK525/train_gpt.py | Implements training, GPTQ quantization, sliding eval, score-first TTT, and PPM-D byte-mixture plus self-extracting wrapper generation. |
| records/track_10min_16mb/2026-04-27_ScoreFirstTTT_PPMD_QK525/train_seed42.log | Seed 42 run log with metrics, timings, and artifact size. |
| records/track_10min_16mb/2026-04-27_ScoreFirstTTT_PPMD_QK525/train_seed314.log | Seed 314 run log with metrics, timings, and artifact size. |
| records/track_10min_16mb/2026-04-27_ScoreFirstTTT_PPMD_QK525/train_seed999.log | Seed 999 run log with metrics, timings, and artifact size. |
| records/track_10min_16mb/2026-04-27_ScoreFirstTTT_PPMD_QK525/submission.json | Submission metadata (headline BPB, bytes, seeds, and blurb). |
| records/track_10min_16mb/2026-04-27_ScoreFirstTTT_PPMD_QK525/README.md | Human-readable description of the approach, results, and compliance/repro steps. |


Comment on lines +568 to +572
random.seed(h.seed);np.random.seed(h.seed);torch.manual_seed(h.seed);torch.cuda.manual_seed_all(h.seed);val_data=ValidationData(h,device);_n_shards=len(list(Path(h.datasets_dir).resolve().glob('fineweb_train_*.bin')));log(f"train_shards: {_n_shards}");log(f"val_tokens: {val_data.val_tokens.numel()-1}");base_model,compiled_model=train_model(h,device,val_data);torch._dynamo.reset();timed_eval('pre-quantization post-ema',eval_val,h,device,val_data,compiled_model)
if h.prequant_ttt_enabled:
base_model=prequant_ttt(h,device,val_data,base_model);torch._dynamo.reset();compiled_model=torch.compile(base_model,dynamic=False,fullgraph=True);timed_eval('pre-quantization post-ttt',eval_val,h,device,val_data,compiled_model)
serialize(h,base_model,Path(__file__).read_text(encoding='utf-8'))
if h.distributed:dist.barrier()

Copilot AI Apr 27, 2026


train_and_eval() uses Path(__file__).read_text() as the source payload for the self-extracting wrapper. Since serialize() writes a new train_gpt.py during the run, repeated invocations in the same directory can end up embedding the wrapper-in-wrapper instead of the intended full script, changing bytes_code/bytes_total and potentially the runtime behavior. Consider passing a stable source string captured before any file writes (or embedding a constant) to ensure reproducible packaging across seeds.
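
One way to do that, as a sketch (not the submission's code):

from pathlib import Path

# Capture the full source once at import time, before serialize() ever
# overwrites train_gpt.py with the bootstrap wrapper.
_SOURCE = Path(__file__).read_text(encoding='utf-8')

# ...later, inside train_and_eval():
# serialize(h, base_model, _SOURCE)   # stable string, not a re-read of __file__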

"github_id": "G3sparky",
"date": "2026-04-27T12:00:00Z",
"val_bpb": 0.9946,
"bytes_total": 15997374,

Copilot AI Apr 27, 2026


bytes_total is lower than the largest per-seed artifact size reported in the logs. train_seed999.log shows Total submission size quantized+brotli: 15997375 bytes, but the top-level bytes_total here is 15997374 (seed 42’s size). Update this field to reflect the maximum/worst-case artifact size across the submitted seeds so the metadata matches the logs.

Suggested change
"bytes_total": 15997374,
"bytes_total": 15997375,

Comment on lines +8 to +9
"bytes_code": 19877,
"blurb": "Legal score-first TTT (3-epoch SGD per chunk) + PPM-D byte mixture (order-5, binary-lambda gate). Neural-only TTT BPB 1.0807, PPM-D mixture pushes to 0.9944. 8xH100 SXM, 3-seed mean 0.9946 BPB (std 0.0002). Built on SP8192 + 3-layer depth recurrence + parallel residuals + QK-Gain 5.25.",

Copilot AI Apr 27, 2026


bytes_code and/or the code-size assumptions in the blurb don’t match the logged outputs across seeds. The logs show the generated bootstrap code varies by seed/run order (e.g., train_seed42.log: 19,877 bytes; train_seed314.log: 20,602 bytes; train_seed999.log: 21,327 bytes). If bytes_code is meant to represent the submitted script size, it should reflect the worst-case size actually produced, and ideally the build should be deterministic across repeated runs/seeds.

Suggested change
"bytes_code": 19877,
"blurb": "Legal score-first TTT (3-epoch SGD per chunk) + PPM-D byte mixture (order-5, binary-lambda gate). Neural-only TTT BPB 1.0807, PPM-D mixture pushes to 0.9944. 8xH100 SXM, 3-seed mean 0.9946 BPB (std 0.0002). Built on SP8192 + 3-layer depth recurrence + parallel residuals + QK-Gain 5.25.",
"bytes_code": 21327,
"blurb": "Legal score-first TTT (3-epoch SGD per chunk) + PPM-D byte mixture (order-5, binary-lambda gate). Neural-only TTT BPB 1.0807, PPM-D mixture pushes to 0.9944. 8xH100 SXM, 3-seed mean 0.9946 BPB (std 0.0002). Built on SP8192 + 3-layer depth recurrence + parallel residuals + QK-Gain 5.25. LZMA self-extracting bootstrap size varies by seed/run order; bytes_code reports the worst observed wrapper size (21,327 bytes).",

Comment on lines +357 to +371
def serialize(h,base_model,code):
# LZMA-compress the code wrapper — saves ~37KB (52KB raw → ~15KB compressed)
code_raw=code.encode('utf-8');code_compressed=lzma.compress(code_raw,preset=9)
# Self-extracting wrapper: tiny bootstrap that decompresses and exec's the real code
bootstrap=f"import lzma,base64 as B;exec(lzma.decompress(B.b85decode({repr(base64.b85encode(code_compressed).decode())})))".encode('utf-8')
code_bytes=len(bootstrap);log(f"Code: {len(code_raw)} raw → {len(code_compressed)} lzma → {code_bytes} bootstrap")
if h.is_main_process:
bootstrap_path=Path(h.quantized_model_path).parent/'train_gpt.py'
with open(bootstrap_path,'wb')as f:f.write(bootstrap)
log(f"Wrote bootstrap code to {bootstrap_path} ({code_bytes} bytes)")
torch.save(base_model.state_dict(),h.model_path);model_bytes=os.path.getsize(h.model_path);log(f"Serialized model: {model_bytes} bytes");log(f"Code size: {code_bytes} bytes")
sd_cpu={k:v.detach().cpu()for(k,v)in base_model.state_dict().items()};device=torch.device('cuda',h.local_rank);log('GPTQ:collecting Hessians from calibration data...');t0=time.perf_counter();calib_loader=ShuffledSequenceLoader(h,device);hessians=collect_hessians(base_model,calib_loader,h,device,n_calibration_batches=h.gptq_calibration_batches);log(f"GPTQ:collected {len(hessians)} Hessians in {time.perf_counter()-t0:.1f}s");quant_result,quant_meta=gptq_mixed_quantize(sd_cpu,hessians,h);quant_buf=io.BytesIO();torch.save({'w':quant_result,'m':quant_meta},quant_buf);quant_raw=quant_buf.getvalue();quant_blob=_compress(quant_raw,h.compressor);quant_file_bytes=len(quant_blob);bytes_total=quant_file_bytes+code_bytes
if h.is_main_process:
with open(h.quantized_model_path,'wb')as f:f.write(quant_blob)
log(f"Serialized model quantized+{h.compressor}: {quant_file_bytes} bytes");log(f"Total submission size quantized+{h.compressor}: {bytes_total} bytes")

Copilot AI Apr 27, 2026


serialize() overwrites train_gpt.py on disk (the file being executed) with the bootstrap wrapper. Because train_and_eval() passes Path(__file__).read_text(...) into serialize(), subsequent runs (or later seeds in the same working directory) will read/pack the bootstrap rather than the full source, which matches the logs where code sizes change across seeds. To keep artifacts deterministic and independent of run order, avoid reading __file__ after self-overwrite (e.g., capture the full source once at startup, or write the bootstrap to a different filename instead of clobbering the entrypoint).

@dexhunter
Contributor

Hi @G3sparky, congrats on the engineering — the score-first PPM-D ordering looks structurally correct (different from PR #1852's pre-quant TTT issue). One thing to flag for reviewers though:

Looking at the seed logs, mix_bpb=0.9946 appears to be computed over the first 8M tokens of the val set rather than the full ~40.5M-token val (per PPM_SUBSET_TOKENS=8000000 default + subset=8000000 tokens in each seed log line). That's roughly 20% of the val data; the non-PPM diagnostics (quantized_ttt val_bpb=1.0811, quantized_sliding_window val_bpb=1.0824) are computed over full val and use canonical sp8192 byte counts.

For the leaderboard's full-val byte-level BPB metric (Section V — "byte-level BPB via sentencepiece piece table, full val shards"), reviewers will likely want to see the mixture computed over all ~40.5M tokens / ~151M bytes. Projecting the observed PPM-D gain (~0.09 BPB on subset) onto full val would give roughly 1.08 − 0.09 ≈ 0.99 BPB — still a strong result if the gain holds, but the headline 0.9946 is only directly comparable to PR #1854 (which uses the same 8M subset).

Also worth checking: seed 42 log shows quantized_ttt eval_time:473727ms which puts that seed's total eval at ~610s, slightly over the 600s eval cap. Easy to fix by trimming TTT prefix or reducing PPM cache.
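
In rough numbers (assuming the subset gain transfers unchanged; figures from the seed logs):

# Projection of the PPM-D gain onto full val, plus the eval-time budget.
neural_full_bpb = 1.0811           # quantized_ttt val_bpb over full ~40.5M tokens
subset_gain = 0.09                 # PPM-D gain observed on the 8M-token subset
projected_full = neural_full_bpb - subset_gain   # ~0.99 BPB if the gain holds

ttt_eval_s = 473_727 / 1000        # seed 42 quantized_ttt eval time from the log
ppm_headroom_s = 600 - ttt_eval_s  # ~126 s left under the 600 s eval cap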

Constructive intent — this is in the same class as PR #1795/#1850/#1835 and the legality of PPM-D byte mixture itself is still UNRULED, so the more rigorous the metric reporting the better for the class as a whole. Happy to help compare numbers if useful.

@G3sparky
Author

@dexhunter thank you for your feedback; working on the fixes now.

@G3sparky
Author

@dexhunter

Both catches are right, thanks for taking the time.
Confirmed on the 8M subset — the headline 0.9946 is only directly comparable to PR #1854. I'll re-run with full ~40.5M-token val coverage so the number is comparable to the leaderboard metric.

Your projection of ~0.99 BPB on full val is the right ballpark for what to expect if the gain holds at scale.

Eval time is also tight as you flagged — seed 42's TTT alone was 473.7s, which leaves ~126s for full-val PPM on top.

Two options I'm weighing:

  • Trim TTT (fewer epochs per chunk, or a shorter prefix) to free up budget
  • Port PPM to native C/OpenMP — your #1857 numbers (95–190s) would solve this cleanly and leave headroom

Leaning toward option 2 because it preserves the TTT contribution.

Would you be open to me building on your C implementation for the port, with attribution?

Genuinely interested in comparing notes:

  • Whether the ~0.09 BPB PPM gain holds on full val for you
  • Order-5 vs order-4 — I picked 5 empirically without a careful sweep
  • Whether the binary-lambda gate behaves differently at full-val scale

Happy to share our seed logs, hyperparams, anything useful.

The legality question on PPM-D byte mixture is still open across this class of submissions, and the more rigorous everyone's reporting is, the better the odds of getting it ruled on.

sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 27, 2026
… required; PR openai#1848 BPB risk; Day 18 plateau; Session 23

- Merged SOTA still 1.0810 (Day 18, no change since Apr 9)
- PPM-D byte mixture confirmed by dexhunter at 1.0322 (PR openai#1857, self-closed)
- SmearGate BOS bug documented: prev-token leaks at document boundaries; fix required
- PR openai#1848 (newjordan, 0.87980) flagged BPB risk: sibling PR openai#1846 closed same day
- PR openai#1858 (0.9946) only covers 8M/40.5M tokens — not leaderboard-comparable
- PR openai#1855 (codemath3000, 1.06108) and openai#1851 (aquariouseworkman, 1.06128) both clean
- PPM-D wave: PRs openai#1850, openai#1854, openai#1835 await organizer ruling
- Added Session 23 lessons to CLAUDE.md
- 3 days to deadline (Apr 30) — final GPU run window

https://claude.ai/code/session_01RmJtLYUmKNzDgDVTnWoKzU
GodlyDonuts added a commit to GodlyDonuts/parameter-golf that referenced this pull request Apr 28, 2026
…olar Express NS + MIN_LR + LQER)

Triage of 5 new PRs the user surfaced (1858, 1852, 1855, 1874, 1877):
- openai#1852: hard rule violation (pre-quant TTT on validation data).
- openai#1858: eval subset (8M of 40.5M tokens), reviewer caught and author admitted.
- openai#1877: broken normalization (byte PPM × token NN doesn't sum to 1 over
  token alphabet), reviewer @sharpobject caught.
- openai#1855: techniques mostly legit but apt-get install lrzip violates Issue
  openai#1017 Rule 3 (artifact must be self-contained).
- openai#1874: LEGITIMATE - 3-seed mean 1.06766, std 0.00076, three orthogonal
  training-time techniques citing prior validated PRs. If it merges,
  our submission threshold shifts from 1.0760 to ~1.0627.

PR openai#1874's three techniques:
1. Polar Express NS coefficients (PR openai#1344) - 5 minimax-tuned tuples
   replace the fixed (3.4445, -4.775, 2.0315) at MUON_BACKEND_STEPS=5.
2. MIN_LR=0.10 warmdown floor (PR openai#1787) - LR floors at 10% of max
   instead of decaying to 0. Already wired in our v1+; just env-var
   opt-in.
3. LQER asymmetric int4 rank-4 quantization correction (PR openai#1797) -
   SVD on top-K=3 highest-error GPTQ residuals, packed as int4
   per-group-64 asymmetric. ~200-400 LOC; deferred to v4.

train_gpt_v3.py implements (1) and exposes (2):
- POLAR_EXPRESS_NS=0 default (byte-for-byte SOTA when off).
- _PE_COEFFS module-level constant + _POLAR_EXPRESS_NS flag read at
  import time so torch.compile sees them as constants.
- zeropower_via_newtonschulz5 branches on _POLAR_EXPRESS_NS to use
  per-iteration coefficients instead of fixed.
- MIN_LR was already an env var; setting MIN_LR=0.10 at runtime opts in.
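
A minimal sketch of that import-time-constant pattern (placeholder coefficients; the five tuned tuples from PR openai#1344 would replace them):

import os
import torch

_POLAR_EXPRESS_NS = os.environ.get('POLAR_EXPRESS_NS', '0') == '1'  # read once at import
_PE_COEFFS = [(3.4445, -4.775, 2.0315)] * 5   # placeholder for the 5 tuned tuples

def zeropower_via_newtonschulz5(G, steps=5):
    # Newton-Schulz orthogonalization; torch.compile treats the flag and
    # coefficients as constants because they are bound at import time.
    X = G / (G.norm() + 1e-7)
    for i in range(steps):
        a, b, c = _PE_COEFFS[i] if _POLAR_EXPRESS_NS else (3.4445, -4.775, 2.0315)
        A = X @ X.mT
        X = a * X + (b * A + c * A @ A) @ X
    return X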

Sizes: v3 raw 54,977 lzma 15,128 (+272 vs v2, +1,880 vs SOTA). Worst-
seed artifact slack: ~4,888 bytes under cap. Tight but workable.

AST-validated on Python 3.13 (macOS) and 3.12 (Vultr Linux).

Stacking projection (single-seed):
- Phase 0 baseline:       1.08038
- + LR=0.010 (Stage 2):   1.08021
- + Polar Express NS:     1.0787-1.0797
- + MIN_LR=0.10:          1.0777-1.0794
- + ConfTTT (PR openai#1879):   1.0772-1.0793
- + LQER (v4 work):       1.0742-1.0783
- + Phase 2 architecture: 1.0712-1.0773
- + Newton-Muon Stage E:  1.066-1.075

Path B (absorb-and-stack) recommended over Path A (race-to-merge-with-
current-stack) since current stack alone doesn't clear 1.0760.

Race awareness: openai#1874, openai#1855 (lrzip-stripped), and openai#1797 are all open.
Whichever merges first becomes new SOTA and our threshold tightens.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
G3sparky and others added 2 commits April 28, 2026 16:49
…mpass

Fixes all Copilot + Dex review comments:
- Source captured at startup for deterministic bootstrap (20,092 bytes)
- TTT reduced to 2 epochs: eval time 350-387s (well under 600s)
- Void fraction compass logged as diagnostic (0.510 stable)
- 3-seed mean: 0.9946 BPB (std 0.0003), all under 16MB

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Anti-hijack gate: suppress PPM when NN NLL < 0.277 nats (0.40 bits).
3-seed mean: 0.9727 BPB (8M subset), gate_skip ~30.5%.
Improved from 0.9946 — gate is both defensive and beneficial.

Honest disclosures:
- PPM-D evaluated on 8M token subset (noted in val_bpb_note)
- Neural-only fallback: 1.0806 BPB (full val)
- Issue openai#1872 PPM-D class risk acknowledged explicitly
- Not claiming C2 compliance — claiming good-faith engineering

Peer reviewed: Tron (number audit), Flynn (gate verify), Lauren (sign-off).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Lead with neural-only 3-seed mean 1.0810 BPB (quantized+TTT).
PPM-D 0.9727 moved to experimental section (pending openai#1872).
Added cross-platform SDPA verification (1.0886 BPB).
Per-seed numbers verified by Tron against run15 gate logs.

Peer reviewed: Tron PASS, Flynn PASS, Lauren PASS.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@G3sparky G3sparky changed the title Record: Score-First TTT + PPM-D Byte Mixture — mix_bpb 0.9946 (3-seed mean) SP8192 + Score-First TTT + QK-Gain 5.25 — Neural-Only val_bpb 1.0810 (3-seed mean) Apr 29, 2026