SP8192 + Score-First TTT + QK-Gain 5.25 — Neural-Only val_bpb 1.0810 (3-seed mean) #1858
G3sparky wants to merge 4 commits into openai:main
Conversation
… mean)

Legal score-first TTT (3-epoch SGD per chunk, Issue openai#1017 C3 compliant) + PPM-D byte mixture (order-5, binary-lambda gate, score-before-update). 3-seed mean mix_bpb 0.9946 (std 0.0002), all artifacts under 16MB. Built on SP8192 + 3-layer recurrence + parallel residuals + QK-Gain 5.25.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pull request overview
Adds a new Track B (10min/16MB) record submission folder reporting a score-first TTT + PPM-D byte-mixture result (mix_bpb ≈ 0.9946) with associated code, logs, and metadata.
Changes:
- Introduces a new training/eval script implementing legal score-first TTT and an eval-time PPM-D byte-level mixture.
- Adds per-seed training/eval logs documenting results and artifact sizes.
- Adds submission.json + README describing metrics, compliance claims, and reproduction.
Reviewed changes
Copilot reviewed 3 out of 6 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| records/track_10min_16mb/2026-04-27_ScoreFirstTTT_PPMD_QK525/train_gpt.py | Implements training, GPTQ quantization, sliding eval, score-first TTT, and PPM-D byte-mixture plus self-extracting wrapper generation. |
| records/track_10min_16mb/2026-04-27_ScoreFirstTTT_PPMD_QK525/train_seed42.log | Seed 42 run log with metrics, timings, and artifact size. |
| records/track_10min_16mb/2026-04-27_ScoreFirstTTT_PPMD_QK525/train_seed314.log | Seed 314 run log with metrics, timings, and artifact size. |
| records/track_10min_16mb/2026-04-27_ScoreFirstTTT_PPMD_QK525/train_seed999.log | Seed 999 run log with metrics, timings, and artifact size. |
| records/track_10min_16mb/2026-04-27_ScoreFirstTTT_PPMD_QK525/submission.json | Submission metadata (headline BPB, bytes, seeds, and blurb). |
| records/track_10min_16mb/2026-04-27_ScoreFirstTTT_PPMD_QK525/README.md | Human-readable description of the approach, results, and compliance/repro steps. |
```python
random.seed(h.seed);np.random.seed(h.seed);torch.manual_seed(h.seed);torch.cuda.manual_seed_all(h.seed);val_data=ValidationData(h,device);_n_shards=len(list(Path(h.datasets_dir).resolve().glob('fineweb_train_*.bin')));log(f"train_shards: {_n_shards}");log(f"val_tokens: {val_data.val_tokens.numel()-1}");base_model,compiled_model=train_model(h,device,val_data);torch._dynamo.reset();timed_eval('pre-quantization post-ema',eval_val,h,device,val_data,compiled_model)
if h.prequant_ttt_enabled:
    base_model=prequant_ttt(h,device,val_data,base_model);torch._dynamo.reset();compiled_model=torch.compile(base_model,dynamic=False,fullgraph=True);timed_eval('pre-quantization post-ttt',eval_val,h,device,val_data,compiled_model)
serialize(h,base_model,Path(__file__).read_text(encoding='utf-8'))
if h.distributed:dist.barrier()
```
train_and_eval() uses Path(__file__).read_text() as the source payload for the self-extracting wrapper. Since serialize() writes a new train_gpt.py during the run, repeated invocations in the same directory can end up embedding the wrapper-in-wrapper instead of the intended full script, changing bytes_code/bytes_total and potentially the runtime behavior. Consider passing a stable source string captured before any file writes (or embedding a constant) to ensure reproducible packaging across seeds.
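A minimal sketch of the suggested fix, assuming the simplest variant — capture the source at import time as a module-level constant (the name `_SOURCE_AT_STARTUP` is hypothetical) and pass that into `serialize()` instead of re-reading `__file__`:

```python
from pathlib import Path

# Read once at import time, before serialize() has written anything to disk,
# so every seed/run packs the same full source (never a wrapper-in-wrapper).
_SOURCE_AT_STARTUP = Path(__file__).read_text(encoding='utf-8')

# later, in train_and_eval():
#   serialize(h, base_model, _SOURCE_AT_STARTUP)
```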
| "github_id": "G3sparky", | ||
| "date": "2026-04-27T12:00:00Z", | ||
| "val_bpb": 0.9946, | ||
| "bytes_total": 15997374, |
bytes_total is lower than the largest per-seed artifact size reported in the logs. train_seed999.log shows Total submission size quantized+brotli: 15997375 bytes, but the top-level bytes_total here is 15997374 (seed 42’s size). Update this field to reflect the maximum/worst-case artifact size across the submitted seeds so the metadata matches the logs.
| "bytes_total": 15997374, | |
| "bytes_total": 15997375, |
| "bytes_code": 19877, | ||
| "blurb": "Legal score-first TTT (3-epoch SGD per chunk) + PPM-D byte mixture (order-5, binary-lambda gate). Neural-only TTT BPB 1.0807, PPM-D mixture pushes to 0.9944. 8xH100 SXM, 3-seed mean 0.9946 BPB (std 0.0002). Built on SP8192 + 3-layer depth recurrence + parallel residuals + QK-Gain 5.25.", |
bytes_code and/or the code-size assumptions in the blurb don’t match the logged outputs across seeds. The logs show the generated bootstrap code varies by seed/run order (e.g., train_seed42.log: 19,877 bytes; train_seed314.log: 20,602 bytes; train_seed999.log: 21,327 bytes). If bytes_code is meant to represent the submitted script size, it should reflect the worst-case size actually produced, and ideally the build should be deterministic across repeated runs/seeds.
| "bytes_code": 19877, | |
| "blurb": "Legal score-first TTT (3-epoch SGD per chunk) + PPM-D byte mixture (order-5, binary-lambda gate). Neural-only TTT BPB 1.0807, PPM-D mixture pushes to 0.9944. 8xH100 SXM, 3-seed mean 0.9946 BPB (std 0.0002). Built on SP8192 + 3-layer depth recurrence + parallel residuals + QK-Gain 5.25.", | |
| "bytes_code": 21327, | |
| "blurb": "Legal score-first TTT (3-epoch SGD per chunk) + PPM-D byte mixture (order-5, binary-lambda gate). Neural-only TTT BPB 1.0807, PPM-D mixture pushes to 0.9944. 8xH100 SXM, 3-seed mean 0.9946 BPB (std 0.0002). Built on SP8192 + 3-layer depth recurrence + parallel residuals + QK-Gain 5.25. LZMA self-extracting bootstrap size varies by seed/run order; bytes_code reports the worst observed wrapper size (21,327 bytes).", |
```python
def serialize(h,base_model,code):
    # LZMA-compress the code wrapper — saves ~37KB (52KB raw → ~15KB compressed)
    code_raw=code.encode('utf-8');code_compressed=lzma.compress(code_raw,preset=9)
    # Self-extracting wrapper: tiny bootstrap that decompresses and exec's the real code
    bootstrap=f"import lzma,base64 as B;exec(lzma.decompress(B.b85decode({repr(base64.b85encode(code_compressed).decode())})))".encode('utf-8')
    code_bytes=len(bootstrap);log(f"Code: {len(code_raw)} raw → {len(code_compressed)} lzma → {code_bytes} bootstrap")
    if h.is_main_process:
        bootstrap_path=Path(h.quantized_model_path).parent/'train_gpt.py'
        with open(bootstrap_path,'wb')as f:f.write(bootstrap)
        log(f"Wrote bootstrap code to {bootstrap_path} ({code_bytes} bytes)")
    torch.save(base_model.state_dict(),h.model_path);model_bytes=os.path.getsize(h.model_path);log(f"Serialized model: {model_bytes} bytes");log(f"Code size: {code_bytes} bytes")
    sd_cpu={k:v.detach().cpu()for(k,v)in base_model.state_dict().items()};device=torch.device('cuda',h.local_rank);log('GPTQ:collecting Hessians from calibration data...');t0=time.perf_counter();calib_loader=ShuffledSequenceLoader(h,device);hessians=collect_hessians(base_model,calib_loader,h,device,n_calibration_batches=h.gptq_calibration_batches);log(f"GPTQ:collected {len(hessians)} Hessians in {time.perf_counter()-t0:.1f}s");quant_result,quant_meta=gptq_mixed_quantize(sd_cpu,hessians,h);quant_buf=io.BytesIO();torch.save({'w':quant_result,'m':quant_meta},quant_buf);quant_raw=quant_buf.getvalue();quant_blob=_compress(quant_raw,h.compressor);quant_file_bytes=len(quant_blob);bytes_total=quant_file_bytes+code_bytes
    if h.is_main_process:
        with open(h.quantized_model_path,'wb')as f:f.write(quant_blob)
        log(f"Serialized model quantized+{h.compressor}: {quant_file_bytes} bytes");log(f"Total submission size quantized+{h.compressor}: {bytes_total} bytes")
```
serialize() overwrites train_gpt.py on disk (the file being executed) with the bootstrap wrapper. Because train_and_eval() passes Path(__file__).read_text(...) into serialize(), subsequent runs (or later seeds in the same working directory) will read/pack the bootstrap rather than the full source, which matches the logs where code sizes change across seeds. To keep artifacts deterministic and independent of run order, avoid reading __file__ after self-overwrite (e.g., capture the full source once at startup, or write the bootstrap to a different filename instead of clobbering the entrypoint).
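One cheap guard against this failure mode — not code from this PR, just a sketch — is to assert before writing that the generated bootstrap actually embeds the compressed form of the captured source:

```python
import base64
import lzma

def verify_bootstrap(bootstrap: bytes, code_raw: bytes) -> None:
    # lzma.compress is deterministic for a fixed preset and input, so the wrapper
    # must contain exactly the b85 text of the re-compressed source. If serialize()
    # accidentally packed a previous run's wrapper instead, this check fails.
    expected = base64.b85encode(lzma.compress(code_raw, preset=9)).decode()
    assert expected.encode() in bootstrap, "bootstrap does not embed the captured source"
```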
Hi @G3sparky, congrats on the engineering — the score-first PPM-D ordering looks structurally correct (different from PR #1852's pre-quant TTT issue). One thing to flag for reviewers, though: looking at the seed logs, the PPM-D mixture appears to be evaluated on only an ~8M-token subset of validation. For the leaderboard's full-val byte-level BPB metric (Section V — "byte-level BPB via sentencepiece piece table, full val shards"), reviewers will likely want to see the mixture computed over all ~40.5M tokens / ~151M bytes. Projecting the observed PPM-D gain (~0.09 BPB on subset) onto full val would give roughly 1.08 − 0.09 ≈ 0.99 BPB — still a strong result if the gain holds, but the headline 0.9946 is only directly comparable to PR #1854 (which uses the same 8M subset). Also worth checking: seed 42's log shows the TTT eval alone at 473.7s, which leaves little of the 600s eval budget for running PPM-D over the full val set. Constructive intent — this is in the same class as PR #1795/#1850/#1835, and the legality of PPM-D byte mixture itself is still UNRULED, so the more rigorous the metric reporting, the better for the class as a whole. Happy to help compare numbers if useful.
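For concreteness, the projection above is just the following bookkeeping (a sketch; the helper name is hypothetical, the constants come from the comment itself):

```python
import math

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    # byte-level BPB: summed NLL in nats over all scored bytes, converted to bits
    return total_nll_nats / (total_bytes * math.log(2))

# back-of-envelope: neural-only ~1.08 BPB on full val, minus the ~0.09 BPB
# PPM-D gain observed on the 8M-token subset, IF the gain holds at full scale
projected_full_val_bpb = 1.08 - 0.09   # ≈ 0.99
```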
@dexhunter thank you for your feedback, working on the fixes now.
Both catches are right, thanks for taking the time. Your projection of ~0.99 BPB on full val is the right ballpark for what to expect if the gain holds at scale. Eval time is also tight, as you flagged — seed 42's TTT alone was 473.7s, which leaves ~126s for full-val PPM on top. I'm weighing two options and leaning toward the second because it preserves the TTT contribution. Would you be open to me building on your C implementation for the port, with attribution? Genuinely interested in comparing notes — the legality question on PPM-D byte mixture is still open across this class of submissions, and the more rigorous everyone's reporting is, the better for getting it ruled on.
… required; PR openai#1848 BPB risk; Day 18 plateau; Session 23

- Merged SOTA still 1.0810 (Day 18, no change since Apr 9)
- PPM-D byte mixture confirmed by dexhunter at 1.0322 (PR openai#1857, self-closed)
- SmearGate BOS bug documented: prev-token leaks at document boundaries; fix required
- PR openai#1848 (newjordan, 0.87980) flagged BPB risk: sibling PR openai#1846 closed same day
- PR openai#1858 (0.9946) only covers 8M/40.5M tokens — not leaderboard-comparable
- PR openai#1855 (codemath3000, 1.06108) and openai#1851 (aquariouseworkman, 1.06128) both clean
- PPM-D wave: PRs openai#1850, openai#1854, openai#1835 await organizer ruling
- Added Session 23 lessons to CLAUDE.md
- 3 days to deadline (Apr 30) — final GPU run window

https://claude.ai/code/session_01RmJtLYUmKNzDgDVTnWoKzU
…olar Express NS + MIN_LR + LQER)

Triage of 5 new PRs the user surfaced (1858, 1852, 1855, 1874, 1877):
- openai#1852: hard rule violation (pre-quant TTT on validation data).
- openai#1858: eval subset (8M of 40.5M tokens); reviewer caught and author admitted.
- openai#1877: broken normalization (byte PPM × token NN doesn't sum to 1 over the token alphabet), reviewer @sharpobject caught.
- openai#1855: techniques mostly legit, but apt-get install lrzip violates Issue openai#1017 Rule 3 (artifact must be self-contained).
- openai#1874: LEGITIMATE — 3-seed mean 1.06766, std 0.00076, three orthogonal training-time techniques citing prior validated PRs. If it merges, our submission threshold shifts from 1.0760 to ~1.0627.

PR openai#1874's three techniques:
1. Polar Express NS coefficients (PR openai#1344) — 5 minimax-tuned tuples replace the fixed (3.4445, -4.775, 2.0315) at MUON_BACKEND_STEPS=5.
2. MIN_LR=0.10 warmdown floor (PR openai#1787) — LR floors at 10% of max instead of decaying to 0. Already wired in our v1+; just env-var opt-in.
3. LQER asymmetric int4 rank-4 quantization correction (PR openai#1797) — SVD on top-K=3 highest-error GPTQ residuals, packed as int4 per-group-64 asymmetric. ~200-400 LOC; deferred to v4.

train_gpt_v3.py implements (1) and exposes (2):
- POLAR_EXPRESS_NS=0 default (byte-for-byte SOTA when off).
- _PE_COEFFS module-level constant + _POLAR_EXPRESS_NS flag read at import time so torch.compile sees them as constants.
- zeropower_via_newtonschulz5 branches on _POLAR_EXPRESS_NS to use per-iteration coefficients instead of the fixed tuple.
- MIN_LR was already an env var; setting MIN_LR=0.10 at runtime opts in.

Sizes: v3 raw 54,977, lzma 15,128 (+272 vs v2, +1,880 vs SOTA). Worst-seed artifact slack: ~4,888 bytes under cap. Tight but workable. AST-validated on Python 3.13 (macOS) and 3.12 (Vultr Linux).

Stacking projection (single-seed):
- Phase 0 baseline: 1.08038
- + LR=0.010 (Stage 2): 1.08021
- + Polar Express NS: 1.0787-1.0797
- + MIN_LR=0.10: 1.0777-1.0794
- + ConfTTT (PR openai#1879): 1.0772-1.0793
- + LQER (v4 work): 1.0742-1.0783
- + Phase 2 architecture: 1.0712-1.0773
- + Newton-Muon Stage E: 1.066-1.075

Path B (absorb-and-stack) recommended over Path A (race-to-merge-with-current-stack), since the current stack alone doesn't clear 1.0760. Race awareness: openai#1874, openai#1855 (lrzip-stripped), and openai#1797 are all open. Whichever merges first becomes the new SOTA and our threshold tightens.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
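Of those three techniques, the MIN_LR floor is the simplest to illustrate. A minimal sketch of a warmdown schedule floored at 10% of the peak LR, per the description above (the exact schedule shape in PR #1787 is an assumption here, not confirmed):

```python
def warmdown_lr(step: int, total_steps: int, lr_max: float, min_frac: float = 0.10) -> float:
    # Linear warmdown that floors at min_frac * lr_max instead of decaying to 0.
    return lr_max * max(min_frac, 1.0 - step / total_steps)
```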
…mpass

Fixes all Copilot + Dex review comments:
- Source captured at startup for deterministic bootstrap (20,092 bytes)
- TTT reduced to 2 epochs: eval time 350-387s (well under 600s)
- Void fraction compass logged as diagnostic (0.510 stable)
- 3-seed mean: 0.9946 BPB (std 0.0003), all under 16MB

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Anti-hijack gate: suppress PPM when NN NLL < 0.277 nats (0.40 bits).

3-seed mean: 0.9727 BPB (8M subset), gate_skip ~30.5%. Improved from 0.9946 — the gate is both defensive and beneficial.

Honest disclosures:
- PPM-D evaluated on 8M token subset (noted in val_bpb_note)
- Neural-only fallback: 1.0806 BPB (full val)
- Issue openai#1872 PPM-D class risk acknowledged explicitly
- Not claiming C2 compliance — claiming good-faith engineering

Peer reviewed: Tron (number audit), Flynn (gate verify), Lauren (sign-off).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
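A minimal sketch of what such a gate could look like per byte. The names and the probability-space mixing form are illustrative assumptions; only the 0.277-nat threshold and the binary-lambda idea come from the commit message:

```python
import math

GATE_NATS = 0.277  # ≈ 0.40 bits, per the commit message

def gated_byte_logprob(nn_logp: float, ppm_logp: float, lam: float) -> float:
    # nn_logp / ppm_logp: log-probabilities (nats) each model assigns the observed byte.
    if -nn_logp < GATE_NATS:
        # NN is already confident: suppress PPM so a bad PPM prediction can't hijack it.
        return nn_logp
    # otherwise, binary-lambda mixture in probability space
    return math.log(lam * math.exp(ppm_logp) + (1.0 - lam) * math.exp(nn_logp))
```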
Lead with neural-only 3-seed mean 1.0810 BPB (quantized+TTT). PPM-D 0.9727 moved to experimental section (pending openai#1872). Added cross-platform SDPA verification (1.0886 BPB). Per-seed numbers verified by Tron against run15 gate logs.

Peer reviewed: Tron PASS, Flynn PASS, Lauren PASS.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
val_bpb = 1.0810 (3-seed mean, std 0.00037). Artifact under 16MB. 8xH100 SXM.
Submitted while tied for #1 on the leaderboard; bigbag's #1920 has since taken the top at 1.0699. This run stands as a clean neural-only legal entry.
3-seed results (neural-only, score-first TTT)
What's new here
Base
SP8192 + 3-layer depth recurrence + parallel residuals + QK-Gain 5.25.
Credit to @clarkkev #1394, @dexhunter #1331/#1437, @abaybektursun #549, @Robby955 #1412, @msisovic #1204.
How to reproduce
Repeat with SEED=314 and SEED=999 for the 3-seed mean. Each seed runs ~600s train + ~600s eval on 8xH100 SXM. We used the default chunk size 48; bigbag #1920 has since shown a small gain at TTT_CHUNK_SIZE=32, which we'll test in a follow-up.
Compliance
C1 causal, C2 standard softmax over full vocab, C3 score-before-update, C4 single pass. All seeds under 16MB, train <600s, eval <600s.
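For reviewers checking C3/C4, the ordering reduces to a sketch like the one below. Shapes, hyperparameters, and names are illustrative; the actual implementation lives in train_gpt.py:

```python
import torch
import torch.nn.functional as F

def score_first_ttt(model, chunks, lr=1e-4, epochs=2):
    # Single sequential pass over validation (C4). Each chunk is scored with the
    # CURRENT weights before any adaptation on it (C3: score-before-update).
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    total_nll, total_tokens = 0.0, 0
    for x, y in chunks:                        # (inputs, targets), e.g. 48-token chunks
        with torch.no_grad():                  # 1) score first
            logits = model(x)
            total_nll += F.cross_entropy(logits.flatten(0, 1), y.flatten(),
                                         reduction='sum').item()
        total_tokens += y.numel()
        for _ in range(epochs):                # 2) then adapt on the already-scored chunk
            opt.zero_grad()
            F.cross_entropy(model(x).flatten(0, 1), y.flatten()).backward()
            opt.step()
    return total_nll / total_tokens            # mean NLL (nats/token)
```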
Experimental (pending #1872)
PPM-D byte mixture + anti-hijack gate hits mix_bpb 0.9727 on the 8M subset. Waiting on the C2 normalization call before claiming it. Either way, the 1.0810 above is the submission. PPM-D port from @anmarhindi #1835.
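To make the experimental pipeline concrete, here is a toy sketch of the byte-mixture's shape: a simple order-k frequency model with naive backoff standing in for real PPM-D (which instead allocates escape mass per context, method D), scored before it is updated. It assumes `nn_byte_logp` already yields a properly normalized byte-level NN log-prob — the token-to-byte conversion is exactly the normalization pitfall flagged in #1877:

```python
import math
from collections import defaultdict

class ToyByteModel:
    # Order-k byte context model with naive backoff; a stand-in for PPM-D.
    def __init__(self, order=5):
        self.order = order
        self.counts = defaultdict(lambda: [0] * 256)

    def prob(self, ctx: bytes, b: int) -> float:
        for k in range(min(self.order, len(ctx)), -1, -1):   # longest seen context wins
            row = self.counts.get(ctx[len(ctx) - k:])
            if row is not None and sum(row) > 0:
                return (row[b] + 0.5) / (sum(row) + 128.0)   # add-1/2 smoothing, sums to 1
        return 1.0 / 256.0                                   # nothing seen yet: uniform

    def update(self, ctx: bytes, b: int) -> None:
        for k in range(min(self.order, len(ctx)) + 1):
            self.counts[ctx[len(ctx) - k:]][b] += 1

def mix_bpb(data: bytes, nn_byte_logp, lam=0.5, order=5):
    # nn_byte_logp(i) -> NN log-prob (nats) of byte i; score-before-update ordering.
    model, nll = ToyByteModel(order), 0.0
    for i, b in enumerate(data):
        ctx = data[max(0, i - order):i]
        p = lam * model.prob(ctx, b) + (1 - lam) * math.exp(nn_byte_logp(i))
        nll -= math.log(p)                                   # 1) score first
        model.update(ctx, b)                                 # 2) then update counts
    return nll / (len(data) * math.log(2))                   # bits per byte
```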