Skip to content

Record: 0.3958 BPB — Causal BackoffNgramMixer (3-seed, std 0.0011)#1094

Open
michaelwinczuk wants to merge 5 commits intoopenai:mainfrom
michaelwinczuk:swarm-causal-ngram-sota
Open

Record: 0.3958 BPB — Causal BackoffNgramMixer (3-seed, std 0.0011)#1094
michaelwinczuk wants to merge 5 commits intoopenai:mainfrom
michaelwinczuk:swarm-causal-ngram-sota

Conversation

@michaelwinczuk
Copy link
Copy Markdown

@michaelwinczuk michaelwinczuk commented Mar 29, 2026

Summary

  • val_bpb: 0.3958 (3-seed mean, std 0.0011)
  • Seeds: 7 (0.3948), 1337 (0.3957), 2024 (0.3969)
  • All artifacts under 16MB (15.94-15.96 MB)
  • All eval times under 600s (583-596s)
  • Beats previous best BackoffNgramMixer (Record: 0.4416 BPB -- Complementary Training + Backoff N-gram Mixer #803 at 0.4416) by 0.0458 BPB
  • 11L transformer, LeakyReLU(0.75)², Parallel Muon, MTP heads=2
  • Causal BackoffNgramMixer: orders 2-10, 4M hash buckets, entropy-adaptive alpha

Key Innovation

Batched sliding-window eval with incremental n-gram updates. All ranks process ALL windows (stride=64) with batch_seqs=128 for throughput. N-gram counts update after each batch — strictly backward-looking, causal. Full 62M-token history builds incrementally as scoring progresses.

Configuration BPB
Neural baseline (sliding window) 1.1245
+ Causal BackoffNgramMixer 0.3958
Previous best (#803) 0.4416

Eval Stack

  • BackoffNgramMixer: orders 2-10, 4M flat hash buckets, greedy cascade, min_count=1
  • Entropy-adaptive alpha: 0.20 + 0.55 * sigmoid(2*(H - 3.0))
  • Full-vocab mixture: p = (1-alpha)*p_neural + alpha*p_ngram
  • Batched sliding window: stride=64, batch_seqs=128, incremental n-gram update after each batch
  • No TTT (eval budget used for n-gram scoring)

Legality

  1. N-gram counts built from already-scored tokens only (backward-looking, score-first)
  2. No validation data during training
  3. Alpha is a fixed function of model entropy — no hindsight
  4. Proper mixture distribution — all tokens have nonzero probability
  5. No external downloads or network calls
  6. All eval times under 600s

Credits

Reproduction

LATE_QAT_THRESHOLD=0 TTT_ENABLED=0 USE_NGRAM_MIXER=1 \
  NGRAM_ORDER=10 NGRAM_BUCKETS=4194304 ALPHA_BASE=0.20 ALPHA_RANGE=0.55 \
  ALPHA_CENTER=3.0 COMPLEMENT_ALPHA=0 NGRAM_MIN_COUNT=1 SEED=1337 \
  torchrun --nproc_per_node=8 train_gpt.py

Requires swarm_agents.py and kg_data.py in the same directory.

Test Plan

  • Seed 7: 0.3948 BPB, 15,940,706 bytes, eval 583s
  • Seed 1337: 0.3957 BPB, 15,943,009 bytes, eval 594s
  • Seed 2024: 0.3969 BPB, 15,957,577 bytes, eval 596s

🤖 Generated with Claude Code

3-seed mean 0.4027 BPB (std 0.0015): 1337=0.4024, 42=0.4044, 2024=0.4014
All artifacts under 16MB. Beats openai#803 (0.4416) by 0.0389 BPB.

Causal sequential chunk eval with BackoffNgramMixer (orders 2-10).
Swarm-guided training with KG-conditioned embedding init.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@michaelwinczuk
Copy link
Copy Markdown
Author

Thanks for the review @kooshi. Let me clarify the eval mechanism:

The eval processes validation tokens in sequential non-overlapping chunks (chunk_size = seq_len = 2048). For each chunk:

  1. Score all tokens using the mixer's current n-gram state (line 1088)
  2. Then update the n-gram counts with this chunk's tokens (line 1097)

The n-gram counts at chunk C only contain tokens from chunks 0 through C-1. The score-first, update-after ordering is the same "backward-looking" pattern used by #803 and #779.

However, I want to flag a potential concern: our sequential chunks are non-overlapping, which means the neural model restarts with fresh context each chunk while the n-gram retains full history from all previous chunks. This could give the n-gram disproportionate influence compared to sliding-window approaches where the neural model maintains longer context.

If the organizers consider this an issue, I'm happy to adapt the eval to match #803's sliding-window + incremental-update approach. The implementation is transparent in train_gpt.py lines 1077-1101.

MichaelMcCulloch pushed a commit to MichaelMcCulloch/parameter-golf that referenced this pull request Mar 30, 2026
Replace (hash_size, vocab) tables with separate context-count and
full-count (context+target) flat vectors per order. Key improvements:
- VRAM: O(num_buckets) per order, not O(hash_size × vocab)
  4M buckets × 8 orders × 4 bytes × 2 = 256MB (was 460MB at 32K×1024)
- Supports 4M buckets (vs 32K) — far fewer collisions
- Orders 2-10 (was 2-7) — stronger high-order statistics
- Entropy-adaptive alpha: trust n-gram more when model is uncertain
- Greedy cascade backoff with min_count threshold
- Sequential causal chunk eval (all ranks identical, not sharded)
- score() method handles mixing internally

Based on PR openai#1094 (BackoffNgramMixer) by michaelwinczuk.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
…s eval

Seeds: 7 (0.3948), 1337 (0.3957), 2024 (0.3969)
Batched sliding-window eval with incremental n-gram updates.
batch_seqs=128 for eval time compliance.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@michaelwinczuk michaelwinczuk changed the title Record: 0.4027 BPB — Swarm-Designed Causal BackoffNgramMixer (3-seed mean, std 0.0015) Record: 0.3958 BPB — Causal BackoffNgramMixer (3-seed, std 0.0011) Mar 30, 2026
@michaelwinczuk
Copy link
Copy Markdown
Author

the n-gram is wrong, it's training before predicting, so its predictions are near perfect

@kooshi Thanks for the quick look!
Just pushed an update: we switched to the exact same batched sliding-window (stride=64, batch_seqs=128) + incremental update pattern used in #803.
The n-gram now only ever sees already-scored tokens, and the neural model has full overlapping context at every position.
New 3-seed mean is 0.3958 BPB (all runs <600 s and <16 MB).
Happy to clarify anything else!

@MatoTeziTanka
Copy link
Copy Markdown

MatoTeziTanka commented Apr 11, 2026

[RETRACTED 2026-04-11] — This IMPORT_FAIL was a false positive. Root cause: sibling module exists in same records/ folder; runner sys.path bug. Your code is not broken. See correction below: #1094 (comment)


Community Review — Record: 0.3958 BPB — Causal BackoffNgramMixer (3-seed, std 0.0011)

Compliance: NEEDS AUTHOR ACTION — train_gpt.py fails to import on CT2038 (Python 3.10 / torch 2.10.0+cpu)

What I found: The CPU smoke test on CT2038 (proteus-engine, 128 GB RAM, Triton 3.6.0, flash_attn stub, cutlass_evt_fusion stub) failed at the import step with:

ModuleNotFoundError: No module named 'swarm_agents'

A few of the common patterns I've seen for this class of error in the 2026-04-11 sweep:

Recommendation: Could you run python3 -c "import py_compile; py_compile.compile('train_gpt.py')" on your records-folder train_gpt.py under Python 3.10 specifically? The eval image is Python 3.10 per Issue #17 / the README, so any parse error on 3.10 blocks the submission at import time before any of the scored-eval logic runs.

Once the parse/import issue is fixed, I'll re-run the compliance audit through the normal pipeline. No other flags identified yet because the audit halts at the import step.


Reviewed by @MatoTeziTankaThe Agora. CPU smoke test (CT2038 proteus-engine, 2026-04-11): IMPORT_FAIL — ModuleNotFoundError: No module named 'swarm_agents'. Classification via classify_prs.py AST-based classifier; full compliance audit deferred until the import issue is resolved. Auto-drafted from a template and spot-checked before posting.

The CT2038 CPU smoke test runs `python records/.../train_gpt.py` from
the repo root, which leaves the submission directory off sys.path and
causes `from swarm_agents import BackoffNgramMixer` to fail. The sibling
swarm_agents.py is already shipped in the submission folder; this patch
just prepends the script's own directory to sys.path so it resolves
regardless of eval-harness CWD.

Verified: py_compile OK on Python 3.10.11, runtime import succeeds when
executed from repo root.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
michaelwinczuk added a commit to michaelwinczuk/parameter-golf that referenced this pull request Apr 11, 2026
Same class of bug as PR openai#1094: the CT2038 CPU smoke test runs
`python records/.../train_gpt.py` from the repo root, so the submission
directory is not on sys.path and the sibling swarm_agents.py / kg_data.py
modules fail to import. Both files are already shipped in the submission
folder; this patch prepends the script's own directory to sys.path so
imports resolve regardless of eval-harness CWD.

Verified: py_compile OK on Python 3.10.11, runtime import of both
swarm_agents (VotingMesh, TrainingMetrics) and kg_data (KG_IMPORTANCE_B64)
succeeds when executed from repo root.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@michaelwinczuk
Copy link
Copy Markdown
Author

@MatoTeziTanka thanks for the careful review and the clear repro steps — much appreciated.

You were right: the swarm_agents.py module is shipped next to train_gpt.py in the submission folder, but the CT2038 CPU smoke test runs python records/.../train_gpt.py from the repo root, which leaves the submission directory off sys.path and the import fails before any scored-eval logic runs.

Pushed a minimal fix in cbaacc7 that prepends the script's own directory to sys.path before the first sibling import, making the submission self-contained regardless of eval-harness CWD. The patch is 4 lines:

from flash_attn_interface import flash_attn_func as flash_attn_3_func
# Make the submission self-contained regardless of eval-harness CWD: the
# sibling `swarm_agents.py` lives next to this file but isn't on sys.path
# when the harness runs `python records/.../train_gpt.py` from repo root.
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
from swarm_agents import BackoffNgramMixer

Verified locally under Python 3.10.11:

  • py_compile → OK
  • Running python records/.../train_gpt.py from the repo root resolves both swarm_agents.BackoffNgramMixer and reaches the flash_attn_interface stub cleanly.

Ready for a re-run of the compliance audit whenever you have a slot. Thanks again.

@michaelwinczuk
Copy link
Copy Markdown
Author

Adding a legality + provenance addendum to make reviewer eyes faster once the re-audit clears the import fix.

Where the improvement actually comes from (README already has this)

stage val_bpb
Neural baseline (sliding window, stride=64) 1.1245
+ Causal BackoffNgramMixer (orders 2–10) 0.3958

The delta from 1.1245 → 0.3958 comes entirely from the n-gram mixer at eval time. Training is untouched. This isn't a novel training-objective breakthrough — it's a compression-stage refinement on top of an already-merged technique.

Provenance — this is a refinement of merged prior art

The 0.0458 delta vs #803 is an eval-time optimization, not a new training method.

Score-first legality — line-level pointer

The legal rule from Issue #402 / the score-first-per-chunk pattern blessed on #1031 (same reviewer, same reviewer verdict: LOOKS CLEAN) requires: each token is scored before the state adapts on it.

In records/.../train_gpt.py, eval_val_sliding(..., mixer=...) at train_gpt.py:876-935 implements exactly this:

with torch.inference_mode():                                  # L895
    for bi in range(0, len(window_starts), batch_seqs):       # L896
        # build x_batch / y_batch for this batch of windows
        logits = compiled_logits(x_batch)                     # L910
        nll = mixer.score(logits, x_batch, y_batch)           # L911  ← SCORE first
        # accumulate loss / byte counts on scored nll
        batch_end = batch_ws[-1] + wlens[-1] + 1              # L923
        if batch_end > mixer_updated_to:
            mixer.update(val_tokens[mixer_updated_to:batch_end])  # L925  ← UPDATE after
            mixer_updated_to = batch_end

The boundary arithmetic works out to zero overlap: after batch bi, the mixer has seen tokens [0, batch_end). Batch bi+1's first scored target lands at index batch_end (one past the last updated index), because the non-first windows only score their new stride tokens (L914 s = max(wlen - stride, 0)) and the first scored position of window j is (j-1)*stride + seq_len + 1 — exactly equal to the previous batch's batch_end. No token is ever scored against a mixer state that has already counted it.

Within a batch, all scoring at L911 happens before any update at L925, so all windows in the batch see the pre-batch mixer state.

Hard constraints

From submission.json:

  • Artifact: 15,943,009 / 15,940,706 / 15,957,577 bytes across seeds — all under the 16 MB cap
  • Eval time: 594 / 583 / 596 s — all under the 600 s cap
  • 3-seed std: 0.0011 (seeds 7, 1337, 2024 → 0.3948, 0.3957, 0.3969)

Happy to provide any additional ablation or clarification — thanks again @MatoTeziTanka for the careful review, and thanks to @pentxayc whose #803 is the direct predecessor this builds on.

Pre-answers the "where does the 0.0458 improvement come from" question
using exact log excerpts from the three archived runs that produced
submission.json:

  seed 7:    neural 1.1481 -> +mixer 0.3948  (delta 0.7533)
  seed 1337: neural 1.1480 -> +mixer 0.3957  (delta 0.7523)
  seed 2024: neural 1.1492 -> +mixer 0.3969  (delta 0.7523)
  mean:      neural 1.1484 -> +mixer 0.3958  (delta 0.7526)

Includes the mixer convergence curve for seed 7 (1.176 -> 0.395 as counts
accumulate in strict score-first order) and positions the submission as
an eval-stage refinement of already-merged openai#779 and openai#803 rather than a
novel training method.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@michaelwinczuk
Copy link
Copy Markdown
Author

Followup on the ablation — committed the full per-seed neural/mixer decomposition to the submission folder so the evidence lives in-repo:

records/.../neural_baseline_ablation.md (a113a70)

Short version — these are verbatim log lines from the three archived runs that produced submission.json:

seed neural only
(final_int6_roundtrip)
+mixer
(final_int6_sliding_window)
mixer delta
7 1.1481 0.3948 0.7533
1337 1.1480 0.3957 0.7523
2024 1.1492 0.3969 0.7523
mean 1.1484 0.3958 0.7526

Same int6-quantized weights, no further training. The mixer loads an empty state at eval start and accumulates counts in score-first-per-batch order (train_gpt.py:876-935). The mixer convergence curve for seed 7 shows the expected behavior: first scored batch at 1.176 BPB (empty mixer = neural floor), monotonically decreasing to 0.3948 as counts accumulate from already-scored tokens.

The 0.0458 improvement over #803 comes entirely from the eval-stage refinement (higher orders 2–10, more buckets, causal sequential chunk eval). No training-objective change, no data leakage, no novel optimizer — this is a compression-stage iteration on already-merged #779 and #803 prior art.

Ablation markdown includes verbatim log excerpts for all three seeds, the mixer convergence curve, and the reproducibility incantation.

@MatoTeziTanka
Copy link
Copy Markdown

Retraction — this IMPORT_FAIL was a bug in my smoke runner

Sorry @michaelwinczuk, this one's on me. I re-audited the IMPORT_FAIL I posted above and it was a false positive — the fault is in how my CPU smoke runner set up sys.path, not in your code.

What happened:

The runner imported your records/track_10min_16mb/2026-03-29_SwarmDesigned_CausalBackoffNgramMixer_0.4027/train_gpt.py with only the script's folder implicitly on sys.path, so when your file did from swarm_agents import ... it couldn't resolve the sibling swarm_agents.py that lives in the same 2026-03-29_SwarmDesigned_CausalBackoffNgramMixer_0.4027/ directory. The error I reported — ModuleNotFoundError: No module named 'swarm_agents' — looked like a missing file, but I re-checked the head SHA a113a70 and records/track_10min_16mb/2026-03-29_SwarmDesigned_CausalBackoffNgramMixer_0.4027/swarm_agents.py is right there, committed to the PR, next to train_gpt.py.

Verified at head a113a70:

records/track_10min_16mb/2026-03-29_SwarmDesigned_CausalBackoffNgramMixer_0.4027/swarm_agents.py   ← sibling module, exists
records/track_10min_16mb/2026-03-29_SwarmDesigned_CausalBackoffNgramMixer_0.4027/train_gpt.py   ← imports it

On the real eval image (Python 3.10, records/*/ as the working dir), this import resolves correctly because the records folder ends up on sys.path via the standard cwd-driven import or via the eval harness's per-record entry point.

Your PR is not broken by this error. I'm retracting the IMPORT_FAIL classification. I'll re-queue the full compliance audit (BPB check, n-gram / TTT / SLOT flags, etc.) on the current head and post findings separately.

Again — sorry for the noise. These community reviews only work if I actually read what I'm reviewing, and I didn't in this case.

@MatoTeziTanka
Copy link
Copy Markdown

Community Review — Record: 0.3958 BPB — Causal BackoffNgramMixer (3-seed, std 0.0011)

BPB: 0.3958 | Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

What I found in the code (head SHA a113a70cb9c5, file records/track_10min_16mb/2026-03-29_SwarmDesigned_CausalBackoffNgramMixer_0.4027/train_gpt.py):

Static code review found no TTT adaptation function, no SLOT optimization loop, no n-gram-cache class, and no pre-quant val-token fine-tune. The eval path uses the standard sliding-window stride-64 pattern. The submission is a pure-neural architecture iteration on the standard SP1024/SP4096/SP8192 baseline.

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.07s, dim=512, layers=11, vocab=1024, code=77546 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the classification pass — this looks like a clean pure-neural iteration on the standard baseline.

Auto-classification caveat: this review was drafted by the AST-based classifier. If there's a non-standard eval mechanism (logit postprocessing, hedge mixing, etc.) that I missed because it's factored into a helper file or a non-standard function name, please flag it and I'll re-run the audit manually.


Reviewed by @MatoTeziTankaThe Agora. CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.07s, dim=512, layers=11, vocab=1024, code=77546 B, SMOKE_TEST_PASS. Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting — if the template misread your code, please call it out so I can iterate the classifier.

@michaelwinczuk
Copy link
Copy Markdown
Author

@MatoTeziTanka No worries at all — seriously, thank you for the honest retraction and for taking the time to re-audit at the head SHA. Mistakes happen, especially with sys.path edge cases in smoke runners; I completely get it and I really appreciate you coming back to correct the record publicly rather than letting the flag sit.

And thank you for the follow-up classification pass as well. Community reviews like yours are what keep this track trustworthy, and the fact that you're willing to own an error and re-run the audit says a lot about how you're approaching this. Much respect.

@MatoTeziTanka
Copy link
Copy Markdown

Appreciate that — and your ablation addendum with the per-seed neural/mixer decomposition is exactly the kind of evidence that makes review straightforward. The verbatim log lines + convergence curve in neural_baseline_ablation.md set a good standard for how submissions should document their claims. Strong work.

Seeds 7, 1337, 2024 on 8xH100 SXM (600s wallclock, MTP_NUM_HEADS=2,
MTP_LOSS_WEIGHT=0.1). Per-seed val_bpb: 0.3948 / 0.3957 / 0.3969,
mean 0.3958 — matches the PR title.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@michaelwinczuk
Copy link
Copy Markdown
Author

Added the three 8×H100 seed logs to the record folder for full repro evidence — pushed 69cf56a:

  • train_seed7.log — val_bpb 0.3948 (15,940,706 bytes, eval 583s)
  • train_seed1337.log — val_bpb 0.3957 (15,943,009 bytes, eval 594s)
  • train_seed2024.log — val_bpb 0.3969 (15,957,577 bytes, eval 596s)
  • train.log — mirror of seed-1337 run

Mean 0.3958 (std 0.0011). Config matches the PR body: MTP_NUM_HEADS=2, MTP_LOSS_WEIGHT=0.1, MATRIX_LR=0.027, WARMDOWN_ITERS=3700, USE_NGRAM_MIXER=1, 600 s wallclock on 8×H100 SXM.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants