Non-record: Causal N-gram Logit Blend — Legal, Bug-Free, Null Result at Scale#1642
Open
himanshudongre wants to merge 1 commit into openai:main from
Conversation
…at Scale

Rigorous negative result demonstrating that a legal causal n-gram additive-logit blend does not scale to strong models, paired with the first clean reference implementation verified against all three valerio-oai closure rulings (openai#993 hashed caches, openai#1185 full-vocab renormalization, openai#959 two-pass rescoring).

Includes:
- 8-probe automated legality harness + 4-test integration suite
- Scaling curve across 6 model configurations (2L/4L, 128d/256d, 800-4000 steps, sp1024/sp8192) showing peak BPB improvement collapses from 0.0515 (weak baseline) to 0.00018 (strongest model), well below the 0.0072 BPB record threshold
- Localized delta decomposition showing 100% of the gain comes from out-of-attention-window cache hits, and why the sp1024 → sp8192 transition erodes even that architectural floor
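For readers unfamiliar with the technique, here is a minimal sketch of a causal n-gram additive-logit blend with smoothed full-vocab renormalization. This is illustrative only; `ngram_blend_logits`, `alpha`, and `smoothing` are hypothetical names, not this PR's actual API:

```python
import math

def ngram_blend_logits(model_logits, counts, context, alpha=0.1, smoothing=1.0):
    """Additively blend causal n-gram log-probabilities into model logits.

    model_logits: list[float] over the full vocab for the next position.
    counts: dict mapping a context tuple -> dict of next-token counts,
            built only from tokens that were already scored (causal).
    context: tuple of the last n-1 token ids.
    """
    vocab = len(model_logits)
    ctx_counts = counts.get(context, {})
    # Smooth and normalize over the FULL vocabulary, so the blended
    # term is a proper log-probability for every token, not just
    # tokens seen after this context.
    total = sum(ctx_counts.values()) + smoothing * vocab
    blended = []
    for tok in range(vocab):
        p_ng = (ctx_counts.get(tok, 0) + smoothing) / total
        blended.append(model_logits[tok] + alpha * math.log(p_ng))
    return blended
```

Tokens favored by the n-gram counts receive a logit boost proportional to `alpha`; unseen tokens are penalized only as far as the smoothing floor allows.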
Force-pushed 51d0391 to 9fa635f
deborahnelson8788726 pushed a commit to deborahnelson8788726/parameter-golf that referenced this pull request on Apr 22, 2026
Added experimental techniques for Parameter Golf exploration:
- LegalNgramMixer (PR openai#1642 compliant N-gram with exact tuple keys and full-vocab distribution) — too slow in Python, timed out on Modal
- Lion optimizer for SLOT (Trinity framework technique) — gave 0.71197 on 1xH100 vs 0.72097 for AdamW; marginally better, but both worse than v3
- Phi-rank softmax in SLOT eval (Trinity golden-ratio weighting) — worse at 0.81697; a 50/50 blend hurts calibrated probabilities
- Configurable NGRAM_LEGAL, SLOT_OPTIMIZER, SLOT_PHI_RANK env vars
- Modal launch scripts for v4-v7 reproducibility
- RunPod training shell script for 8xH100 deployments

These are negative/marginal results kept for reproducibility. The clean v3 submission (PR openai#1722, 0.65802 BPB) remains our primary legal record.

Added to .gitignore: .secrets/, .obsidian/, cowork_transfer/

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
This is a rigorous negative result paired with the first clean legal reference implementation of an eval-time causal n-gram additive-logit blend — the technique every closed ngram-titled record PR in this repo was trying to implement.

Why this is useful
A localized delta decomposition tool (code/localized_delta.py) that can be applied to any Track B technique to verify where its marginal gain comes from.

Legality
This is a non-record research submission. It does not claim a leaderboard position. Code, results, and logs are provided for reproduction of the reported negative result. See README.md for the full per-condition compliance table and the 12/12 passing automated tests (8 legality + 4 integration) on both CPU and CUDA.
No network calls. N-gram state is built entirely from already-scored eval tokens per Track B semantics. Single left-to-right pass. Score-before-update discipline enforced at chunk boundaries.
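The score-before-update discipline described above can be sketched as a single left-to-right loop; `score_stream` and `score_fn` are hypothetical names for illustration, not this PR's actual interface:

```python
def score_stream(tokens, score_fn, n=3):
    """Single left-to-right pass with score-before-update discipline:
    each token is scored using n-gram state built ONLY from earlier
    tokens, and the counts are updated with that token afterwards."""
    counts = {}
    losses = []
    for i, tok in enumerate(tokens):
        ctx = tuple(tokens[max(0, i - (n - 1)):i])
        losses.append(score_fn(ctx, tok, counts))   # score first...
        counts.setdefault(ctx, {}).setdefault(tok, 0)
        counts[ctx][tok] += 1                       # ...then update
    return losses
```

The ordering inside the loop is the whole point: the current token never influences its own score, which is what makes the blend legal under Track B semantics.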
Test plan
- python3 code/legality_harness.py → 8/8 PASS (CPU + CUDA on A40)
- python3 code/test_integration.py → 4/4 PASS (CPU + CUDA on A40)
- Training logs (training_logs/)

Happy to ping the specific closed PRs (#993, #1026, #1185) to point to this as the legal reference after it lands, if that's useful.
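As an illustration of the localized delta decomposition mentioned in the summary, the idea can be sketched as follows. This is not the code/localized_delta.py implementation; `decompose_delta` and its arguments are hypothetical names, assuming a per-token BPB delta and a measured distance back to the nearest cache hit:

```python
def decompose_delta(token_deltas, hit_distances, window=1024):
    """Split per-token BPB deltas by whether the matching n-gram
    occurrence lies inside or outside the model's attention window.

    token_deltas: per-token (baseline_bpb - blended_bpb); positive = gain.
    hit_distances: distance in tokens back to the nearest cache hit,
                   or None when the context never occurred before.
    """
    in_win = out_win = 0.0
    for delta, dist in zip(token_deltas, hit_distances):
        if dist is None:
            continue  # no n-gram evidence; only the smoothing floor applies
        if dist <= window:
            in_win += delta
        else:
            out_win += delta
    return {"in_window": in_win, "out_of_window": out_win}
```

Under this decomposition, a gain concentrated in the `out_of_window` bucket means the n-gram cache is only compensating for context the model cannot attend to — exactly the architectural floor that the sp1024 → sp8192 transition erodes.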