Non-record: Causal N-gram Logit Blend — Legal, Bug-Free, Null Result at Scale#1642
Open
himanshudongre wants to merge 1 commit into openai:main from
Conversation
…at Scale

Rigorous negative result demonstrating that a legal causal n-gram additive-logit blend does not scale to strong models, paired with the first clean reference implementation verified against all three valerio-oai closure rulings (openai#993 hashed caches, openai#1185 full-vocab renormalization, openai#959 two-pass rescoring).

Includes:
- 8-probe automated legality harness + 4-test integration suite
- Scaling curve across 6 model configurations (2L/4L, 128d/256d, 800-4000 steps, sp1024/sp8192) showing peak BPB improvement collapses from 0.0515 (weak baseline) to 0.00018 (strongest model), well below the 0.0072 BPB record threshold
- Localized delta decomposition showing 100% of the gain comes from out-of-attention-window cache hits, and why the sp1024 → sp8192 transition erodes even that architectural floor
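For readers unfamiliar with the technique, here is a minimal sketch of a causal n-gram additive-logit blend with smoothed full-vocab renormalization. This is illustrative only; `ngram_blend_logits`, `alpha`, and `smoothing` are hypothetical names, not this PR's actual API:

```python
import math

def ngram_blend_logits(model_logits, counts, context, alpha=0.1, smoothing=1.0):
    """Additively blend causal n-gram log-probabilities into model logits.

    model_logits: list[float] over the full vocab for the next position.
    counts: dict mapping a context tuple -> dict of next-token counts,
            built only from tokens that were already scored (causal).
    context: tuple of the last n-1 token ids.
    """
    vocab = len(model_logits)
    ctx_counts = counts.get(context, {})
    # Smooth and normalize over the FULL vocabulary, so the blended
    # term is a proper log-probability for every token, not just
    # tokens seen after this context.
    total = sum(ctx_counts.values()) + smoothing * vocab
    blended = []
    for tok in range(vocab):
        p_ng = (ctx_counts.get(tok, 0) + smoothing) / total
        blended.append(model_logits[tok] + alpha * math.log(p_ng))
    return blended
```

Tokens favored by the n-gram counts receive a logit boost proportional to `alpha`; unseen tokens are penalized only as far as the smoothing floor allows.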
Force-pushed 51d0391 to 9fa635f
deborahnelson8788726 pushed a commit to deborahnelson8788726/parameter-golf that referenced this pull request on Apr 22, 2026
Added experimental techniques for Parameter Golf exploration:
- LegalNgramMixer (PR openai#1642 compliant N-gram with exact tuple keys and full-vocab distribution) — too slow in Python, timed out on Modal
- Lion optimizer for SLOT (Trinity framework technique) — gave 0.71197 on 1xH100 vs 0.72097 for AdamW; marginally better, but both worse than v3
- Phi-rank softmax in SLOT eval (Trinity golden-ratio weighting) — worse at 0.81697; a 50/50 blend hurts calibrated probabilities
- Configurable NGRAM_LEGAL, SLOT_OPTIMIZER, SLOT_PHI_RANK env vars
- Modal launch scripts for v4-v7 reproducibility
- RunPod training shell script for 8xH100 deployments

These are negative/marginal results kept for reproducibility. The clean v3 submission (PR openai#1722, 0.65802 BPB) remains our primary legal record.

Added to .gitignore: .secrets/, .obsidian/, cowork_transfer/

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
This is a rigorous negative result paired with the first clean legal reference implementation of an eval-time causal n-gram additive-logit blend — the technique every closed ngram-titled record PR in this repo was trying to implement.

Why this is useful
A localized delta decomposition tool (code/localized_delta.py) that can be applied to any Track B technique to verify where its marginal gain comes from.

Legality
This is a non-record research submission. It does not claim a leaderboard position. Code, results, and logs are provided for reproduction of the reported negative result. See README.md for the full per-condition compliance table and the 12/12 passing automated tests (8 legality + 4 integration) on both CPU and CUDA.
No network calls. N-gram state is built entirely from already-scored eval tokens per Track B semantics. Single left-to-right pass. Score-before-update discipline enforced at chunk boundaries.
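The score-before-update discipline described above can be sketched as a single left-to-right loop; `score_stream` and `score_fn` are hypothetical names for illustration, not this PR's actual interface:

```python
def score_stream(tokens, score_fn, n=3):
    """Single left-to-right pass with score-before-update discipline:
    each token is scored using n-gram state built ONLY from earlier
    tokens, and the counts are updated with that token afterwards."""
    counts = {}
    losses = []
    for i, tok in enumerate(tokens):
        ctx = tuple(tokens[max(0, i - (n - 1)):i])
        losses.append(score_fn(ctx, tok, counts))   # score first...
        counts.setdefault(ctx, {}).setdefault(tok, 0)
        counts[ctx][tok] += 1                       # ...then update
    return losses
```

The ordering inside the loop is the whole point: the current token never influences its own score, which is what makes the blend legal under Track B semantics.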
Test plan
- python3 code/legality_harness.py → 8/8 PASS (CPU + CUDA on A40)
- python3 code/test_integration.py → 4/4 PASS (CPU + CUDA on A40)
- Training logs (training_logs/)

Happy to ping the specific closed PRs (#993, #1026, #1185) to point to this as the legal reference after it lands, if that's useful.
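As an illustration of the localized delta decomposition mentioned in the summary, the idea can be sketched as follows. This is not the code/localized_delta.py implementation; `decompose_delta` and its arguments are hypothetical names, assuming a per-token BPB delta and a measured distance back to the nearest cache hit:

```python
def decompose_delta(token_deltas, hit_distances, window=1024):
    """Split per-token BPB deltas by whether the matching n-gram
    occurrence lies inside or outside the model's attention window.

    token_deltas: per-token (baseline_bpb - blended_bpb); positive = gain.
    hit_distances: distance in tokens back to the nearest cache hit,
                   or None when the context never occurred before.
    """
    in_win = out_win = 0.0
    for delta, dist in zip(token_deltas, hit_distances):
        if dist is None:
            continue  # no n-gram evidence; only the smoothing floor applies
        if dist <= window:
            in_win += delta
        else:
            out_win += delta
    return {"in_window": in_win, "out_of_window": out_win}
```

Under this decomposition, a gain concentrated in the `out_of_window` bucket means the n-gram cache is only compensating for context the model cannot attend to — exactly the architectural floor that the sp1024 → sp8192 transition erodes.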