10L + Multi-Order N-gram Backoff (0.9123 BPB)#802

Closed
Bortlesboat wants to merge 4 commits into openai:main from Bortlesboat:submission/v6-ngram-backoff

Conversation

@Bortlesboat

Record submission

val_bpb: 0.9123 (mean of 3 seeds, post int5/int6+zstd quantization roundtrip)

| Seed | val_bpb | artifact_bytes |
| --- | --- | --- |
| 42 | 0.9128 | 15,320,000 |
| 1337 | 0.9121 | 15,630,000 |
| 2024 | 0.9121 | 15,330,000 |

Architecture

  • 10 layers, d=512, GQA 8H/4KV, LeakyReLU(0.5)^2
  • Partial RoPE (16/64), LN Scale, XSA last 4, Value Residual
  • BigramHash(4096, dim=128), SmearGate, U-Net skips
  • Mixed int5 MLP / int6 attention + zstd-22
  • EMA(0.997), Muon WD=0.04, warmdown=3500
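
The squared leaky activation named in the list above can be sketched in scalar form as follows (whether the submission squares after the leak, as here, or preserves the sign is an assumption):

```python
def leaky_relu_sq(x: float, slope: float = 0.5) -> float:
    """LeakyReLU(slope)^2: leak negative inputs by `slope`, then square.

    Sketch of the MLP activation listed above; squaring after the leak
    means negative inputs contribute slope^2 * x^2 (positive) to the output.
    """
    y = x if x > 0 else slope * x
    return y * y
```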

Eval: Multi-Order N-gram Backoff + Entropy-Adaptive Alpha

  • Hashed n-gram cache, orders 2 through 7 with backoff
  • Highest matching order wins (7-gram preferred, falls back to lower)
  • Entropy-adaptive alpha: alpha = 0.05 + 0.55 * sigmoid(2 * (H - 4.0))
  • Score-first: cache updated only AFTER scoring each segment
  • 4M hash buckets per order, min_count=2
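
A minimal sketch of the eval-time cache described above. The class name, hashing scheme, and per-bucket count layout are illustrative assumptions, not the submission's actual code; only the stated parameters (orders 2-7, highest-order-wins backoff, min_count=2, the alpha formula, score-first updates) come from the PR:

```python
import math
from collections import defaultdict


class NGramBackoffCache:
    """Hashed n-gram cache with multi-order backoff (highest order wins)."""

    def __init__(self, orders=range(2, 8), buckets=4_000_000, min_count=2):
        self.orders = sorted(orders, reverse=True)  # try 7-gram first
        self.buckets = buckets
        self.min_count = min_count
        # one hashed count table per order: bucket -> {token: count}
        self.tables = {n: defaultdict(lambda: defaultdict(int))
                       for n in self.orders}

    def _bucket(self, ctx):
        return hash(ctx) % self.buckets

    def predict(self, context):
        """Distribution from the highest order with enough counts, else None."""
        for n in self.orders:  # back off: 7-gram, 6-gram, ..., 2-gram
            if len(context) < n - 1:
                continue
            counts = self.tables[n][self._bucket(tuple(context[-(n - 1):]))]
            total = sum(counts.values())
            if total >= self.min_count:
                return {t: c / total for t, c in counts.items()}
        return None

    def update(self, context, token):
        """Called only AFTER scoring a segment (score-first discipline)."""
        for n in self.orders:
            if len(context) >= n - 1:
                self.tables[n][self._bucket(tuple(context[-(n - 1):]))][token] += 1


def adaptive_alpha(entropy_bits: float) -> float:
    # alpha = 0.05 + 0.55 * sigmoid(2 * (H - 4.0)): weight the cache more
    # heavily when the model's predictive entropy H is high (uncertain).
    return 0.05 + 0.55 / (1.0 + math.exp(-2.0 * (entropy_bits - 4.0)))
```

The mixed log-prob at each position would then be `(1 - alpha) * model + alpha * cache`, falling back to the model alone when `predict` returns None.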

Timing (8xH100 SXM)

  • Training: 600s (~6020 steps at 99ms/step)
  • Eval: ~163s (sliding window stride=64, batch_seqs=64)

Based on

  • Explores stacking eval-time techniques (neural cache, LoRA TTT) and quantization-aware training on top of the openai#1 recipe. QAT has an export mismatch bug resulting in a high quantization penalty; submitted as non-record to document the approach for iteration.
  • Non-record submission. 10 layers, d=512, GQA 8H/4KV, mixed int5/int6 quantization + zstd-22. BigramHash(4096, dim=128), SmearGate, SWA(0.4). Mean of 3 seeds: 1.1507 +/- 0.0006 BPB. All artifacts under 16MB.
  • 10L d=512, GQA 8H/4KV, LeakyReLU(0.5)^2, Partial RoPE, LN Scale, XSA last 4, Value Residual, EMA(0.997). Mixed int5/int6 + zstd-22. Eval: multi-order hashed n-gram backoff (orders 2-7) with entropy-adaptive alpha. Mean of 3 seeds: 0.9123 +/- 0.0003 BPB.
  • Renamed to reflect the actual technique (n-gram backoff + entropy alpha). Removed old 1.1507 BPB seed logs. Added an explicit compliance/legality section per competition conventions.
bigbag pushed a commit to bigbag/parameter-golf that referenced this pull request Mar 26, 2026
Single change from PR openai#802: MATRIX_LR=0.03 (was 0.02).
Discovered through systematic screening (74 experiments, steps 10-12).

- 10L, 512d, GQA 8/4, LeakyReLU(0.5)², BigramHash 4096
- Multi-order n-gram backoff eval cache (orders 2-7)
- Entropy-adaptive alpha mixing (score-first, legal)
- 8xH100 SXM, 600s training, 138s eval
- Artifact: 15.32 MB

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Bortlesboat
Author

Superseded by PR #876 (0.5863 BPB) and PR #912 (0.3461 BPB with PPM full-rescore).

@Bortlesboat Bortlesboat reopened this Mar 27, 2026
@Bortlesboat
Author

Closing - uses n-gram backoff which is out of scope for this track.

@Bortlesboat Bortlesboat closed this Apr 4, 2026
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request Apr 8, 2026
…s 2007)

THE biggest legal technique gap after LEGAL_TTT. Top 30 legal PRs in COMPETITION_SCOPE.md
all use multi-order n-gram backoff (openai#788/openai#802/openai#828/openai#761 = 0.91-0.96 BPB).

Implementation: at each position, use the HIGHEST-CONFIDENCE n-gram order ONLY:
- if peak(4-gram[h]) > T4: use 4-gram with weight 1.0
- elif peak(3-gram[h]) > T3: use 3-gram with weight α=0.4 (Brants 2007)
- else: use bigram with weight α²=0.16
The 'peak' is the max log-prob across the vocab; a concentrated distribution indicates confident counts. Hash-collision noise in the lower orders is stripped by using only the most-confident order.

Marker: NGRAM_BACKOFF_MARKER. Env: USE_NGRAM_BACKOFF=1, NGRAM_BACKOFF_THRESH4=1.0,
NGRAM_BACKOFF_THRESH3=1.0, NGRAM_BACKOFF_ALPHA=0.4. Composes with NGRAM_GATE.

Smoke test in /tmp passes: marker present in patched file, syntax-valid Python.
EXPECTED_MARKERS now 46 (was 45).

Queued L09_ngram_backoff_S2_seed42/seed1337 on Pod C for n=2 cheap-pod validation.