Record: SP8192 + BigramHash d=32 + Path A v3 passthrough quantization — val_bpb 1.07882 (3-seed mean)#1716

Open
himanshudongre wants to merge 1 commit into openai:main from himanshudongre:record/2026-04-18-sp8192-bigram32-pathav3

Conversation

@himanshudongre

Summary

SP8192 stack with two tuning changes on top of the merged 2026-04-09 SP8192 SOTA (@bigbag, 1.08100):

  1. BigramHashEmbedding d = 32 (vs. the more common d = 48 / d = 64)
  2. Path A v3 aggressive passthrough quantization: per-tensor int8 for five control-tensor families (attn_scale, mlp_scale, resid_mix, skip_gates, skip_weights) and per-row int8 for three small 2-D matrices (bigram.proj, attn_gate_proj, smear_gate.weight) that are otherwise left as fp16 passthrough. Combined with an LZMA self-extracting code wrapper, the full int8 token-embedding + int6 matrix recipe now fits ≤ 16 MB.
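
A minimal sketch of what the per-tensor vs. per-row int8 split above means, assuming a plain symmetric absmax scheme; the actual Path A v3 code may differ in details such as clipping and how scales are stored:

```python
# Illustrative sketch only: symmetric absmax int8 quantization.
import torch

def quantize_per_tensor_int8(t: torch.Tensor):
    """One scale for the whole tensor (the 1-D control-tensor families)."""
    scale = t.abs().max().clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(t / scale), -127, 127).to(torch.int8)
    return q, scale

def quantize_per_row_int8(w: torch.Tensor):
    """One scale per output row (the three small 2-D matrices)."""
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Reconstruct an fp16 tensor at load time."""
    return q.to(torch.float16) * scale.to(torch.float16)
```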

Results

Seed | Post-EMA | Quant roundtrip | Sliding | TTT | Artifact (B)
---- | -------- | --------------- | ------- | --- | ------------
42   | 1.08584  | 1.09678         | 1.08015 | 1.07887 | 15,991,203
314  | 1.08580  | 1.09679         | 1.08024 | 1.07893 | 15,994,170
999  | 1.08561  | 1.09662         | 1.07998 | 1.07866 | 15,996,103
mean | 1.08575  | 1.09673         | 1.08012 | 1.07882 | 15,993,825
std  |          |                 |         | 0.000143 |

Merged SOTA (2026-04-09 @bigbag): 1.08100 bpb. Delta: −0.00218 bpb = −0.00564 nats/token.

Statistical significance (one-sided z-test vs 0.005-nat threshold)

  • Our mean bpb: 1.078817
  • Threshold bpb to clear (at our bpb/val_loss ratio of 0.38716): 1.079064
  • SEM: 0.0000826 bpb
  • Z = −2.998, p = 0.00136 (p < 0.01 required) ✅
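
The z and p above can be reproduced (up to last-digit rounding of the published per-seed values) with a few lines:

```python
# Significance arithmetic from the rounded per-seed TTT numbers.
import math
from statistics import mean, stdev

ttt = [1.07887, 1.07893, 1.07866]        # per-seed TTT val_bpb
threshold = 1.079064                     # bpb equivalent of SOTA minus 0.005 nats/token
m = mean(ttt)
sem = stdev(ttt) / math.sqrt(len(ttt))
z = (m - threshold) / sem
p = 0.5 * math.erfc(-z / math.sqrt(2))   # one-sided P(Z <= z)
print(f"mean={m:.6f} sem={sem:.7f} z={z:.2f} p={p:.5f}")
```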

Compliance (per Issue #1017)

  • C1 Causality ✅ strictly causal forward; sliding-window eval never references future tokens
  • C2 Normalized ✅ standard softmax over full 8192 vocab
  • C3 Score-before-update ✅ each TTT chunk scored under inference_mode() before any parameter update (pattern sketched below)
  • C4 Single pass ✅ each val token scored exactly once

Additional: no SLOT, no pre-quant TTT on val, no ETLB, no n-gram cache, seeds match @bigbag's convention (42, 314, 999), artifact < 16 MB on all seeds, training ≤ 588 s, eval ≤ 481 s.
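
For reviewers, a minimal sketch of the C3/C4 score-before-update pattern claimed above; the model and optimizer APIs are placeholders, not the submission's actual eval loop:

```python
# Hedged sketch: score each chunk strictly before adapting on it.
import torch

def ttt_eval(model, chunks, ttt_opt):
    total_nats, total_tokens = 0.0, 0
    for chunk in chunks:                        # C4: each val token lives in exactly one chunk
        with torch.inference_mode():            # C3: score strictly before any update
            loss = model(chunk)
        total_nats += loss.item() * chunk.numel()
        total_tokens += chunk.numel()
        ttt_opt.zero_grad(set_to_none=True)     # adapt only on the already-scored chunk
        model(chunk).backward()
        ttt_opt.step()
    return total_nats / total_tokens            # nats/token; convert via bytes/token and ln 2 for bpb
```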

Files

  • README.md — full writeup with tables, compliance, statistical evidence, reproduction
  • submission.json — structured metadata (per-seed results, compliance, attribution)
  • train_gpt.py — 18,097-byte LZMA self-extracting submission (unpacking scheme sketched below)
  • train_gpt_stacked_v2_fixed.py — unpacked reference source (53,514 B) for reviewability
  • train_seed{42,314,999}.log — full training logs
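
The self-extracting layout of train_gpt.py follows the usual pack/unpack pattern; a hedged sketch with a tiny stand-in payload (the submitted blob and its exact packing parameters may differ):

```python
# Hedged sketch of an LZMA self-extracting wrapper.
import base64, lzma

# Packing side (run once, offline); a tiny stand-in source string here:
source_bytes = b"print('hello from unpacked source')"
PAYLOAD = base64.b85encode(lzma.compress(source_bytes, preset=9 | lzma.PRESET_EXTREME))

# Unpacking side (what the evaluator executes):
source = lzma.decompress(base64.b85decode(PAYLOAD)).decode("utf-8")
exec(compile(source, "train_gpt_unpacked.py", "exec"))
```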

Attribution

Record path: records/track_10min_16mb/2026-04-18_SP8192_BigramHash32_PathAv3/

… — val_bpb 1.07882 (3-seed mean)

SP8192 stack with two tuning changes: BigramHashEmbedding dimension d=32 and
Path A v3 aggressive int8 passthrough quantization (control tensors + small
2-D matrices). 3-seed mean 1.07882 bpb (std 0.000143, seeds 42/314/999).

Beats the merged 2026-04-09 SP8192 SOTA (1.08100) by -0.00218 bpb =
-0.00564 nats/token, clearing the 0.005-nat threshold at p = 0.00136
(one-sided z = -3.00).

All 3 artifacts fit under 16 MB (margins 8.8 KB / 5.8 KB / 3.9 KB).
Training 588 s on all seeds; eval ~480 s on all seeds. Full C1-C4
compliance with Issue openai#1017. No SLOT, no ETLB, no pre-quant TTT, no
n-gram cache.
himanshudongre added a commit to himanshudongre/parameter-golf that referenced this pull request Apr 18, 2026
…companion to PR openai#1716)

Structured documentation of the ablation path behind my record PR openai#1716 (3-seed mean
1.07882 bpb). Covers what was tested, what worked, what was killed, and the
architectural reason behind each null result.

TL;DR:
- CONFIRMED: BIGRAM_DIM=32 + Path A v3 aggressive passthrough quantization (int8
  control tensors + int8 small matrices + LZMA code wrapper) fits 15.99 MB at 0
  bpb cost; this is the record mechanism.
- NULL: TTT_EPOCHS=4 (Δ -0.00009 bpb, saturated), EVAL_SEQ_LEN=4096 + stride=128
  (sliding +0.00555 due to OOD scored-tail positions), SWA w=1024 training 2x2
  factorial (all configs strictly worse than baseline), TRAIN_SEQ_LEN=4096
  direct training (position-depth-vs-breadth tradeoff: pre-quant better but
  sliding +0.00435), QAT v3 (pre-quant +0.015, TTT diverged to 1.48 via
  QAT x score-first-TTT lattice interaction), Adaptive Hadamard GPTQ (null on
  Muon sub-Gaussian weights).

Structural finding: every attempt to exploit the eval-time working-memory
asymmetry via the absolute-position RoPE architecture fails because sliding
specifically measures position-depth at a narrow tail-band. The architecturally
correct fix is relative-position attention (ALiBi / NoPE), not more aggressive
positional extension. That is the next experiment.
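
For context on the proposed next experiment, a generic illustration of the ALiBi bias construction (not code from this repo): a per-head linear bias on attention scores that depends only on relative distance.

```python
# Generic ALiBi sketch: geometric head slopes times (j - i), clamped causally.
import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
    rel = torch.arange(seq_len)[None, :] - torch.arange(seq_len)[:, None]   # j - i
    bias = slopes[:, None, None] * rel.clamp(max=0).float()                 # (H, T, T), <= 0
    return bias  # add to attention logits before the causal-masked softmax
```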

Includes 10 logs, 5 patches, 2 prototype modules, and full mechanism analysis
for each null so the next person exploring this branch does not re-learn the
same lessons at full training cost.
amrayach added a commit to amrayach/parameter-golf that referenced this pull request Apr 18, 2026
…verlay

Supersedes the 2026-04-18 post-Ultrareview pin (876bb36). Rev 5 adds
provenance auto-capture, repo-type=model fix, exact-SHA env-var pin, and
run_all.sh/README alignment; new pin reflects the pipeline-patch commit.

Also records the live-guidance absolute-BPB overlay and 04b deprecation
driven by open-PR competitive intel (openai#1700 / openai#1716 / openai#1707 / openai#1693).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 19, 2026
The first W94 replay was not a faithful openai#1716 reproduction because the packed
wrapper did not expose its intended vocab/data defaults to the evaluator's
surface detection. That made the remote launcher fall back to the sp1024 data
path. This patch sets the SP8192 defaults explicitly in the wrapper without
changing the packed payload.

Constraint: Round33 is validating openai#1716, not a launcher-derived sp1024 variant
Rejected: Leave autodetection to the launcher | packed wrappers hide VOCAB_SIZE from the regex path
Confidence: high
Scope-risk: narrow
Directive: Any packed submission replay should expose VOCAB_SIZE/DATA_PATH/TOKENIZER_PATH in the wrapper if we rely on evaluator autodetect
Tested: python3 -m py_compile train_gpt.py evaluate.py remote_helper.py
Not-tested: remote end-to-end score after relaunch
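
A hedged sketch of the directive: the packed wrapper exports its intended surface explicitly instead of relying on regex autodetection. The two paths below are placeholders, not the repo's actual locations.

```python
# Set the record's surface before unpacking, so evaluator autodetect cannot drift.
import os

os.environ.setdefault("VOCAB_SIZE", "8192")
os.environ.setdefault("DATA_PATH", "data/sp8192_shards")       # placeholder path
os.environ.setdefault("TOKENIZER_PATH", "tok/sp8192.model")    # placeholder path
# ...then decompress and exec the packed payload as usual.
```
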
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
…1716)

Two orthogonal training-time levers queued behind spec 011:

- bpb-weighted-loss.md (port openai#1519): weight CE by UTF-8 bytes per token.
  Aligns training objective with eval metric. Risk: SP8192 vocab
  destabilization (author warns on large vocabs) + CaseOps byte LUT
  accounting (~1hr of careful code).

- bigram-hash-embed.md (port openai#1716): 16384×32 hash-table bigram embed
  added to token embedding pre-block-0. ~540K params / ~400KB artifact.
  openai#1736 genuinely lacks this despite prevalence in competitive lineages.

Recommended sequencing: 011 → 012 (QK) → 013 (BigramHash, lower risk)
→ 014 (BPB-weighted, higher risk).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
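
One plausible reading of the bpb-weighted-loss port above (openai#1519 may normalize differently): divide the summed cross-entropy by the batch's UTF-8 byte count rather than its token count, so the training objective is literally bits per byte. `byte_len_lut` is an assumed per-token-id table of UTF-8 byte lengths (the "CaseOps byte LUT accounting" mentioned above).

```python
# Hedged sketch: normalize summed CE by bytes instead of tokens.
import math
import torch
import torch.nn.functional as F

def bpb_loss(logits: torch.Tensor, targets: torch.Tensor, byte_len_lut: torch.Tensor) -> torch.Tensor:
    ce = F.cross_entropy(logits.view(-1, logits.size(-1)),
                         targets.view(-1), reduction="sum")    # nats over the batch
    n_bytes = byte_len_lut[targets.view(-1)].sum()             # UTF-8 bytes the batch covers
    return ce / (n_bytes * math.log(2))                        # bits per byte
```
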
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
110 LOC pure addition to train_gpt.py, fully env-gated by
BIGRAM_HASH_ENABLED=0/1. Default-off invariant: with env unset the
forward pass, state_dict, and optimizer param list are byte-identical
to baseline.

Components:
- BigramHashEmbedding(nn.Module): embed(buckets, dim) + CastedLinear
  proj(dim, model_dim). proj._zero_init=True -> identity at step 0.
  Hash: ((prime_a * curr) ^ (prime_b * prev)) % buckets. Position-0
  fallback: prev = curr (self-bigram). Cross-doc leakage not special
  cased, matching openai#1736's SmearGate convention.
- GPT.__init__: creates self.bigram_embed when enabled else None.
- forward_logits + forward_ttt: additive merge of bigram(input_ids)
  to tok_emb(input_ids) before SmearGate. attr-guarded.
- Optimizers: embed.weight -> AdamW optimizer_tok (embed_wd), proj.weight
  -> Muon matrix_params.
- GPTQ hessian hooks: bigram_embed.embed output -> (dim,dim) hessian;
  bigram_embed.proj input -> (dim,dim) hessian (proj is <=65536 numel
  so fp16 passthrough; harmless hook).
- Startup log line echoing config.

Sizing: 16384*32 int6 embed ~= 393KB. 512*32 fp16 proj = 32KB.
Total ~425KB added to artifact; budget dry-run needed before launch.

Env vars (defaults): BIGRAM_HASH_ENABLED=0, BIGRAM_HASH_BUCKETS=16384,
BIGRAM_HASH_DIM=32, BIGRAM_HASH_PRIME_A=36313, BIGRAM_HASH_PRIME_B=27191.
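
A hedged sketch of the module described above, using the listed env-var defaults; CastedLinear, the optimizer split, and the GPTQ hooks are omitted, and a plain nn.Linear with zero-initialized weight stands in for the zero-init projection:

```python
# Illustrative sketch of the described BigramHashEmbedding, not the repo's code.
import os
import torch
import torch.nn as nn

class BigramHashEmbedding(nn.Module):
    def __init__(self, model_dim: int,
                 buckets: int = int(os.getenv("BIGRAM_HASH_BUCKETS", 16384)),
                 dim: int = int(os.getenv("BIGRAM_HASH_DIM", 32)),
                 prime_a: int = int(os.getenv("BIGRAM_HASH_PRIME_A", 36313)),
                 prime_b: int = int(os.getenv("BIGRAM_HASH_PRIME_B", 27191))):
        super().__init__()
        self.buckets, self.prime_a, self.prime_b = buckets, prime_a, prime_b
        self.embed = nn.Embedding(buckets, dim)
        self.proj = nn.Linear(dim, model_dim, bias=False)
        nn.init.zeros_(self.proj.weight)           # zero-init proj -> no-op contribution at step 0

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        prev = torch.roll(input_ids, shifts=1, dims=-1)
        prev[..., 0] = input_ids[..., 0]           # position-0 fallback: self-bigram
        idx = ((self.prime_a * input_ids) ^ (self.prime_b * prev)) % self.buckets
        return self.proj(self.embed(idx))
```

Per the integration notes above, the output is merged additively with tok_emb(input_ids) before SmearGate / block 0, with embed.weight handled by the AdamW token optimizer and proj.weight by Muon.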

Bug lesson learned from exp/training-bundle commit 8d54854: when Edit's
old_string only captures part of a for-loop body, trailing loop
statements get pushed outside the loop and may be absorbed by nearby
conditional blocks. This patch is a pure prepend/append style (no
splits of existing blocks) so that failure mode is avoided.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 20, 2026
The worker launcher was silently dropping the env surface needed by the openai#1716 record line, and packed-vocab detection still misread plain-string base85 wrappers as sp1024. This made the node-A repro fall back to defaults instead of the claimed SP8192 BigramHash32 + gate + TTT recipe.

Constraint: Remote jobs must keep using the image-owned runtime while still honoring record-specific env knobs.
Rejected: Hardcode openai#1716 inside train_gpt.py | would make the worker repo less reusable for the next frontier replay.
Confidence: high
Scope-risk: narrow
Directive: Keep remote_helper packed-wrapper detection aligned with evaluate.py data-setup logic; otherwise faithful repros silently drift.
Tested: py_compile for evaluate.py and remote_helper.py; remote_helper detect-vocab returns 8192 on the packed openai#1716 wrapper; _make_job_command includes VOCAB_SIZE/TRAIN_SHARDS_OVERRIDE override path.
Not-tested: Full remote run after env forwarding change.
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 20, 2026
Own-delta experiments started tuning test-time training, but the worker launcher only forwarded the boolean TTT flag and silently dropped TTT_LR / TTT_EPOCHS. That made a supposed TTT-lr sweep collapse back to the faithful baseline.

Constraint: We need to iterate on openai#1716-derived own deltas without rewriting train_gpt.py per experiment.
Rejected: Hardcode TTT_LR edits directly into the packed training wrapper | too slow and brittle for quick knob sweeps.
Confidence: high
Scope-risk: narrow
Directive: Any launcher-driven experiment knob must be added to FORWARDED_JOB_ENV_KEYS, or the run may silently fall back to the parent surface.
Tested: py_compile on evaluate.py.
Not-tested: remote launch after this commit (will be verified by the next run).
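
A hedged sketch of the directive; FORWARDED_JOB_ENV_KEYS and _make_job_command are names from the worker repo, but the contents and signature shown here are assumptions, and knob names beyond TTT_LR / TTT_EPOCHS are placeholders.

```python
# Forward every whitelisted experiment knob into the remote job command.
import os, shlex

FORWARDED_JOB_ENV_KEYS = [
    "VOCAB_SIZE", "DATA_PATH", "TOKENIZER_PATH",
    "TTT_LR", "TTT_EPOCHS",          # the knobs that were silently dropped
]

def _make_job_command(base_cmd: str) -> str:
    pairs = [f"{k}={shlex.quote(os.environ[k])}"
             for k in FORWARDED_JOB_ENV_KEYS if k in os.environ]
    return " ".join(pairs + [base_cmd])
```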