Record: SP8192 + BigramHash d=32 + Path A v3 passthrough quantization — val_bpb 1.07882 (3-seed mean)#1716

Open
himanshudongre wants to merge 1 commit into openai:main from himanshudongre:record/2026-04-18-sp8192-bigram32-pathav3

Conversation

@himanshudongre

Summary

SP8192 stack with two tuning changes on top of the merged 2026-04-09 SP8192 SOTA (@bigbag, 1.08100):

  1. BigramHashEmbedding d = 32 (vs. the more common d = 48 / d = 64)
  2. Path A v3 aggressive passthrough quantization: per-tensor int8 for five control-tensor families (attn_scale, mlp_scale, resid_mix, skip_gates, skip_weights) and per-row int8 for three small 2-D matrices (bigram.proj, attn_gate_proj, smear_gate.weight) that are otherwise left as fp16 passthrough. Combined with an LZMA self-extracting code wrapper, the full int8 token-embedding + int6 matrix recipe now fits ≤ 16 MB.
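
A minimal sketch of what the per-tensor vs. per-row int8 split above means, assuming a plain symmetric absmax scheme; the actual Path A v3 code may differ in details such as clipping and how scales are stored:

```python
# Illustrative sketch only: symmetric absmax int8 quantization.
import torch

def quantize_per_tensor_int8(t: torch.Tensor):
    """One scale for the whole tensor (the 1-D control-tensor families)."""
    scale = t.abs().max().clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(t / scale), -127, 127).to(torch.int8)
    return q, scale

def quantize_per_row_int8(w: torch.Tensor):
    """One scale per output row (the three small 2-D matrices)."""
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Reconstruct an fp16 tensor at load time."""
    return q.to(torch.float16) * scale.to(torch.float16)
```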

Results

Seed | Post-EMA | Quant roundtrip | Sliding | TTT | Artifact (B)
---- | -------- | --------------- | ------- | --- | ------------
42   | 1.08584  | 1.09678         | 1.08015 | 1.07887 | 15,991,203
314  | 1.08580  | 1.09679         | 1.08024 | 1.07893 | 15,994,170
999  | 1.08561  | 1.09662         | 1.07998 | 1.07866 | 15,996,103
mean | 1.08575  | 1.09673         | 1.08012 | 1.07882 | 15,993,825
std  |          |                 |         | 0.000143 |

Merged SOTA (2026-04-09 @bigbag): 1.08100 bpb. Delta: −0.00218 bpb = −0.00564 nats/token.

Statistical significance (one-sided z-test vs 0.005-nat threshold)

  • Our mean bpb: 1.078817
  • Threshold bpb to clear (at our bpb/val_loss ratio of 0.38716): 1.079064
  • SEM: 0.0000826 bpb
  • Z = −2.998, p = 0.00136 (p < 0.01 required) ✅
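
The z and p above can be reproduced (up to last-digit rounding of the published per-seed values) with a few lines:

```python
# Significance arithmetic from the rounded per-seed TTT numbers.
import math
from statistics import mean, stdev

ttt = [1.07887, 1.07893, 1.07866]        # per-seed TTT val_bpb
threshold = 1.079064                     # bpb equivalent of SOTA minus 0.005 nats/token
m = mean(ttt)
sem = stdev(ttt) / math.sqrt(len(ttt))
z = (m - threshold) / sem
p = 0.5 * math.erfc(-z / math.sqrt(2))   # one-sided P(Z <= z)
print(f"mean={m:.6f} sem={sem:.7f} z={z:.2f} p={p:.5f}")
```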

Compliance (per Issue #1017)

  • C1 Causality ✅ strictly causal forward; sliding-window eval never references future tokens
  • C2 Normalized ✅ standard softmax over full 8192 vocab
  • C3 Score-before-update ✅ each TTT chunk scored under inference_mode() before any parameter update (pattern sketched below)
  • C4 Single pass ✅ each val token scored exactly once

Additional: no SLOT, no pre-quant TTT on val, no ETLB, no n-gram cache, seeds match @bigbag's convention (42, 314, 999), artifact < 16 MB on all seeds, training ≤ 588 s, eval ≤ 481 s.
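
For reviewers, a minimal sketch of the C3/C4 score-before-update pattern claimed above; the model and optimizer APIs are placeholders, not the submission's actual eval loop:

```python
# Hedged sketch: score each chunk strictly before adapting on it.
import torch

def ttt_eval(model, chunks, ttt_opt):
    total_nats, total_tokens = 0.0, 0
    for chunk in chunks:                        # C4: each val token lives in exactly one chunk
        with torch.inference_mode():            # C3: score strictly before any update
            loss = model(chunk)
        total_nats += loss.item() * chunk.numel()
        total_tokens += chunk.numel()
        ttt_opt.zero_grad(set_to_none=True)     # adapt only on the already-scored chunk
        model(chunk).backward()
        ttt_opt.step()
    return total_nats / total_tokens            # nats/token; convert via bytes/token and ln 2 for bpb
```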

Files

  • README.md — full writeup with tables, compliance, statistical evidence, reproduction
  • submission.json — structured metadata (per-seed results, compliance, attribution)
  • train_gpt.py — 18,097-byte LZMA self-extracting submission (unpacking scheme sketched below)
  • train_gpt_stacked_v2_fixed.py — unpacked reference source (53,514 B) for reviewability
  • train_seed{42,314,999}.log — full training logs
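
The self-extracting layout of train_gpt.py follows the usual pack/unpack pattern; a hedged sketch with a tiny stand-in payload (the submitted blob and its exact packing parameters may differ):

```python
# Hedged sketch of an LZMA self-extracting wrapper.
import base64, lzma

# Packing side (run once, offline); a tiny stand-in source string here:
source_bytes = b"print('hello from unpacked source')"
PAYLOAD = base64.b85encode(lzma.compress(source_bytes, preset=9 | lzma.PRESET_EXTREME))

# Unpacking side (what the evaluator executes):
source = lzma.decompress(base64.b85decode(PAYLOAD)).decode("utf-8")
exec(compile(source, "train_gpt_unpacked.py", "exec"))
```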

Attribution

Record path: records/track_10min_16mb/2026-04-18_SP8192_BigramHash32_PathAv3/

… — val_bpb 1.07882 (3-seed mean)

SP8192 stack with two tuning changes: BigramHashEmbedding dimension d=32 and
Path A v3 aggressive int8 passthrough quantization (control tensors + small
2-D matrices). 3-seed mean 1.07882 bpb (std 0.000143, seeds 42/314/999).

Beats the merged 2026-04-09 SP8192 SOTA (1.08100) by -0.00218 bpb =
-0.00564 nats/token, clearing the 0.005-nat threshold at p = 0.00136
(one-sided z = -3.00).

All 3 artifacts fit under 16 MB (margins 8.8 KB / 5.8 KB / 3.9 KB).
Training 588 s on all seeds; eval ~480 s on all seeds. Full C1-C4
compliance with Issue openai#1017. No SLOT, no ETLB, no pre-quant TTT, no
n-gram cache.
himanshudongre added a commit to himanshudongre/parameter-golf that referenced this pull request Apr 18, 2026
…companion to PR openai#1716)

Structured documentation of the ablation path behind my record PR openai#1716 (3-seed mean
1.07882 bpb). Covers what was tested, what worked, what was killed, and the
architectural reason behind each null result.

TL;DR:
- CONFIRMED: BIGRAM_DIM=32 + Path A v3 aggressive passthrough quantization (int8
  control tensors + int8 small matrices + LZMA code wrapper) fits 15.99 MB at 0
  bpb cost; this is the record mechanism.
- NULL: TTT_EPOCHS=4 (Δ -0.00009 bpb, saturated), EVAL_SEQ_LEN=4096 + stride=128
  (sliding +0.00555 due to OOD scored-tail positions), SWA w=1024 training 2x2
  factorial (all configs strictly worse than baseline), TRAIN_SEQ_LEN=4096
  direct training (position-depth-vs-breadth tradeoff: pre-quant better but
  sliding +0.00435), QAT v3 (pre-quant +0.015, TTT diverged to 1.48 via
  QAT x score-first-TTT lattice interaction), Adaptive Hadamard GPTQ (null on
  Muon sub-Gaussian weights).

Structural finding: every attempt to exploit the eval-time working-memory
asymmetry via the absolute-position RoPE architecture fails because sliding
specifically measures position-depth at a narrow tail-band. The architecturally
correct fix is relative-position attention (ALiBi / NoPE), not more aggressive
positional extension. That is the next experiment.
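
For context on the proposed next experiment, a generic illustration of the ALiBi bias construction (not code from this repo): a per-head linear bias on attention scores that depends only on relative distance.

```python
# Generic ALiBi sketch: geometric head slopes times (j - i), clamped causally.
import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
    rel = torch.arange(seq_len)[None, :] - torch.arange(seq_len)[:, None]   # j - i
    bias = slopes[:, None, None] * rel.clamp(max=0).float()                 # (H, T, T), <= 0
    return bias  # add to attention logits before the causal-masked softmax
```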

Includes 10 logs, 5 patches, 2 prototype modules, and full mechanism analysis
for each null so the next person exploring this branch does not re-learn the
same lessons at full training cost.
amrayach added a commit to amrayach/parameter-golf that referenced this pull request Apr 18, 2026
…verlay

Supersedes the 2026-04-18 post-Ultrareview pin (876bb36). Rev 5 adds
provenance auto-capture, repo-type=model fix, exact-SHA env-var pin, and
run_all.sh/README alignment; new pin reflects the pipeline-patch commit.

Also records the live-guidance absolute-BPB overlay and 04b deprecation
driven by open-PR competitive intel (openai#1700 / openai#1716 / openai#1707 / openai#1693).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 19, 2026
The first W94 replay was not a faithful openai#1716 reproduction because the packed
wrapper did not expose its intended vocab/data defaults to the evaluator's
surface detection. That made the remote launcher fall back to the sp1024 data
path. This patch sets the SP8192 defaults explicitly in the wrapper without
changing the packed payload.

Constraint: Round33 is validating openai#1716, not a launcher-derived sp1024 variant
Rejected: Leave autodetection to the launcher | packed wrappers hide VOCAB_SIZE from the regex path
Confidence: high
Scope-risk: narrow
Directive: Any packed submission replay should expose VOCAB_SIZE/DATA_PATH/TOKENIZER_PATH in the wrapper if we rely on evaluator autodetect
Tested: python3 -m py_compile train_gpt.py evaluate.py remote_helper.py
Not-tested: remote end-to-end score after relaunch
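
A hedged sketch of the directive: the packed wrapper exports its intended surface explicitly instead of relying on regex autodetection. The two paths below are placeholders, not the repo's actual locations.

```python
# Set the record's surface before unpacking, so evaluator autodetect cannot drift.
import os

os.environ.setdefault("VOCAB_SIZE", "8192")
os.environ.setdefault("DATA_PATH", "data/sp8192_shards")       # placeholder path
os.environ.setdefault("TOKENIZER_PATH", "tok/sp8192.model")    # placeholder path
# ...then decompress and exec the packed payload as usual.
```
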
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
…1716)

Two orthogonal training-time levers queued behind spec 011:

- bpb-weighted-loss.md (port openai#1519): weight CE by UTF-8 bytes per token.
  Aligns training objective with eval metric. Risk: SP8192 vocab
  destabilization (author warns on large vocabs) + CaseOps byte LUT
  accounting (~1hr of careful code).

- bigram-hash-embed.md (port openai#1716): 16384×32 hash-table bigram embed
  added to token embedding pre-block-0. ~540K params / ~400KB artifact.
  openai#1736 genuinely lacks this despite prevalence in competitive lineages.

Recommended sequencing: 011 → 012 (QK) → 013 (BigramHash, lower risk)
→ 014 (BPB-weighted, higher risk).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
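
One plausible reading of the bpb-weighted-loss port above (openai#1519 may normalize differently): divide the summed cross-entropy by the batch's UTF-8 byte count rather than its token count, so the training objective is literally bits per byte. `byte_len_lut` is an assumed per-token-id table of UTF-8 byte lengths (the "CaseOps byte LUT accounting" mentioned above).

```python
# Hedged sketch: normalize summed CE by bytes instead of tokens.
import math
import torch
import torch.nn.functional as F

def bpb_loss(logits: torch.Tensor, targets: torch.Tensor, byte_len_lut: torch.Tensor) -> torch.Tensor:
    ce = F.cross_entropy(logits.view(-1, logits.size(-1)),
                         targets.view(-1), reduction="sum")    # nats over the batch
    n_bytes = byte_len_lut[targets.view(-1)].sum()             # UTF-8 bytes the batch covers
    return ce / (n_bytes * math.log(2))                        # bits per byte
```
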
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
110 LOC pure addition to train_gpt.py, fully env-gated by
BIGRAM_HASH_ENABLED=0/1. Default-off invariant: with env unset the
forward pass, state_dict, and optimizer param list are byte-identical
to baseline.

Components:
- BigramHashEmbedding(nn.Module): embed(buckets, dim) + CastedLinear
  proj(dim, model_dim). proj._zero_init=True -> identity at step 0.
  Hash: ((prime_a * curr) ^ (prime_b * prev)) % buckets. Position-0
  fallback: prev = curr (self-bigram). Cross-doc leakage not special
  cased, matching openai#1736's SmearGate convention.
- GPT.__init__: creates self.bigram_embed when enabled else None.
- forward_logits + forward_ttt: additive merge of bigram(input_ids)
  to tok_emb(input_ids) before SmearGate. attr-guarded.
- Optimizers: embed.weight -> AdamW optimizer_tok (embed_wd), proj.weight
  -> Muon matrix_params.
- GPTQ hessian hooks: bigram_embed.embed output -> (dim,dim) hessian;
  bigram_embed.proj input -> (dim,dim) hessian (proj is <=65536 numel
  so fp16 passthrough; harmless hook).
- Startup log line echoing config.

Sizing: 16384*32 int6 embed ~= 393KB. 512*32 fp16 proj = 32KB.
Total ~425KB added to artifact; budget dry-run needed before launch.

Env vars (defaults): BIGRAM_HASH_ENABLED=0, BIGRAM_HASH_BUCKETS=16384,
BIGRAM_HASH_DIM=32, BIGRAM_HASH_PRIME_A=36313, BIGRAM_HASH_PRIME_B=27191.
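
A hedged sketch of the module described above, using the listed env-var defaults; CastedLinear, the optimizer split, and the GPTQ hooks are omitted, and a plain nn.Linear with zero-initialized weight stands in for the zero-init projection:

```python
# Illustrative sketch of the described BigramHashEmbedding, not the repo's code.
import os
import torch
import torch.nn as nn

class BigramHashEmbedding(nn.Module):
    def __init__(self, model_dim: int,
                 buckets: int = int(os.getenv("BIGRAM_HASH_BUCKETS", 16384)),
                 dim: int = int(os.getenv("BIGRAM_HASH_DIM", 32)),
                 prime_a: int = int(os.getenv("BIGRAM_HASH_PRIME_A", 36313)),
                 prime_b: int = int(os.getenv("BIGRAM_HASH_PRIME_B", 27191))):
        super().__init__()
        self.buckets, self.prime_a, self.prime_b = buckets, prime_a, prime_b
        self.embed = nn.Embedding(buckets, dim)
        self.proj = nn.Linear(dim, model_dim, bias=False)
        nn.init.zeros_(self.proj.weight)           # zero-init proj -> no-op contribution at step 0

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        prev = torch.roll(input_ids, shifts=1, dims=-1)
        prev[..., 0] = input_ids[..., 0]           # position-0 fallback: self-bigram
        idx = ((self.prime_a * input_ids) ^ (self.prime_b * prev)) % self.buckets
        return self.proj(self.embed(idx))
```

Per the integration notes above, the output is merged additively with tok_emb(input_ids) before SmearGate / block 0, with embed.weight handled by the AdamW token optimizer and proj.weight by Muon.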

Bug lesson learned from exp/training-bundle commit 8d54854: when Edit's
old_string only captures part of a for-loop body, trailing loop
statements get pushed outside the loop and may be absorbed by nearby
conditional blocks. This patch is a pure prepend/append style (no
splits of existing blocks) so that failure mode is avoided.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 20, 2026
The worker launcher was silently dropping the env surface needed by the openai#1716 record line, and packed-vocab detection still misread plain-string base85 wrappers as sp1024. This made the node-A repro fall back to defaults instead of the claimed SP8192 BigramHash32 + gate + TTT recipe.

Constraint: Remote jobs must keep using the image-owned runtime while still honoring record-specific env knobs.
Rejected: Hardcode openai#1716 inside train_gpt.py | would make the worker repo less reusable for the next frontier replay.
Confidence: high
Scope-risk: narrow
Directive: Keep remote_helper packed-wrapper detection aligned with evaluate.py data-setup logic; otherwise faithful repros silently drift.
Tested: py_compile for evaluate.py and remote_helper.py; remote_helper detect-vocab returns 8192 on the packed openai#1716 wrapper; _make_job_command includes VOCAB_SIZE/TRAIN_SHARDS_OVERRIDE override path.
Not-tested: Full remote run after env forwarding change.
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 20, 2026
Own-delta experiments started tuning test-time training, but the worker launcher only forwarded the boolean TTT flag and silently dropped TTT_LR / TTT_EPOCHS. That made a supposed TTT-lr sweep collapse back to the faithful baseline.

Constraint: We need to iterate on openai#1716-derived own deltas without rewriting train_gpt.py per experiment.
Rejected: Hardcode TTT_LR edits directly into the packed training wrapper | too slow and brittle for quick knob sweeps.
Confidence: high
Scope-risk: narrow
Directive: Any launcher-driven experiment knob must be added to FORWARDED_JOB_ENV_KEYS, or the run may silently fall back to the parent surface.
Tested: py_compile on evaluate.py.
Not-tested: remote launch after this commit (will be verified by the next run).
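
A hedged sketch of the directive; FORWARDED_JOB_ENV_KEYS and _make_job_command are names from the worker repo, but the contents and signature shown here are assumptions, and knob names beyond TTT_LR / TTT_EPOCHS are placeholders.

```python
# Forward every whitelisted experiment knob into the remote job command.
import os, shlex

FORWARDED_JOB_ENV_KEYS = [
    "VOCAB_SIZE", "DATA_PATH", "TOKENIZER_PATH",
    "TTT_LR", "TTT_EPOCHS",          # the knobs that were silently dropped
]

def _make_job_command(base_cmd: str) -> str:
    pairs = [f"{k}={shlex.quote(os.environ[k])}"
             for k in FORWARDED_JOB_ENV_KEYS if k in os.environ]
    return " ".join(pairs + [base_cmd])
```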