Record: SP8192 + BigramHash d=32 + Path A v3 passthrough quantization — val_bpb 1.07882 (3-seed mean) #1716
Open
himanshudongre wants to merge 1 commit into openai:main from
Conversation
… — val_bpb 1.07882 (3-seed mean) SP8192 stack with two tuning changes: BigramHashEmbedding dimension d=32 and Path A v3 aggressive int8 passthrough quantization (control tensors + small 2-D matrices). 3-seed mean 1.07882 bpb (std 0.000143, seeds 42/314/999). Beats the merged 2026-04-09 SP8192 SOTA (1.08100) by -0.00218 bpb = -0.00564 nats/token, clearing the 0.005-nat threshold at p = 0.00136 (one-sided z = -3.00). All 3 artifacts fit under 16 MB (margins 8.8 KB / 5.8 KB / 3.9 KB). Training 588 s on all seeds; eval ~480 s on all seeds. Full C1-C4 compliance with Issue openai#1017. No SLOT, no ETLB, no pre-quant TTT, no n-gram cache.
himanshudongre added a commit to himanshudongre/parameter-golf that referenced this pull request on Apr 18, 2026
…companion to PR openai#1716) Structured documentation of the ablation path behind my record PR openai#1716 (3-seed mean 1.07882 bpb). Covers what was tested, what worked, what was killed, and the architectural reason behind each null result.
TL;DR:
- CONFIRMED: BIGRAM_DIM=32 + Path A v3 aggressive passthrough quantization (int8 control tensors + int8 small matrices + LZMA code wrapper) fits 15.99 MB at 0 bpb cost; this is the record mechanism.
- NULL: TTT_EPOCHS=4 (Δ -0.00009 bpb, saturated), EVAL_SEQ_LEN=4096 + stride=128 (sliding +0.00555 due to OOD scored-tail positions), SWA w=1024 training 2x2 factorial (all configs strictly worse than baseline), TRAIN_SEQ_LEN=4096 direct training (position-depth-vs-breadth tradeoff: pre-quant better but sliding +0.00435), QAT v3 (pre-quant +0.015, TTT diverged to 1.48 via QAT x score-first-TTT lattice interaction), Adaptive Hadamard GPTQ (null on Muon sub-Gaussian weights).
Structural finding: every attempt to exploit the eval-time working-memory asymmetry via the absolute-position RoPE architecture fails because sliding specifically measures position-depth at a narrow tail band. The architecturally correct fix is relative-position attention (ALiBi / NoPE), not more aggressive positional extension. That is the next experiment.
Includes 10 logs, 5 patches, 2 prototype modules, and full mechanism analysis for each null so the next person exploring this branch does not re-learn the same lessons at full training cost.
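The LZMA code wrapper named in the confirmed mechanism is only referenced in this thread, not shown. Below is a minimal sketch of one way such a self-extracting submission can be packed, assuming a base85-encoded LZMA payload exec'd by a short stub; pack_submission.py and the stub layout are illustrative, not the actual openai#1716 packer.

```python
# pack_submission.py -- illustrative sketch only; the real wrapper may differ.
# Compresses the full training source with LZMA and writes a small stub that
# decompresses and exec()s it at import time, shrinking the code footprint.
import base64
import lzma

STUB = '''import base64, lzma
_SRC = lzma.decompress(base64.b85decode({payload!r})).decode("utf-8")
exec(compile(_SRC, "train_gpt_unpacked.py", "exec"))
'''

def pack(src_path: str, out_path: str) -> None:
    with open(src_path, "rb") as f:
        raw = f.read()
    payload = base64.b85encode(lzma.compress(raw, preset=9)).decode("ascii")
    with open(out_path, "w") as f:
        f.write(STUB.format(payload=payload))

if __name__ == "__main__":
    # Hypothetical file names matching the Files list further down.
    pack("train_gpt_stacked_v2_fixed.py", "train_gpt.py")
```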
amrayach added a commit to amrayach/parameter-golf that referenced this pull request on Apr 18, 2026
…verlay Supersedes the 2026-04-18 post-Ultrareview pin (876bb36). Rev 5 adds provenance auto-capture, a repo-type=model fix, an exact-SHA env-var pin, and run_all.sh/README alignment; the new pin reflects the pipeline-patch commit. Also records the live-guidance absolute-BPB overlay and the 04b deprecation driven by open-PR competitive intel (openai#1700 / openai#1716 / openai#1707 / openai#1693).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
resouer added a commit to resouer/parameter-golf that referenced this pull request on Apr 19, 2026
The first W94 replay was not a faithful openai#1716 reproduction because the packed wrapper did not expose its intended vocab/data defaults to the evaluator's surface detection. That made the remote launcher fall back to the sp1024 data path. This patch sets the SP8192 defaults explicitly in the wrapper without changing the packed payload.
Constraint: Round33 is validating openai#1716, not a launcher-derived sp1024 variant.
Rejected: Leave autodetection to the launcher | packed wrappers hide VOCAB_SIZE from the regex path.
Confidence: high
Scope-risk: narrow
Directive: Any packed submission replay should expose VOCAB_SIZE/DATA_PATH/TOKENIZER_PATH in the wrapper if we rely on evaluator autodetect.
Tested: python3 -m py_compile train_gpt.py evaluate.py remote_helper.py
Not-tested: remote end-to-end score after relaunch
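A minimal sketch of what exposing those defaults in the wrapper prologue could look like, assuming the evaluator's surface detection scans for plain assignments or env vars; the path values are placeholders, not the repo's actual layout.

```python
# Prologue of the packed wrapper (illustrative): make the SP8192 surface visible
# to the launcher/evaluator even though the real source sits inside the LZMA payload.
import os

# Explicit defaults so autodetection does not fall back to the sp1024 data path.
os.environ.setdefault("VOCAB_SIZE", "8192")
os.environ.setdefault("DATA_PATH", "data/sp8192")                    # placeholder path
os.environ.setdefault("TOKENIZER_PATH", "tokenizers/sp8192.model")   # placeholder path

VOCAB_SIZE = int(os.environ["VOCAB_SIZE"])  # plain assignment for regex-based detection
```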
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request on Apr 20, 2026
…1716) Two orthogonal training-time levers queued behind spec 011:
- bpb-weighted-loss.md (port openai#1519): weight CE by UTF-8 bytes per token. Aligns the training objective with the eval metric. Risk: SP8192 vocab destabilization (the author warns on large vocabs) + CaseOps byte LUT accounting (~1 hr of careful code). See the sketch after this commit message for one reading of the weighting.
- bigram-hash-embed.md (port openai#1716): 16384×32 hash-table bigram embed added to the token embedding pre-block-0. ~540K params / ~400KB artifact. openai#1736 genuinely lacks this despite its prevalence in competitive lineages.
Recommended sequencing: 011 → 012 (QK) → 013 (BigramHash, lower risk) → 014 (BPB-weighted, higher risk).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
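For the bpb-weighted-loss lever, one plausible reading (not necessarily how the openai#1519 port does it) is to score each batch in nats per byte rather than nats per token, using a precomputed per-token-id UTF-8 byte-length table. A minimal sketch under that assumption; bytes_per_token_table and token_byte_lens are hypothetical helpers.

```python
import torch
import torch.nn.functional as F

def bytes_per_token_table(tokenizer, vocab_size: int) -> torch.Tensor:
    """UTF-8 byte length of every token id (hypothetical helper; depends on the tokenizer API)."""
    return torch.tensor(
        [len(tokenizer.decode([i]).encode("utf-8")) for i in range(vocab_size)],
        dtype=torch.float32,
    )

def nats_per_byte_loss(logits, targets, token_byte_lens):
    """Cross-entropy normalized by the total UTF-8 bytes covered by the targets,
    so the training objective tracks the bits-per-byte eval metric (up to ln 2)."""
    per_tok = F.cross_entropy(logits, targets, reduction="none")   # nats per token
    total_bytes = token_byte_lens[targets].sum().clamp_min(1.0)    # bytes scored in this batch
    return per_tok.sum() / total_bytes
```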
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request on Apr 20, 2026
110 LOC pure addition to train_gpt.py, fully env-gated by BIGRAM_HASH_ENABLED=0/1. Default-off invariant: with the env var unset, the forward pass, state_dict, and optimizer param list are byte-identical to baseline.
Components:
- BigramHashEmbedding(nn.Module): embed(buckets, dim) + CastedLinear proj(dim, model_dim). proj._zero_init=True -> identity at step 0. Hash: ((prime_a * curr) ^ (prime_b * prev)) % buckets. Position-0 fallback: prev = curr (self-bigram). Cross-doc leakage is not special-cased, matching openai#1736's SmearGate convention.
- GPT.__init__: creates self.bigram_embed when enabled, else None.
- forward_logits + forward_ttt: additive merge of bigram(input_ids) into tok_emb(input_ids) before SmearGate; attr-guarded.
- Optimizers: embed.weight -> AdamW optimizer_tok (embed_wd); proj.weight -> Muon matrix_params.
- GPTQ hessian hooks: bigram_embed.embed output -> (dim, dim) hessian; bigram_embed.proj input -> (dim, dim) hessian (proj is <= 65536 numel so fp16 passthrough; harmless hook).
- Startup log line echoing the config.
Sizing: 16384*32 int6 embed ~= 393 KB; 512*32 fp16 proj = 32 KB. Total ~425 KB added to the artifact; a budget dry-run is needed before launch.
Env vars (defaults): BIGRAM_HASH_ENABLED=0, BIGRAM_HASH_BUCKETS=16384, BIGRAM_HASH_DIM=32, BIGRAM_HASH_PRIME_A=36313, BIGRAM_HASH_PRIME_B=27191.
Bug lesson learned from exp/training-bundle commit 8d54854: when Edit's old_string only captures part of a for-loop body, trailing loop statements get pushed outside the loop and may be absorbed by nearby conditional blocks. This patch is pure prepend/append style (no splits of existing blocks), so that failure mode is avoided.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
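A minimal sketch of the module described above in plain PyTorch. CastedLinear is repo-specific, so an nn.Linear with a zero-initialized weight stands in for proj._zero_init=True, and the integration points (SmearGate, optimizer routing, GPTQ hooks) are omitted.

```python
import torch
import torch.nn as nn

class BigramHashEmbedding(nn.Module):
    """Hash-bucketed bigram embedding, added to the token embedding before block 0.

    Sketch of the mechanism described in the commit message; not the repo's exact code.
    """

    def __init__(self, buckets=16384, dim=32, model_dim=512,
                 prime_a=36313, prime_b=27191):
        super().__init__()
        self.buckets, self.prime_a, self.prime_b = buckets, prime_a, prime_b
        self.embed = nn.Embedding(buckets, dim)
        self.proj = nn.Linear(dim, model_dim, bias=False)
        nn.init.zeros_(self.proj.weight)  # zero-init projection -> exact identity at step 0

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        # input_ids: (batch, seq) int64 token ids
        prev = torch.roll(input_ids, shifts=1, dims=-1)
        prev[..., 0] = input_ids[..., 0]  # position-0 fallback: self-bigram
        buckets = ((self.prime_a * input_ids) ^ (self.prime_b * prev)) % self.buckets
        return self.proj(self.embed(buckets))  # (batch, seq, model_dim)
```

Per the commit message, the output is merged additively into tok_emb(input_ids) before SmearGate, with embed.weight routed to the AdamW token optimizer and proj.weight to Muon's matrix params.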
resouer added a commit to resouer/parameter-golf that referenced this pull request on Apr 20, 2026
The worker launcher was silently dropping the env surface needed by the openai#1716 record line, and packed-vocab detection still misread plain-string base85 wrappers as sp1024. This made the node-A repro fall back to defaults instead of the claimed SP8192 BigramHash32 + gate + TTT recipe.
Constraint: Remote jobs must keep using the image-owned runtime while still honoring record-specific env knobs.
Rejected: Hardcode openai#1716 inside train_gpt.py | would make the worker repo less reusable for the next frontier replay.
Confidence: high
Scope-risk: narrow
Directive: Keep remote_helper packed-wrapper detection aligned with evaluate.py data-setup logic; otherwise faithful repros silently drift.
Tested: py_compile for evaluate.py and remote_helper.py; remote_helper detect-vocab returns 8192 on the packed openai#1716 wrapper; _make_job_command includes the VOCAB_SIZE/TRAIN_SHARDS_OVERRIDE override path.
Not-tested: Full remote run after the env forwarding change.
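A minimal sketch of the kind of detection alignment the directive calls for, assuming remote_helper greps the wrapper text much like evaluate.py's data setup does; detect_vocab_size and the regex are illustrative, not the repo's actual API.

```python
import os
import re

def detect_vocab_size(wrapper_path: str, default: int = 1024) -> int:
    """Prefer an explicit VOCAB_SIZE exposed in the wrapper text; only then fall back.

    Packed (base85/LZMA) wrappers hide the real source from source-level regexes,
    so the wrapper prologue must expose the value as a plain assignment or env default.
    """
    text = open(wrapper_path, encoding="utf-8", errors="ignore").read()
    m = re.search(r'VOCAB_SIZE["\']?\s*[,=:]\s*["\']?(\d+)', text)
    if m:
        return int(m.group(1))
    return int(os.environ.get("VOCAB_SIZE", default))
```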
resouer added a commit to resouer/parameter-golf that referenced this pull request on Apr 20, 2026
Own-delta experiments started tuning test-time training, but the worker launcher only forwarded the boolean TTT flag and silently dropped TTT_LR / TTT_EPOCHS. That made a supposed TTT-lr sweep collapse back to the faithful baseline.
Constraint: We need to iterate on openai#1716-derived own deltas without rewriting train_gpt.py per experiment.
Rejected: Hardcode TTT_LR edits directly into the packed training wrapper | too slow and brittle for quick knob sweeps.
Confidence: high
Scope-risk: narrow
Directive: Any launcher-driven experiment knob must be added to FORWARDED_JOB_ENV_KEYS, or the run may silently fall back to the parent surface.
Tested: py_compile on evaluate.py.
Not-tested: remote launch after this commit (will be verified by the next run).
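A minimal sketch of the forwarding pattern behind that directive, assuming the launcher builds the remote job environment from an allowlist; FORWARDED_JOB_ENV_KEYS is named in the commit, but the surrounding code and the specific keys shown are illustrative.

```python
import os

# Allowlist of env knobs the launcher forwards into the remote job. Any new
# experiment knob (e.g. TTT_LR, TTT_EPOCHS) must be added here, or the remote
# run silently falls back to the parent surface's defaults. Keys are examples.
FORWARDED_JOB_ENV_KEYS = [
    "VOCAB_SIZE", "DATA_PATH", "TOKENIZER_PATH",
    "TTT_LR", "TTT_EPOCHS",
    "BIGRAM_HASH_ENABLED", "BIGRAM_HASH_DIM",
]

def job_env() -> dict:
    """Collect only the allowlisted keys that are actually set locally."""
    return {k: os.environ[k] for k in FORWARDED_JOB_ENV_KEYS if k in os.environ}
```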
Summary
SP8192 stack with two tuning changes on top of the merged 2026-04-09 SP8192 SOTA (@bigbag, 1.08100):
1. BigramHashEmbedding dimension d = 32 (vs the common d = 48 / d = 64).
2. Path A v3 aggressive passthrough quantization: int8 for the control tensors (attn_scale, mlp_scale, resid_mix, skip_gates, skip_weights) and per-row int8 for three small 2-D matrices (bigram.proj, attn_gate_proj, smear_gate.weight) that are otherwise left as fp16 passthrough (sketched below). Combined with an LZMA self-extracting code wrapper, the full int8 token-embedding + int6 matrix recipe now fits ≤ 16 MB.
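A minimal sketch of the per-row int8 scheme in change 2, assuming symmetric per-row scaling with fp16 scales; the actual Path A v3 packer inside train_gpt.py may differ.

```python
import torch

def quantize_rows_int8(w: torch.Tensor):
    """Symmetric per-row int8: one fp16 scale per row plus int8 codes.

    Intended for small 2-D matrices (e.g. bigram.proj, attn_gate_proj,
    smear_gate.weight) that would otherwise be stored as fp16 passthrough.
    """
    scale = w.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale.to(torch.float16)

def dequantize_rows_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Reconstruct the fp16 matrix at load time."""
    return q.to(torch.float16) * scale
```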
Results
3-seed mean: 1.07882 bpb (std 0.000143; seeds 42/314/999). Merged SOTA (2026-04-09 @bigbag): 1.08100 bpb. Delta: −0.00218 bpb = −0.00564 nats/token.
Statistical significance (one-sided z-test vs 0.005-nat threshold)
The −0.00564 nats/token improvement clears the 0.005-nat threshold at p = 0.00136 (one-sided z = −3.00), using the 3-seed std of 0.000143 bpb.
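A quick check of the quoted z and p, assuming the test compares the per-token improvement in nats against the 0.005-nat threshold with the 3-seed standard error; the bpb-to-nats conversion factor is backed out from the PR's own delta figures.

```python
import math

# Numbers quoted in the PR description.
delta_bpb = 1.08100 - 1.07882        # improvement over the merged SOTA, in bpb
std_bpb = 0.000143                   # 3-seed std of val_bpb
nats_per_bpb = 0.00564 / 0.00218     # conversion factor implied by the PR's own delta
threshold_nats = 0.005               # significance threshold in nats/token

delta_nats = delta_bpb * nats_per_bpb
se_nats = (std_bpb * nats_per_bpb) / math.sqrt(3)
# Signed like the PR: negative z means the improvement clears the threshold.
z = (threshold_nats - delta_nats) / se_nats
p = 0.5 * math.erfc(-z / math.sqrt(2))   # one-sided p = Phi(z)
print(f"z = {z:.2f}, p = {p:.5f}")       # ~ z = -3.00, p ~ 0.0014 (PR quotes 0.00136)
```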
Compliance (per Issue #1017)
… inference_mode() before any parameter update.
Additional: no SLOT, no pre-quant TTT on val, no ETLB, no n-gram cache; seeds match @bigbag's convention (42, 314, 999); artifact < 16 MB on all seeds; training ≤ 588 s; eval ≤ 481 s.
Files
- README.md — full writeup with tables, compliance, statistical evidence, reproduction
- submission.json — structured metadata (per-seed results, compliance, attribution)
- train_gpt.py — 18,097-byte LZMA self-extracting submission
- train_gpt_stacked_v2_fixed.py — unpacked reference source (53,514 B) for reviewability
- train_seed{42,314,999}.log — full training logs
Attribution
Record path:
records/track_10min_16mb/2026-04-18_SP8192_BigramHash32_PathAv3/