Record: PR1855/PR1953 base + Progressive context growth (val_bpb: 1.05759, 3-seed) #2014
Open
simonbissonnette wants to merge 2 commits into openai:main from
Conversation
Fija pushed a commit to Fija/parameter-golf that referenced this pull request on Apr 30, 2026:
Pull PR openai#2014's record dir from openai/parameter-golf and reproduce its 1.05759 3-seed mean. Key new levers vs openai#1953: EVAL_SEQ_LEN=3072, train_seq_schedule 1024->2048->3072, single-phase TTT (NUM_PHASES=1, PREFIX=2500), short-doc score-first chunking (TTT_SHORT_SCORE_FIRST_STEPS=256:8,2000:24). Even with our infra's ~1.5-2 milli-BPB inflation pattern, reproducing openai#2014 should land ~1.0590 — close enough to record bar to potentially clear it. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request on May 1, 2026:
Port the 040C 'middle 5x / late 3.4x' allocation onto simonbissonnette's progressive-3k base (openai#2014) and screen vs uniform 4.0 baseline. Training-only, 4xH100 1200s, single seed. Code on exp/300-040c-on-2014 @ d174313. Spec flags the column-slice-in-compile hazard from feedback memory and mandates a compile-sanity check before scaling. PREQUANT_ONLY=1 keeps the screen cheap by skipping serialize/GPTQ/TTT. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Fija pushed a commit to Fija/parameter-golf that referenced this pull request on May 1, 2026:
…ean 1.05831 BPB

Clears record bar (1.05914) by 0.83 milli-BPB. Welch t = -6.49 vs PR openai#1855 (1.06108), p < 0.0001. All 3 seeds produce 15.99 MB artifacts under the 16 MB cap, all under the 600s wallclock budget.

Per-seed:

- 42: ttt=1.05793 art=15,986,149 eval=572.6s
- 314: ttt=1.05852 art=15,987,257 eval=553.7s
- 1234: ttt=1.05849 art=15,989,895 eval=574.1s

Submission directory at records/track_10min_16mb/2026-04-30_PR2014_Reproduction_1.0583/ contains PR openai#2014's verbatim train_gpt.py + tokenizer + our seed_results.csv + a detailed README documenting the lineage (openai#1797 -> openai#1851 -> openai#1855 -> openai#1908 -> openai#1923 -> openai#1953 -> openai#2014), the new levers vs each parent, and the full 4-condition C1-C4 legality check. submission.json author/github_id are placeholders pending the user's choice of submitting account.

Reproduction script: runpod/phase_x_pr2014.sh — runs end-to-end on a single 8xH100 SXM pod (~2.5h wall, ~$66 cost).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Idan3011 pushed a commit to Idan3011/parameter-golf that referenced this pull request on May 1, 2026.
TanishGudise added a commit to TanishGudise/parameter-golf that referenced this pull request on May 1, 2026:
AsymLogit Rescale (PR openai#1923) ported as 2 TTT-adaptable scalar params (softcap_pos, softcap_neg). Pre-quant 1.06160 (slightly worse than S55's 1.06058 — AsymLogit hurts un-adapted model). TTT recovery -0.01267 (much better than S55's -0.01103) — AsymLogit gives massive adaptive capacity. Final 1.05759 = -0.00055 vs S55. Single-seed matches PR openai#2014's 3-seed mean. Eval 521.7s (under 600s cap), Size 15,946,610. softcap_pos and softcap_neg init to logit_softcap=30.0, adapted per-doc via TTT-LoRA optimizer.
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request on May 1, 2026:
User pushed back on openai#2014's LEAK call as too inference-based. Verified directly:

- README says "uses same shards as PR openai#1855. If you don't have them, prepare with included prepare_caseops_data.py" — the phrasing implies inheritance from openai#1855 (LEAK) but doesn't explicitly invoke prep
- No setup.sh, no shell script invoking prep
- No HF download script
- Path /dev/shm/pgolf_caseops_data_80_l17_final is a custom flat RAM-disk dir (not the triple-nested local-prep signature)
- Could be either an HF-flattened download OR a local-prep copy

Demoted openai#2014 from LEAK to AMBIGUOUS (lean LEAK based on the "same shards as openai#1855" English, but not iron-clad). Updated tally: CLEAN 9, LEAK 20 (was 21), AMBIGUOUS 4 (was 3), INHERIT 1.
TanishGudise added a commit to TanishGudise/parameter-golf that referenced this pull request on May 1, 2026:
…E_OUTSIDE=0

Seed 314: pre-quant 1.06128 / quant 1.06962 / final 1.05701 / eval 571.7s

Compliance: ngram_hint_precompute_outside=False, precompute (166.95s) INSIDE timer per PR openai#1514 precedent. Token-only tilt: within_gate=0, word_gate=0 - legal per PR openai#1514. Size 15,943,530 bytes.

Single seed beats openai#2014's 3-seed mean (1.05759). Validating seeds 42 and 1234.
varunneal added a commit to varunneal/parameter-golf that referenced this pull request on May 1, 2026.
TanishGudise added a commit to TanishGudise/parameter-golf that referenced this pull request on May 1, 2026:
Beats PR openai#1855 (merged rank 1, 1.06108) by 0.00438 BPB. Beats PR openai#2014 (best open, 1.05759) by 0.00089 BPB. Beats PR openai#2060 (1.05792) by 0.00122 BPB.

Stack:

- Token-only n-gram tilt (PR openai#1514 merged precedent, within/word channels disabled)
- AsymLogit Rescale (2 trainable scalars adapted by global TTT)
- 3 hyperparameter levers from PR openai#2060 (MATRIX_LR=0.028, LQER_ASYM_GROUP=32, TTT_LORA_LR=8e-5)
- PHASED_TTT_NUM_PHASES=1 (matches PR openai#2014)
- NGRAM_HINT_PRECOMPUTE_OUTSIDE=0 (precompute INSIDE eval timer per PR openai#1514)

Compliance:

- All seeds eval ≤533.1s (cap 600s, 67-80s margin)
- All artifacts ≤15.95MB (cap 16MB)
- Token-only n-gram channel (within_gate=0, word_gate=0)
- Score-first TTT (per PR openai#402)
varunneal added a commit to varunneal/parameter-golf that referenced this pull request on May 1, 2026:
…LR=0.00015, WD=0.25
codemath3000 added a commit to codemath3000/parameter-golf that referenced this pull request on May 2, 2026.
Record candidate: SP8192 CaseOps + Progressive 3k Context Growth + Short-Doc Score-First TTT
val_bpb: 1.05759 (3-seed mean, std 0.00034) | val_loss: 2.31441 nats (std 0.00075) | 15.98 MB max | 8xH100 SXM | 600s train / 600s eval
Improvement over merged PR #1855 leaderboard record (1.06107587 BPB):
-0.00348 BPB / -0.00762 nats
This stacks a progressive training-context schedule and a short-document TTT schedule on top of the late-April CaseOps/SP8192/LQER/SparseAttnGate/BOS-fixed SmearGate lineage. The direct leaderboard comparison is PR #1855, which is the current merged leader used here as the baseline.
Results
3-seed population std: 0.00034091 BPB / 0.00074604 nats.
All included seeds are under the 16,000,000-byte artifact cap and the 600s train/eval budgets as logged. The maximum artifact is 15,984,387 bytes and the maximum validation-data TTT pass is 572.4s.
Full validation coverage
All three logs evaluate the full CaseOps validation shard target set:
`val_tokens == target_tokens`. The training script explicitly keeps the validation tail via `EVAL_INCLUDE_TAIL=1`. This avoids the older multiple-of-context truncation and makes the standard diagnostic eval and the quantized TTT eval agree on the same target count; a sketch of the tail-inclusive windowing follows below.

The tokenizer, CaseOps transform, training shards, validation shard, and byte sidecar format are the same canonical HF-hosted CaseOps export used by the merged PR #1855 setup. If a reviewer already has the clean #1855/HF CaseOps data staged, those same staged shards can be reused here. The included tokenizer/prep files are present only to make this submission self-contained; the preferred reproduction path is to download the canonical HF CaseOps export directly.
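A minimal sketch of what `EVAL_INCLUDE_TAIL=1` changes, assuming a strided sliding-window eval (the function and variable names here are illustrative, not the submission's API):

```python
def eval_windows(n_tokens: int, seq_len: int = 3072, stride: int = 1536,
                 include_tail: bool = True):
    """Yield (start, end) context windows over the validation token stream."""
    start, covered = 0, 0
    while start + seq_len <= n_tokens:
        yield start, start + seq_len
        covered = start + seq_len
        start += stride
    # Tail-inclusive behaviour: pin one last window to the end of the stream
    # so the trailing remainder is scored and val_tokens can match
    # target_tokens exactly; without it the remainder is silently dropped.
    if include_tail and covered < n_tokens:
        yield max(0, n_tokens - seq_len), n_tokens
```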
What changed vs PR #1855
This submission keeps the same overall 11-layer SP8192 CaseOps recurrent-transformer family as PR #1855, then adds the following levers:
- Progressive train-context schedule: `[email protected]`, `[email protected]`, `[email protected]`
- `TRAIN_SEQ_LEN=3072`, `EVAL_SEQ_LEN=3072`, `TTT_EVAL_SEQ_LEN=3072`, `EVAL_STRIDE=1536`
- `TTT_MASK=no_qv`, `TTT_Q_LORA=0`, `TTT_V_LORA=0`
- `TTT_LOCAL_LR_MULT=0.75`
- `TTT_SHORT_SCORE_FIRST_STEPS=256:8,2000:24`, default chunk 48
- `PHASED_TTT_NUM_PHASES=1`, `PHASED_TTT_PREFIX_DOCS=2500`
- `QK_GAIN_INIT=5.25`
- `COMPRESSOR=pergroup`, AWQ-lite, asymmetric logit rescale

The short-doc TTT schedule does not train on future validation tokens. It only changes the chunk granularity used inside the existing score-before-update loop: each chunk is scored first, then the LoRA update is applied for future chunks; see the sketch below.
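A minimal sketch of that score-before-update ordering, with hypothetical `score_chunk`/`lora_step` callables standing in for the script's internals:

```python
def score_first_ttt(doc_tokens, chunk_size, score_chunk, lora_step):
    """Evaluate a document chunk by chunk: score each chunk with the weights
    adapted so far, then apply the LoRA update so it can only influence
    later chunks. No chunk is scored after training on its own tokens."""
    total_loss, total_targets = 0.0, 0
    for start in range(0, len(doc_tokens), chunk_size):
        chunk = doc_tokens[start:start + chunk_size]
        loss, n_targets = score_chunk(chunk)  # score BEFORE updating on this chunk
        total_loss += loss * n_targets
        total_targets += n_targets
        lora_step(chunk)  # update applies only to subsequent chunks
    return total_loss / max(total_targets, 1)
```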
Architecture and training stack
- `frac=0.35`
- `gate_window=12`, scale 0.5
- `BETA2=0.99`
- `ema_decay=0.9965`
- `GPTQ_RESERVE_SECONDS=4.0`; logs show `gptq: reserving 4s, effective=596000ms`

Compliance notes
- `GPTQ_RESERVE_SECONDS=4.0`; GPTQ Hessian collection is logged immediately after (67 Hessians in 4.1s) for transparency.
- The `ttt_lora: compile warmup` uses random tokens and no validation data; it is logged separately from `total_eval_time`.
- `quantized_ttt_phased` scores each chunk before applying that chunk's LoRA update. The short-doc schedule only changes chunk size.
- `val_tokens == target_tokens == 47853343` in all included logs.
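For reference, one way a spec like `TTT_SHORT_SCORE_FIRST_STEPS=256:8,2000:24` could map document length to chunk size (the parsing below is a guess at the format, not the script's actual code):

```python
def parse_chunk_schedule(spec: str, default_chunk: int = 48):
    """Parse 'max_len:chunk,max_len:chunk' into a doc-length -> chunk-size rule."""
    pairs = sorted(tuple(map(int, item.split(":"))) for item in spec.split(","))

    def chunk_for(doc_len: int) -> int:
        for max_len, chunk in pairs:
            if doc_len <= max_len:
                return chunk  # shorter docs get finer score-first chunks
        return default_chunk  # long docs keep the default granularity

    return chunk_for

chunk_for = parse_chunk_schedule("256:8,2000:24")
assert chunk_for(100) == 8 and chunk_for(1500) == 24 and chunk_for(5000) == 48
```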
Reproduction

Install the dependencies in `requirements.txt`. FlashAttention 3 and the `lrzip` system binary are noted there because they require separate install paths.

This submission uses the clean canonical CaseOps SP8192 export hosted on Hugging Face. The logs were produced from a 50,000-document validation split with 80 training shards (`train_shards: 80`, `ttt_phased: total_docs: 50000`, and `val_tokens == target_tokens == 47853343` in every included log).

Preferred data setup:
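The original download commands are not preserved on this page; a typical staging flow with huggingface-cli would look like the following, where the repo id and destination are placeholders, not the real export name:

```bash
# Placeholder repo id and destination; substitute the canonical CaseOps SP8192 export.
huggingface-cli download <org>/<caseops-sp8192-export> \
    --repo-type dataset --local-dir /dev/shm/pgolf_caseops_data
```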
Then set:
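The environment block itself is also not preserved; based on the run instructions below, it presumably points the script at the staged shards and the included tokenizer (paths are placeholders):

```bash
# Placeholder paths; point these at wherever the CaseOps export was staged.
export DATA_PATH=/dev/shm/pgolf_caseops_data
export TOKENIZER_PATH=tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model
```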
Fallback local rebuild: if the HF export is unavailable, rebuild from the canonical `docs_selected.jsonl` with the included `prepare_caseops_data.py`, `lossless_caps.py`, and tokenizer. Use `--val-docs 50000` and write into a fresh output directory. The prep script now defaults to 50,000 validation docs and refuses to write over existing `fineweb_*.bin` shards unless `--overwrite` is passed, to avoid accidentally mixing stale validation shards with a new train split.
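A sketch of the fallback invocation; only `--val-docs` and `--overwrite` are confirmed by this README, and the input/output flag names are assumptions about the script's interface:

```bash
# --val-docs and --overwrite are documented above; --input/--output-dir are illustrative.
python prepare_caseops_data.py \
    --input docs_selected.jsonl \
    --output-dir /dev/shm/pgolf_caseops_data_fresh \
    --val-docs 50000
```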
DATA_PATHandTOKENIZER_PATHwith the staged CaseOps paths:Included files
Included files

- `train_gpt.py` - full training/eval script used for the logs.
- `train_seed42.log`, `train_seed314.log`, `train_seed0.log` - full per-seed logs.
- `submission.json` - structured metadata and per-seed results.
- `README.md` - this file.
- `requirements.txt` - Python dependencies plus notes for FA3 and `lrzip`.
- `prepare_caseops_data.py` - fallback CaseOps dataset/token/byte-sidecar preparation; defaults to the canonical 50,000-doc validation split and refuses mixed/stale output directories by default.
- `lossless_caps.py` - reversible CaseOps transform, same as the PR #1855 CaseOps setup.
- `tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model` - SentencePiece tokenizer used by the logs; identical CaseOps tokenizer lineage as PR #1855.

Lineage and credits
This submission is a stack on top of the public CaseOps/SP8192 record lineage: #1797 -> #1851 -> #1855 -> #1908 -> #1923 -> #1953 -> this PR.
The new contribution here is the combination of progressive 3k train/eval context growth with the short-document score-first TTT chunk schedule, while preserving the full validation target count and staying under the artifact/eval budgets.