Record: CaseOps Tokenizer + Tapered WD - val_bpb 1.0678 (3-seed mean) #1729
Merged
cocohearts merged 1 commit into openai:main on Apr 29, 2026
Conversation
dc1b643 to 59b55a5
resouer added a commit to resouer/parameter-golf that referenced this pull request on Apr 19, 2026
Round35 W99 was replaying standard sp8192 because the worker evaluator hard-coded the generic SP8192 downloader and tokenizer path. The worker now detects the CaseOps spec, downloads the romeerp HF export, enables validation byte sidecars, and points train_gpt at the lossless CaseOps dataset/tokenizer surface.

Constraint: W99 must match PR openai#1729's public CaseOps dataset/tokenizer path closely enough to make the replay meaningful.
Rejected: keep relaunching the generic SP8192 surface; it cannot validate the CaseOps claim.
Confidence: high
Scope-risk: narrow
Directive: if a future tokenizer lane ships a spec file plus a custom downloader surface, patch the evaluator's data setup before trusting any replay.
Tested: python3 -m py_compile evaluate.py train_gpt.py data/cached_challenge_fineweb.py
Not tested: end-to-end HF CaseOps download on Lepton.
dexhunter added a commit to dexhunter/parameter-golf that referenced this pull request on Apr 19, 2026
… — val_bpb 1.06549

3-seed mean 1.06549 (std 0.00070) on 8×H100 SXM, all gates green:
- artifact 15,975,120 bytes mean (≤16,000,000 DECIMAL)
- train_time 596.14s mean (≤600s)
- total_eval_time 397.23s mean (≤600s)

Builds on the PR openai#1530 SP8192 stack. Adopts CaseOps (lossless_caps_caseops_v1) bijective case preprocessing from PR openai#1729 with a per-token byte sidecar so BPB is scored on original pre-transform UTF-8 bytes. Adds a learned attention out-gate (init_std=0.005) plus quant-gate scaling that recovers the ~40 KB of overhead introduced by the new control tokens, keeping every seed under the 16 MB decimal cap.

Seeds: 42 (1.06610), 0 (1.06473), 1234 (1.06563).
alertcat added a commit to alertcat/parameter-golf that referenced this pull request on Apr 19, 2026
Adds byte sidecar loading to enable the CaseOps lossless-case tokenizer (PR openai#1729). Key changes (a loading sketch follows this message):
- load_validation_token_bytes() function (loads fineweb_val_bytes_*.bin)
- ValidationData.val_token_bytes field with sidecar fallback to LUT
- eval_val/eval_val_sliding/eval_val_ttt prefer the sidecar when available
- TTT_EMA_ENABLED default 1 -> 0 (V14 lesson: EMA hurts monotonic-decrease TTT)

V14 EMA result: 1.0427 (worse than baseline due to monotonic TTT loss). V15 hypothesis: CaseOps gives -0.005 to -0.012 BPB by saving bits via case dedup, landing in the 1.030-1.038 range (50% chance of breaking the record at 1.0357).
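A minimal sketch of what the sidecar loading with a LUT fallback could look like; the shard dtype, the per-token-ID byte-length table, and the function signature are assumptions, not the commit's actual code:

```python
from pathlib import Path
import numpy as np

def load_validation_token_bytes(data_dir, val_tokens, byte_len_lut):
    """Per-token original-byte counts for the validation stream.

    Prefers the fineweb_val_bytes_*.bin sidecar shards; falls back to a
    per-token-ID byte-length lookup table (LUT) when no sidecar is present.
    """
    shards = sorted(Path(data_dir).glob("fineweb_val_bytes_*.bin"))
    if shards:
        counts = np.concatenate([np.fromfile(s, dtype=np.uint16) for s in shards])
        return counts.astype(np.int64)
    # Fallback: approximate original bytes from the token IDs alone.
    return byte_len_lut[val_tokens].astype(np.int64)
```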
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request on Apr 19, 2026
…ad, MP-SGD TTT 4-phase

- PR openai#1698 (GDN FLA, claimed 1.00995): BPB bug confirmed by dexhunter (~1.189 actual) plus an artifact size violation; effectively dead
- New technique: CaseOps bijective tokenizer (PR openai#1729/openai#1736/openai#1738) — reversible case-factoring with a byte sidecar; stronger legality footing than casefold; awaiting the Issue openai#1604 ruling
- PR openai#1735 (pre-quant TTT 21ep) flagged illegal by dexhunter; PR openai#1738 builds on it, both likely void
- PR openai#1727 (MP-SGD TTT 4 phases, 1.07217): appears legal, stackable
- Merged SOTA 1.0810, Day 10 plateau; 11 days to deadline

https://claude.ai/code/session_012mo6412sGQRVjF7TDmfx31
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request on Apr 20, 2026
…approach

Read openai#1695's diff. Their approach is fundamentally different from the static-weight-rotation + folds design I had in mind for 'full' mode. They do ONLINE activation rotation: 4 global Hadamard rotations inserted as x @ R matmuls at 4 forward-pass sites (residual->qkv, attn-out->proj, residual->fc, hidden->proj). GPTQ then quantizes in the rotated basis; rotated Hessians keep the quant-side accounting honest. Rotations are OFF during training, ON after deserialize for eval+TTT.

Why this matters: their scheme sidesteps BOTH blockers that made the full mode complicated:
- LeakyReLU non-equivariance: R_mlp_proj_in is applied AFTER the LeakyReLU-square, not across it.
- resid_mix: rotations are per-linear-input and never touch the residual stream. All per-channel multipliers (attn_scale, mlp_scale, resid_mix, skip_weights) operate in the unchanged basis.

No float invariance — the model IS different post-rotation. The bet is that rotated-basis GPTQ delivers lower quant error and that the perturbation is smaller than the savings.

Implication: deprecate the 'full' static-rotation-with-folds plan in favor of a future 'port_1695' spec that ports their online scheme. Internal_only mode from spec 009 remains useful as an independent data point (R_a only, fp-invariant).

Spec 010 (tapered WD) drafted as an independent parallel track:
- Ports PR openai#1729's WD_TAPER_START_FRAC=0.70, WD_TAPER_FINAL_MULT=0.50
- Muon-WD-only taper on top of openai#1736's existing schedule
- Full retrain on 8xH100, single seed, ~$20
- Independent of spec 009 (different pod, no shared state)
- Can run in parallel with 009's eval-only sweep

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
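A minimal sketch of the online-rotation mechanism described above, assuming a power-of-two input width and using scipy's Hadamard construction; the wrapper class and hook point are illustrative placeholders, not openai#1695's actual code:

```python
import torch
from scipy.linalg import hadamard

def hadamard_rotation(dim: int) -> torch.Tensor:
    """Orthonormal Hadamard matrix R (dim must be a power of two)."""
    H = torch.tensor(hadamard(dim), dtype=torch.float32)
    return H / dim ** 0.5  # rows/cols are orthonormal: R @ R.T = I

class RotatedLinearInput(torch.nn.Module):
    """Wrap a linear layer so its input is rotated by R only at eval time.

    Training runs with `rotate` off; after deserialization the flag is
    switched on for eval + TTT, and GPTQ quantizes the wrapped weight
    against Hessians accumulated from the rotated inputs.
    """
    def __init__(self, linear: torch.nn.Linear):
        super().__init__()
        self.linear = linear
        self.register_buffer("R", hadamard_rotation(linear.in_features))
        self.rotate = False  # enable post-deserialize, before eval + TTT

    def forward(self, x):
        return self.linear(x @ self.R if self.rotate else x)
```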
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request on Apr 20, 2026
User flagged that port_1695 should be the next spec (higher-impact, a natural follow-up to 009) rather than tapered WD. Reshuffled:
- 010-port-1695-online-rotation.md (NEW) — port openai#1695's online Hadamard rotation with rotated-basis GPTQ. Hotstart off spec 008 pre_gptq.pt. Expected delta -0.003 to -0.005 bpb vs the spec 009 baseline. ~$10, 8xH100.
- 011-tapered-wd.md (renumbered from 010) — Muon WD taper from openai#1729. Full retrain, ~$20. Independent of specs 009/010, can run in parallel.

Spec 010 inherits the design analysis from research/ideas/spinquant-integration-notes.md (addendum section). Depends on the spec 009 baseline measurement for an apples-to-apples delta.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request on Apr 20, 2026
Bundles three orthogonal training-time levers into one retrain:
- tapered Muon WD (port openai#1729, originally spec 011)
- GradPower p=0.9 (port openai#1682)
- softer QK_GAIN init 5.0 → 2.5 (port openai#1648, simplified from per-layer convergence)

Code patch at exp/training-bundle (commit 8d54854). All env-gated with no-op defaults. Supersedes spec 011, which is kept as a design-doc reference.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request on Apr 20, 2026
Four new env-gated hyperparameters, all defaulting to no-op so spec 008 is byte-identical when the vars are unset (a sketch of how they are read and applied follows this message):
- WD_TAPER_START_FRAC / WD_TAPER_FINAL_MULT (port openai#1729): linear Muon WD taper from 1.0 at start_step to final_mult at h.iterations. Applied in step_fn before optimizers.step. Adam/embed WD untouched, per openai#1729.
- MUON_GRAD_POWER (port openai#1682): g = sign(g) * |g|^p, applied to Muon gradients just before the momentum buffer update. Covers both the sharded and non-sharded paths.
- QK_GAIN_INIT (existing): already present, default unchanged; setting QK_GAIN_INIT=2.5 at runtime gives uniformly softer attention per openai#1648's convergence finding.
- QK_GAIN_PER_LAYER (new): comma-separated list, overrides each block's attn.q_gain after block construction. Validated to match num_layers.

Also: one startup log line echoing the four values for post-hoc verification. Spec: research/specs/012-training-bundle.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
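A minimal sketch of how the WD taper and GradPower knobs could be read and applied, assuming the env-variable names above; step_fn, the Muon optimizer internals, and the exact hook points are placeholders:

```python
import os
import torch

# Env-gated knobs; unset vars leave behavior byte-identical to the base run.
WD_TAPER_START_FRAC = float(os.environ.get("WD_TAPER_START_FRAC", "1.0"))
WD_TAPER_FINAL_MULT = float(os.environ.get("WD_TAPER_FINAL_MULT", "1.0"))
MUON_GRAD_POWER = float(os.environ.get("MUON_GRAD_POWER", "1.0"))

def wd_taper_mult(step: int, total_steps: int) -> float:
    """Linear multiplier on Muon weight decay: 1.0 until start_step,
    then ramping to WD_TAPER_FINAL_MULT by the final step."""
    start_step = int(WD_TAPER_START_FRAC * total_steps)
    if step <= start_step:
        return 1.0
    frac = (step - start_step) / max(1, total_steps - start_step)
    return 1.0 + frac * (WD_TAPER_FINAL_MULT - 1.0)

def grad_power(g: torch.Tensor) -> torch.Tensor:
    """GradPower transform g -> sign(g) * |g|^p; a no-op when p == 1."""
    if MUON_GRAD_POWER == 1.0:
        return g
    return torch.sign(g) * g.abs().pow(MUON_GRAD_POWER)
```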
dexhunter added a commit to dexhunter/parameter-golf that referenced this pull request on Apr 24, 2026
…_data.py

The shipped `_token_original_byte_counts` used a try/except surface-walk that attributed 3 UTF-8 bytes to bare U+E001..U+E004 operator pieces AND failed to advance `cursor_o`, over-counting validation bytes by ~8.37% on FineWeb. The training sidecar actually used (built from a different internal path via `surface_piece_original_byte_counts`) is correct, so the submitted 1.06157 / 1.06549 metrics are unaffected — but the shipped prep script could not reproduce the sidecar from a cold checkout.

Swap the buggy inline walker for a direct delegation to `surface_piece_original_byte_counts` from `lossless_caps.py` (the same canonical exporter used by PR openai#1729 / the HF-hosted dataset). Verified on 500 FineWeb val docs: patched output matches the shipped sidecar token-for-token (0 mismatches) and the byte sum matches true UTF-8 exactly.

Also clean up README prose for the 04-24 record: SmearGate is a gate on the first GATE_WINDOW=12 feature dims of x_t adding a 1-token causal lookback (not a 12-token residual window); LQER asymmetric stores A as INT2 per-matrix and B as INT4 per-group-64 and selects K=3 whole tensors globally (not per-row output columns).
aquariouseworkman added a commit to aquariouseworkman/parameter-golf that referenced this pull request on Apr 27, 2026
…symmetric + Phased TTT

val_bpb = 1.06128 | ~15.95 MB | 8xH100 SXM

Key change: SmearGate BOS document-boundary fix. Builds on the PR openai#1797 stack (PR openai#1787 base + SmearGate + LQER Asymmetric) but fixes the SmearGate cross-document leakage bug identified by @cocohearts in the PR openai#1797 audit. The bug: SmearGate's 1-token causal lookback does not mask BOS positions, so the final token of document N smears into the BOS of document N+1.

Credits:
@nprime06 -- PR openai#1787 base stack
@romeerp -- CaseOps transform (PR openai#1729)
@dexhunter -- SmearGate + LQER (PR openai#1797)
@cocohearts -- identifying the SmearGate BOS bug
@abaybektursun -- score-first TTT (PR openai#549)
@clarkkev -- GPTQ SDClip + SP8192 (PR openai#1394)
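A minimal sketch of the boundary fix as described above, assuming SmearGate blends a 1-token lookback into the first GATE_WINDOW feature dims (per the dexhunter commit earlier in this thread); the tensor shapes and gate parameterization are assumptions:

```python
import torch

GATE_WINDOW = 12  # number of leading feature dims the gate touches

def smear_gate(x: torch.Tensor, gate: torch.Tensor, bos_mask: torch.Tensor):
    """Blend each position's first GATE_WINDOW dims with the previous token's.

    x:        (B, T, D) activations
    gate:     (GATE_WINDOW,) learned gate values in [0, 1]
    bos_mask: (B, T) bool, True where the token is a document BOS

    The fix: zero the gate at BOS positions so the last token of one
    document never smears into the first token of the next.
    """
    prev = torch.zeros_like(x)
    prev[:, 1:] = x[:, :-1]                   # 1-token causal lookback
    keep = (~bos_mask).float().unsqueeze(-1)  # 0 at document boundaries
    g = gate.view(1, 1, -1) * keep            # gate masked at BOS
    out = x.clone()
    out[..., :GATE_WINDOW] = (1 - g) * x[..., :GATE_WINDOW] + g * prev[..., :GATE_WINDOW]
    return out
```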
Fija pushed a commit to Fija/parameter-golf that referenced this pull request on Apr 28, 2026
Phase J (one-time data prep, done):
- train_sp10240_caseops.py: train SentencePiece BPE at vocab=10240 over CaseOps-transformed FineWeb (a sketch of this call follows this message). Reserves U+E001..U+E005 as user-defined symbols (matches the PR openai#1729 / SP8192 reservation set). 96-worker, ~25 min.
- prepare_caseops_data_parallel.py with --sp pointing at the new model produces SP10240 caseops shards (~27 GB). Uploaded to private HF dataset hf://FijaEE/parameter-golf-sp10240-caseops (1434 train + 5 val + 5 val_bytes shards).
- Tokenizer model + vocab file committed under tokenizers/ for git clone.

Phase K (TTT params budget tradeoff, ready to run):
- runpod/phase_k_ttt_tradeoff.sh: train the SP8192 V2 baseline once on 8xH100 (~10 min, saves model.bin), then run TTT_EVAL_ONLY=1 for 4 configs reusing the saved artifact:
  K0: grad=1 prefix=2000 phases=3 ctx=2048 (V2 baseline)
  K1: grad=2 prefix=2000 phases=3 ctx=2048 (oracle, expected over-budget)
  K2: grad=2 prefix=1500 phases=1 ctx=2048 (cut prefix+phases)
  K3: grad=2 prefix=2000 phases=3 ctx=1024 (cut ctx)
- Auto-picks the lowest-BPB config that fits 600s for Phase L.

Phase L (3-seed combo, parametrized by the Phase K winner):
- runpod/phase_l_combo.sh: PR openai#1797 V2 stack + SP10240 + LoRA rank 96 + best TTT params from K. Runs 3 seeds (42, 314, 1234), reports a Welch t-test vs PR openai#1797 (1.06157±0.00066) and the 0.005-nat record bar.

Hypothesis (per user observation): the vocab progression 1024→2048→4096→8192 has been monotonically beneficial, and no one in the queue has tried sp10240 without PPM-D. PR openai#1814's lowercase-SP10240 single-seed (1.0742) suggests roughly -0.0015 BPB from vocab alone vs PR openai#1797's V2 SP8192 baseline (1.05998 seed-42). Combined with the TTT 2-step bump (PR openai#1812 showed 4-epoch delivered -0.008 BPB on a different stack) and LoRA rank 96, total expected ~1.045-1.055 BPB if Phase K finds a feasible budget.
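A minimal sketch of the kind of SentencePiece training call Phase J describes; the input path, character coverage, and thread count are assumptions, while the vocab size and reserved symbols follow the commit message:

```python
import sentencepiece as spm

# CaseOps-transformed FineWeb text, one document per line (path is assumed).
spm.SentencePieceTrainer.train(
    input="fineweb_caseops_train.txt",
    model_prefix="sp10240_caseops",
    model_type="bpe",
    vocab_size=10240,
    # Reserve the CaseOps control codepoints as whole pieces so BPE never
    # splits or merges across them (matches the SP8192 reservation set).
    user_defined_symbols=[chr(cp) for cp in range(0xE001, 0xE006)],
    character_coverage=0.9995,
    num_threads=96,
)
```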
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request on Apr 29, 2026
… new SOTA 1.0608 imminent; PPM-D concerns raised; final day

- Discovered the organizer has 2 pending branches staging 14 new leaderboard records
- BOS-fix branch confirms CaseOps LEGAL (PRs openai#1729/openai#1736/openai#1769/openai#1787 included as records)
- New SOTA when merged: 1.0608 (codemath3000, PR openai#1855); new target ≤1.0558
- Tap-In V6 (PR openai#1518) confirmed legal by organizer branch inclusion
- PPM-D: @valerio-oai raised concerns on PR openai#1835 (3M/40.5M partial data + autoregressivity); do not implement
- SmearGate BOS fix required (top entry PR openai#1855 uses it)
- Updated CLAUDE.md competition strategy + added Session 24 lessons learned
- Added Apr 29 daily research log entry

https://claude.ai/code/session_01AAiiKSwWxDtGTexxogAkeZ
hilbertmeng pushed a commit to hilbertmeng/parameter-golf that referenced this pull request on Apr 30, 2026
… — val_bpb 1.06549

3-seed mean 1.06549 (std 0.00070) on 8×H100 SXM, all gates green:
- artifact 15,975,120 bytes mean (≤16,000,000 DECIMAL)
- train_time 596.14s mean (≤600s)
- total_eval_time 397.23s mean (≤600s)

Builds on the PR openai#1530 SP8192 stack. Adopts CaseOps (lossless_caps_caseops_v1) bijective case preprocessing from PR openai#1729 with a per-token byte sidecar so BPB is scored on original pre-transform UTF-8 bytes. Adds a learned attention out-gate (init_std=0.005) plus quant-gate scaling that recovers the ~40 KB of overhead introduced by the new control tokens, keeping every seed under the 16 MB decimal cap.

Seeds: 42 (1.06610), 0 (1.06473), 1234 (1.06563).
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request on May 1, 2026
Audits every CaseOps-lineage record-track PR (merged + unmerged) since 2026-04-18 for whether val docs are also in the training set. Working set: 34 PRs (31 from the chronological seed list + 3 discovered ancestors: openai#1908, openai#1923, openai#2007). Boundary nodes openai#1493 / openai#1626 (pre-CaseOps).

Verdicts:
- CLEAN (8): openai#1729, openai#1851, openai#1868, openai#1908, openai#2019, openai#2027, openai#2031, openai#2068
- LEAK (25): openai#1736 (our research baseline) → openai#1769 → openai#1787 → openai#1797 → openai#1855 → V21 family (openai#1945, openai#1923, openai#1953, openai#1967) → openai#2018 → openai#2118 (current claimed frontier 1.04350), plus siblings
- INHERIT (1): openai#2050 (eval-only on frozen openai#1915)

Code-level evidence (not README claims):
- Every shipped prepare_caseops_data.py is byte-identical: SHARD_TOKENS=10_000_000, default=10_000 for --val-docs
- NO PR overrides --val-docs (searched all .sh files in all 34 PRs)
- cached_challenge_fineweb.py downloads from the romeerp/parameter-golf-caseops-v1 HF dataset, whose manifest pins docs_val=50000, docs_train=8181945; sums match → CLEAN by construction
- PR openai#2018's DATASET_AUDIT.md is the gold-standard explicit leak description
- PR openai#2118's submission.json admits "--val-docs=10000 train shards + 50k val eval"

Three signposts:
- Leak introduced: PR openai#1736 by @dexhunter (Apr 19) — first prepare_caseops_data.py default invocation
- Leak fixed: PR openai#1851 by @aquariouseworkman (Apr 27) — switched to the HF dataset
- Leak re-introduced: PR openai#1855 by @codemath3000 (same day) — rebuilt locally

The merged-leaderboard SOTA (openai#1851/openai#1868 at 1.06128/1.06141) is CLEAN. The unmerged frontier (openai#2118 at 1.04350) is LEAK. The 0.018 bpb gap is inflated by val memorization; spec 301 was designed to measure how much remains under clean data.

Files:
- caseops-memory-leakage/README.md — overview, methodology, takeaways
- caseops-memory-leakage/verdicts.md — 34-row master table with evidence
- caseops-memory-leakage/family-tree.md — ASCII trees with [C]/[L] annotations
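A minimal sketch of the kind of train/val overlap check such an audit implies, hashing document text and intersecting the sets; the document iteration and report shape are hypothetical, not the audit's actual tooling:

```python
import hashlib

def doc_hashes(docs):
    """Stable content hash per document (docs is any iterable of strings)."""
    return {hashlib.sha256(d.encode("utf-8")).hexdigest() for d in docs}

def leak_report(train_docs, val_docs):
    """Count how many validation documents also appear in the training set."""
    overlap = doc_hashes(train_docs) & doc_hashes(val_docs)
    return {"val_docs": len(set(val_docs)), "leaked": len(overlap)}

# A CLEAN verdict corresponds to leaked == 0.
print(leak_report(["doc a", "doc b"], ["doc b", "doc c"]))  # {'val_docs': 2, 'leaked': 1}
```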
anmarhindi added a commit to anmarhindi/parameter-golf that referenced this pull request on May 2, 2026
…0.979556)

The cond-PPM mixer used SP-piece UTF-8 bytes (including CaseOps sentinel overhead, 164,594,398 per seed) as the BPB denominator instead of the canonical raw-text sidecar (151,074,309 per seed) used by every other CaseOps-lineage record per the PR openai#1729 convention. Reported by @codemath3000 on PR openai#2138; thank you.

Per-token NLL is invariant under the denominator change, so the correction is algebraic — no re-eval required; the original artifact and logs are preserved as a forensic record. New per-seed BPB = old × 164594398 / 151074309 = old × 1.089493:
  seed 42:   0.97949078 -> 1.067148
  seed 1337: 0.97954725 -> 1.067210
  seed 314:  0.97962885 -> 1.067299
  mean:      0.979556   -> 1.067219 (std ~7.6e-05)

On the canonical denominator the submission is +0.006 BPB worse than the PR openai#1855 SOTA (1.06108), so this is no longer a SOTA claim. LBM still gives a real -0.034 BPB improvement over sliding-window-alone (1.101347) on the canonical denominator; the C2-correctness story is unchanged.

This commit only patches interpretation:
- README.md: prepend an Errata section with the corrected 3-seed table, source-line citations, and the algebraic derivation; reposition the writeup as not-SOTA. The original technique writeup is retained below.
- submission.json: corrected val_bpb / val_bpb_per_seed / std / eval_canonical_byte_count_per_seed / headline_metric_description; add an errata{} object with summary, original values, inflation ratio, credit, and a fix-branch pointer.

Forensic items deliberately untouched: train_gpt.py (wrapped, contains the buggy denominator), final_model.int6.ptz, train_seed*.log (each shows both the buggy 'cond_ppm bytes=164594398' line and the canonically correct 'quantized_sliding_window val_bpb' line — the sidecar count 151,074,309 is reverse-solvable from the latter).

Fix lives on cond-ppm-stack of github.com/anmarhindi/parameter-golf-a.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
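The algebraic correction above can be checked directly from the two byte counts quoted in the commit:

```python
# Rescale BPB from the SP-piece byte denominator to the canonical sidecar bytes.
PIECE_BYTES = 164_594_398    # SP-piece UTF-8 bytes (incl. CaseOps sentinels), per seed
SIDECAR_BYTES = 151_074_309  # canonical raw-text bytes, per seed

ratio = PIECE_BYTES / SIDECAR_BYTES  # ≈ 1.089493
for seed, bpb in [(42, 0.97949078), (1337, 0.97954725), (314, 0.97962885)]:
    print(seed, round(bpb * ratio, 6))  # -> 1.067148, 1.067210, 1.067299 (approx.)
```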
Summary
- Replaces sp8192 with a lossless CaseOps tokenizer / dataset export hosted publicly at romeerp/parameter-golf-caseops-v1
- Adds a mild tapered Muon weight decay: WD_TAPER_START_FRAC=0.70, WD_TAPER_FINAL_MULT=0.50

Results
val_bpb 1.0678 (3-seed mean). All 3 seeds are under the 600s train budget, under the 600s eval budget, and under the 16 MB artifact cap.
Method
This submission combines two ideas:
Lossless CaseOps tokenizer
The data path uses the lossless_caps_caseops_v1 transform. Case information is factored out of the text reversibly and carried by dedicated control tokens (TITLE, ALLCAPS, CAPNEXT, ESC), so the original bytes can be reconstructed exactly.
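A minimal, ASCII-focused illustration of the flavor of transform this describes; the marker semantics, codepoints, and tokenizer integration here are assumptions, not the lossless_caps_caseops_v1 implementation:

```python
CAPNEXT = "\uE003"  # illustrative: next character was uppercase in the original
ESC = "\uE004"      # illustrative: escapes a literal control codepoint

def caseops_encode(text: str) -> str:
    """Lowercase the text, inserting a CAPNEXT marker before each character
    that was uppercase, so the mapping stays invertible character-for-character."""
    out = []
    for ch in text:
        if ch in (CAPNEXT, ESC):
            out.append(ESC + ch)              # keep round-tripping exact
        elif ch.isupper():
            out.append(CAPNEXT + ch.lower())
        else:
            out.append(ch)
    return "".join(out)

def caseops_decode(encoded: str) -> str:
    """Exact inverse of caseops_encode."""
    out, i = [], 0
    while i < len(encoded):
        ch = encoded[i]
        if ch == ESC:
            out.append(encoded[i + 1]); i += 2
        elif ch == CAPNEXT:
            out.append(encoded[i + 1].upper()); i += 2
        else:
            out.append(ch); i += 1
    return "".join(out)

assert caseops_decode(caseops_encode("Hello WORLD")) == "Hello WORLD"
```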
Mild tapered weight decay

Muon weight decay is tapered linearly late in training: the multiplier stays at 1.0 until WD_TAPER_START_FRAC=0.70 of the run, then ramps to WD_TAPER_FINAL_MULT=0.50 by the final step; Adam/embedding weight decay is untouched.

The base architecture and legal phased-TTT evaluation flow come from PR #1626; this submission changes the tokenizer/data path and the late WD schedule.
Why the tokenizer is still metric-correct
The exporter writes:
- fineweb_val_000000.bin
- fineweb_val_bytes_000000.bin

The trainer loads the byte sidecar directly and logs:
- val_bpb: …
- byte_sidecar: enabled

So BPB is computed against original raw-byte counts, not the transformed token stream length.
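A short sketch of the scoring this implies: total per-token negative log-likelihood, converted to bits, divided by the original byte count from the sidecar (the array names are placeholders):

```python
import math
import numpy as np

def bits_per_byte(token_nll_nats: np.ndarray, token_bytes: np.ndarray) -> float:
    """BPB = total NLL (converted to bits) / total ORIGINAL UTF-8 bytes.

    token_nll_nats: per-token negative log-likelihood in nats
    token_bytes:    per-token original byte counts from the sidecar
    """
    total_bits = token_nll_nats.sum() / math.log(2)
    return float(total_bits / token_bytes.sum())
```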
Legality
Reproducibility
Public artifacts:
- romeerp/parameter-golf-caseops-v1

The record folder includes:
- train_gpt.py
- README.md
- submission.json
- requirements.txt
- cached_challenge_fineweb.py
- download_hf_docs_and_tokenize.py
- lossless_caps.py
- tokenizer_specs_export_caseops_v1_reserved_only.json

Run instructions
From the record directory:
cd records/track_10min_16mb/2026-04-18_PR1626_CaseOps_Taper

Prepare the public HF tokenizer + dataset:
Train + quantize + phased eval for one seed: