Record: CaseOps Tokenizer + Tapered WD - val_bpb 1.0678 (3-seed mean) #1729

Merged

cocohearts merged 1 commit into openai:main from romeerp:codex/caseops-pr1626-taper on Apr 29, 2026

Conversation

@romeerp (Contributor) commented Apr 19, 2026

Summary

  • val_bpb: 1.06780 (3-seed mean, std 0.00037) | ~15.94 MB | 8xH100 80GB SXM
  • Builds on the legal multi-phase TTT stack from PR #1626 (Record: VarLen Attention + Fused MLP + Multi-Phase Global SGD TTT — val_bpb 1.07193, 3-seed mean)
  • Replaces the standard sp8192 tokenizer with a lossless CaseOps tokenizer and dataset export, hosted publicly at romeerp/parameter-golf-caseops-v1
  • Adds a mild late Muon WD taper: WD_TAPER_START_FRAC=0.70, WD_TAPER_FINAL_MULT=0.50
  • Keeps phased TTT legal and score-first while improving pretrained, quantized, and post-TTT BPB

Results

| Seed | Steps | Pre-Quant BPB | Quantized BPB | Post-TTT BPB | Artifact (bytes) |
|------|-------|---------------|---------------|--------------|------------------|
| 0    | 4,921 | 1.07032992    | 1.08152131    | 1.06805820   | 15,932,307       |
| 42   | 4,866 | 1.07065549    | 1.08171495    | 1.06806595   | 15,935,802       |
| 1234 | 4,870 | 1.06971629    | 1.08036614    | 1.06727867   | 15,943,106       |
| Mean |       | 1.07023390    | 1.08120080    | 1.06780094   | 15,937,072       |

All 3 seeds are under the 600s train budget, under the 600s eval budget, and under the 16 MB artifact cap.

Method

This submission combines two ideas:

  1. Lossless CaseOps tokenizer

    • Uses the reversible lossless_caps_caseops_v1 transform.
    • Factorizes text into a lowercase lexical stream plus a tiny capitalization side channel (TITLE, ALLCAPS, CAPNEXT, ESC).
    • Original text is reconstructed exactly by replaying those capitalization operators over the lowercase stream, so no information is discarded (see the encode/decode sketch after this list).
    • This reduces redundant case fragmentation in the main token stream while preserving exact recoverability.
    • Validation BPB is still charged against exact original UTF-8 bytes using exported validation byte sidecars.
  2. Mild tapered weight decay

    • Keeps full Muon WD early in training when regularization/compressibility pressure matters most.
    • Tapers linearly to half the base WD late in training, where weights are more settled and the optimization benefit appears to outweigh the regularization benefit (a sketch of the schedule follows the next paragraph).
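
Illustrative sketch of the case-factoring idea (not the shipped lossless_caps.py): the operator code points follow the U+E001..U+E004 reservation mentioned in the tokenizer spec, but their exact assignment and the casing heuristics below are assumptions.

```python
import re

# Hypothetical assignment of the four reserved operator code points.
TITLE, ALLCAPS, CAPNEXT, ESC = "\ue001", "\ue002", "\ue003", "\ue004"
OPS = {TITLE, ALLCAPS, CAPNEXT, ESC}

def encode(text: str) -> str:
    out = []
    for tok in re.split(r"(\s+)", text):          # keep whitespace runs
        if tok == "" or tok.isspace():
            out.append(tok)
            continue
        esc = "".join(ESC + c if c in OPS else c for c in tok)
        if len(esc) > 1 and esc.isupper():
            cand = ALLCAPS + esc.lower()          # WORD -> <ALLCAPS>word
        elif esc[:1].isupper() and esc[1:].islower():
            cand = TITLE + esc.lower()            # Word -> <TITLE>word
        else:                                     # mixed case: per-char op
            cand = "".join(CAPNEXT + c.lower() if c.isupper() else c
                           for c in esc)
        # Unicode casing is not always a clean round trip; fall back to
        # escaping every character when replay would not be exact.
        out.append(cand if decode(cand) == tok
                   else "".join(ESC + c for c in tok))
    return "".join(out)

def decode(s: str) -> str:
    out, i, allcaps = [], 0, False
    while i < len(s):
        c = s[i]
        if c == ESC:
            out.append(s[i + 1]); i += 2          # literal next char
        elif c == CAPNEXT:
            out.append(s[i + 1].upper()); i += 2  # uppercase next char
        elif c == TITLE:
            out.append(s[i + 1].upper()); i += 2  # uppercase word head
        elif c == ALLCAPS:
            allcaps = True; i += 1                # uppercase until whitespace
        else:
            if c.isspace():
                allcaps = False
            out.append(c.upper() if allcaps else c)
            i += 1
    return "".join(out)

assert decode(encode("NASA Said: visit McDonald's")) == "NASA Said: visit McDonald's"
```

The lowercase stream is what the BPE vocabulary sees, so "The"/"the"/"THE" no longer fragment into distinct pieces; the operators carry the casing bits in-band and the round-trip check guarantees losslessness.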

The base architecture and the legal phased-TTT evaluation flow come from PR #1626; this submission changes only the tokenizer/data path and the late WD schedule (sketched below).
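
A minimal sketch of the taper, assuming a step-indexed multiplier applied to the Muon parameter groups' weight decay (per the port notes further down, Adam/embedding WD is untouched; the exact train_gpt.py wiring is not shown):

```python
import os

# Env-gated knobs; the defaults reproduce this submission's setting.
WD_TAPER_START_FRAC = float(os.environ.get("WD_TAPER_START_FRAC", "0.70"))
WD_TAPER_FINAL_MULT = float(os.environ.get("WD_TAPER_FINAL_MULT", "0.50"))

def wd_multiplier(step: int, total_steps: int) -> float:
    """1.0 until 70% of training, then linear decay to 0.5 at the last step."""
    start = WD_TAPER_START_FRAC * total_steps
    if step <= start:
        return 1.0
    frac = (step - start) / max(total_steps - start, 1e-9)
    return 1.0 + frac * (WD_TAPER_FINAL_MULT - 1.0)

# Hypothetical usage inside the training loop, before optimizer.step():
# for group in muon_optimizer.param_groups:
#     group["weight_decay"] = base_wd * wd_multiplier(step, num_iterations)
```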

Why the tokenizer is still metric-correct

The exporter writes:

  • fineweb_val_000000.bin
  • fineweb_val_bytes_000000.bin

The trainer loads the byte sidecar directly and logs:

  • val_bpb:byte_sidecar:enabled

So BPB is computed against original raw-byte counts, not the transformed token stream length.
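
In sketch form (dtype and alignment are assumptions; the real loader and eval paths live in train_gpt.py):

```python
import math
import numpy as np
import torch.nn.functional as F

def val_bpb(logits, targets, sidecar_path="fineweb_val_bytes_000000.bin"):
    # Assumed sidecar layout: one integer per validation token giving the
    # UTF-8 byte count of the ORIGINAL (pre-CaseOps) text it covers.
    token_bytes = np.fromfile(sidecar_path, dtype=np.uint16)
    nll_nats = F.cross_entropy(logits, targets, reduction="sum").item()
    # Bits per byte: charge total cross-entropy against original raw bytes,
    # not against the (shorter) transformed token stream.
    return nll_nats / (math.log(2) * token_bytes[: targets.numel()].sum())
```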

Legality

  • Attention remains causal.
  • Scoring uses standard normalized cross-entropy.
  • Phased TTT remains score-first in the sense of PR #1626 (Record: VarLen Attention + Fused MLP + Multi-Phase Global SGD TTT).
  • No validation token is used for adaptation before its score is counted.
  • The tokenizer change is legality-preserving because it is a fully reversible preprocessing transform applied uniformly before tokenization.
  • No information is discarded: the original text can be reconstructed exactly from the lowercase stream plus the capitalization operators.
  • BPB is still charged against the original raw UTF-8 bytes through the exported validation byte sidecar, not against transformed text length.

Reproducibility

Public artifacts:

  • Dataset + tokenizer: romeerp/parameter-golf-caseops-v1

The record folder includes:

  • train_gpt.py
  • README.md
  • submission.json
  • requirements.txt
  • cached_challenge_fineweb.py
  • download_hf_docs_and_tokenize.py
  • lossless_caps.py
  • tokenizer_specs_export_caseops_v1_reserved_only.json
  • all 3 seed logs

Run instructions

From the record directory:

cd records/track_10min_16mb/2026-04-18_PR1626_CaseOps_Taper

Prepare the public HF tokenizer + dataset:

MATCHED_FINEWEB_REPO_ID=romeerp/parameter-golf-caseops-v1 \
MATCHED_FINEWEB_REMOTE_ROOT_PREFIX=datasets \
python3 cached_challenge_fineweb.py \
  --variant sp8192_lossless_caps_caseops_v1_reserved \
  --train-shards 80

Train + quantize + phased eval for one seed:

NCCL_NET=Socket \
SEED=0 \
TOKENIZER_PATH=./tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model \
DATASETS_DIR=./datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved \
torchrun --standalone --nproc_per_node=8 train_gpt.py \
  > train_seed0.log 2>&1

@romeerp force-pushed the codex/caseops-pr1626-taper branch from dc1b643 to 59b55a5 on April 19, 2026 02:15
@romeerp romeerp marked this pull request as ready for review April 19, 2026 02:26
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 19, 2026
Round35 W99 was replaying standard sp8192 because the worker evaluator hard-coded the generic SP8192 downloader and tokenizer path. The worker now detects the CaseOps spec, downloads the romeerp HF export, enables validation byte sidecars, and points train_gpt at the lossless CaseOps dataset/tokenizer surface.

Constraint: W99 must match PR openai#1729's public CaseOps dataset/tokenizer path closely enough to make the replay meaningful

Rejected: Keep relaunching the generic SP8192 surface | it cannot validate the CaseOps claim

Confidence: high

Scope-risk: narrow

Directive: If a future tokenizer lane ships a spec file plus custom downloader surface, patch evaluator data setup before trusting any replay

Tested: python3 -m py_compile evaluate.py train_gpt.py data/cached_challenge_fineweb.py

Not-tested: End-to-end HF CaseOps download on Lepton
dexhunter added a commit to dexhunter/parameter-golf that referenced this pull request Apr 19, 2026
… — val_bpb 1.06549

3-seed mean 1.06549 (std 0.00070) on 8×H100 SXM, all gates green:
- artifact 15,975,120 bytes mean (≤16,000,000 DECIMAL)
- train_time 596.14s mean (≤600s)
- total_eval_time 397.23s mean (≤600s)

Builds on PR openai#1530 SP8192 stack. Adopts CaseOps (lossless_caps_caseops_v1)
bijective case preprocessing from PR openai#1729 with a per-token byte sidecar
so BPB is scored on original pre-transform UTF-8 bytes. Adds a learned
attention out-gate (init_std=0.005) + quant-gate scaling that recovers
the ~40 KB of overhead introduced by the new control tokens, keeping
every seed under the 16 MB decimal cap.

Seeds: 42 (1.06610), 0 (1.06473), 1234 (1.06563).
alertcat added a commit to alertcat/parameter-golf that referenced this pull request Apr 19, 2026
Adds byte sidecar loading to enable CaseOps lossless-case tokenizer (PR openai#1729).
Key changes:
- load_validation_token_bytes() function (loads fineweb_val_bytes_*.bin)
- ValidationData.val_token_bytes field with sidecar fallback to LUT
- eval_val/eval_val_sliding/eval_val_ttt prefer sidecar when available
- TTT_EMA_ENABLED default 1 -> 0 (V14 lesson: EMA hurts monotonic-decrease TTT)

V14 EMA result: 1.0427 (worse than baseline due to monotonic TTT loss).
V15 hypothesis: CaseOps gives -0.005 to -0.012 BPB by saving bits via case dedup,
landing in 1.030-1.038 range (50% chance of breaking record at 1.0357).
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 19, 2026
…ad, MP-SGD TTT 4-phase

- PR openai#1698 (GDN FLA, claimed 1.00995): BPB bug confirmed by dexhunter
  (~1.189 actual) + artifact size violation; effectively dead
- New technique: CaseOps bijective tokenizer (PR openai#1729/openai#1736/openai#1738) —
  reversible case-factoring with byte sidecar; stronger legality than
  casefold; await Issue openai#1604 ruling
- PR openai#1735 (pre-quant TTT 21ep) flagged illegal by dexhunter; PR openai#1738
  builds on it, both likely void
- PR openai#1727 (MP-SGD TTT 4 phases, 1.07217): appears legal, stackable
- Merged SOTA 1.0810 Day 10 plateau; 11 days to deadline

https://claude.ai/code/session_012mo6412sGQRVjF7TDmfx31
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
…approach

Read openai#1695's diff. Their approach is fundamentally different from
the static-weight-rotation + folds design I had in mind for 'full'
mode. They do ONLINE activation rotation: 4 global Hadamard rotations
inserted as x @ R matmuls at 4 forward-pass sites (residual->qkv,
attn-out->proj, residual->fc, hidden->proj). GPTQ then quantizes in
the rotated basis; rotated Hessians keep the quant-side accounting
honest. Rotations OFF during training, ON after deserialize for
eval+TTT.

Why this matters: their scheme sidesteps BOTH blockers that made the
full mode complicated:
- LeakyReLU non-equivariance: R_mlp_proj_in is applied AFTER the
  LeakyReLU-square, not across it.
- resid_mix: rotations are per-linear-input, never touch the
  residual stream. All per-channel multipliers (attn_scale,
  mlp_scale, resid_mix, skip_weights) operate in unchanged basis.

No float invariance — the model IS different post-rotation. The bet
is that the rotated-basis GPTQ delivers lower quant error and that
the perturbation is smaller than the savings.

Implication: deprecate the 'full' static-rotation-with-folds plan in
favor of a future 'port_1695' spec that ports their online scheme.
Internal_only mode from spec 009 remains useful as an independent
data point (R_a only, fp-invariant).

Spec 010 (tapered WD) drafted as an independent parallel track:
- Ports PR openai#1729's WD_TAPER_START_FRAC=0.70, WD_TAPER_FINAL_MULT=0.50
- Muon-WD-only taper on top of openai#1736's existing schedule
- Full retrain on 8xH100, single seed, ~$20
- Independent of spec 009 (different pod, no shared state)
- Can run in parallel with 009's eval-only sweep

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
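
A toy numpy illustration of the rotated-basis bet described above, with round-to-nearest standing in for GPTQ (everything here is an assumed simplification, not code from openai#1695):

```python
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(0)
d = 64
R = hadamard(d) / np.sqrt(d)                  # orthonormal Hadamard rotation

def rtn_quantize(w, n_bits=4):
    # Crude symmetric round-to-nearest; GPTQ would do much better, but the
    # basis effect shows up even here.
    s = np.abs(w).max() / (2 ** (n_bits - 1) - 1)
    return np.round(w / s) * s

W = rng.standard_normal((d, d))
W[rng.random((d, d)) < 0.01] *= 8             # inject a few outlier weights
x = rng.standard_normal((4096, d))

y_ref   = x @ W                               # float reference
y_plain = x @ rtn_quantize(W)                 # quantize in the original basis
y_rot   = (x @ R) @ rtn_quantize(R.T @ W)     # rotate input, quantize R^T W

print("plain basis err:  ", np.abs(y_plain - y_ref).mean())
print("rotated basis err:", np.abs(y_rot - y_ref).mean())
```

In exact arithmetic (x @ R) @ (R.T @ W) equals x @ W, so any gap comes from the quantizer; the rotation spreads outlier weights across channels, giving the max-based scale less to absorb.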
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
User flagged that port_1695 should be the next spec (higher-impact,
natural follow-up to 009) rather than tapered WD. Reshuffled:

- 010-port-1695-online-rotation.md (NEW) — port openai#1695's online
  Hadamard rotation with rotated-basis GPTQ. Hotstart off spec 008
  pre_gptq.pt. Expected Delta -0.003 to -0.005 bpb vs spec 009
  baseline. ~$10, 8xH100.

- 011-tapered-wd.md (renumbered from 010) — Muon WD taper from openai#1729.
  Full retrain, ~$20. Independent of specs 009/010, can run in
  parallel.

Spec 010 inherits the design analysis from research/ideas/
spinquant-integration-notes.md (addendum section). Depends on spec
009 baseline measurement for apples-to-apples Delta.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
Bundles three orthogonal training-time levers into one retrain:
- tapered Muon WD (port openai#1729, originally spec 011)
- GradPower p=0.9 (port openai#1682)
- softer QK_GAIN init 5.0 → 2.5 (port openai#1648, simplified from per-layer
  convergence)

Code patch at exp/training-bundle (commit 8d54854). All env-gated with
no-op defaults.

Supersedes spec 011 which is kept as a design-doc reference.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
Four new env-gated hyperparameters, all default to no-op so spec 008 is
byte-identical when the vars are unset:

- WD_TAPER_START_FRAC / WD_TAPER_FINAL_MULT (port openai#1729): linear Muon WD
  taper from 1.0 at start_step to final_mult at h.iterations. Applied in
  step_fn before optimizers.step. Adam/embed WD untouched per openai#1729.
- MUON_GRAD_POWER (port openai#1682): g = sign(g) * |g|^p, applied to Muon
  gradients just before the momentum buffer update. Covers both sharded
  (shard path) and non-sharded paths.
- QK_GAIN_INIT (existing): already present; the default is unchanged. Setting
  QK_GAIN_INIT=2.5 at runtime gives uniformly softer attention per
  openai#1648's convergence finding.
- QK_GAIN_PER_LAYER (new): comma-sep list, overrides each block's
  attn.q_gain after block construction. Validated to match num_layers.

Also: one startup log line echoing the four values for post-hoc verification.

Spec: research/specs/012-training-bundle.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
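
The GradPower lever is compact enough to sketch exactly as described, assuming a per-tensor Muon gradient g (hook placement per the note above; names hypothetical):

```python
import os
import torch

MUON_GRAD_POWER = float(os.environ.get("MUON_GRAD_POWER", "1.0"))  # 1.0 = no-op

def grad_power(g: torch.Tensor, p: float = MUON_GRAD_POWER) -> torch.Tensor:
    """g -> sign(g) * |g|^p, applied just before the momentum buffer update."""
    if p == 1.0:
        return g
    return torch.sign(g) * g.abs().pow(p)
```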
dexhunter added a commit to dexhunter/parameter-golf that referenced this pull request Apr 24, 2026
…_data.py

The shipped `_token_original_byte_counts` used a try/except surface-walk
that attributed 3 UTF-8 bytes to bare U+E001..U+E004 operator pieces AND
failed to advance `cursor_o`, over-counting validation bytes by ~8.37%
on FineWeb. The training sidecar actually used (built from a different
internal path via `surface_piece_original_byte_counts`) is correct, so
the submitted 1.06157 / 1.06549 metrics are unaffected — but the shipped
prep script could not reproduce the sidecar from a cold checkout.

Swap the buggy inline walker for a direct delegation to
`surface_piece_original_byte_counts` from `lossless_caps.py` (the same
canonical exporter used by PR openai#1729 / the HF-hosted dataset). Verified
on 500 FineWeb val docs: patched output matches the shipped sidecar
token-for-token (0 mismatches) and byte-sum matches true UTF-8 exactly.

Also clean up README prose for the 04-24 record: SmearGate is a gate
on the first GATE_WINDOW=12 feature dims of x_t adding a 1-token
causal lookback (not a 12-token residual window); LQER asymmetric
stores A as INT2 per-matrix and B as INT4 per-group-64 and selects
K=3 whole tensors globally (not per-row output columns).
aquariouseworkman added a commit to aquariouseworkman/parameter-golf that referenced this pull request Apr 27, 2026
…symmetric + Phased TTT

val_bpb = 1.06128 | ~15.95 MB | 8xH100 SXM

Key Change: SmearGate BOS Document Boundary Fix
Builds on PR openai#1797 stack (PR openai#1787 base + SmearGate + LQER Asymmetric) but fixes the SmearGate cross-document leakage bug identified by @cocohearts in PR openai#1797 audit.

The bug: SmearGate 1-token causal lookback does not mask BOS positions, so the final token of document N smears into BOS of document N+1.

Credits
@nprime06 -- PR openai#1787 base stack
@romeerp -- CaseOps transform (PR openai#1729)
@dexhunter -- SmearGate + LQER (PR openai#1797)
@cocohearts -- Identifying SmearGate BOS bug
@abaybektursun -- Score-first TTT (PR openai#549)
@clarkkev -- GPTQ SDClip + SP8192 (PR openai#1394)
Fija pushed a commit to Fija/parameter-golf that referenced this pull request Apr 28, 2026
Phase J (one-time data prep, done):
- train_sp10240_caseops.py: train SentencePiece BPE at vocab=10240 over
  CaseOps-transformed FineWeb. Reserves U+E001..U+E005 as user-defined
  symbols (matches PR openai#1729 / SP8192 reservation set). 96-worker, ~25 min.
- prepare_caseops_data_parallel.py with --sp pointing at the new model
  produces SP10240 caseops shards (~27 GB). Uploaded to private HF
  dataset hf://FijaEE/parameter-golf-sp10240-caseops (1434 train + 5 val
  + 5 val_bytes shards).
- Tokenizer model + vocab file committed under tokenizers/ for git clone.

Phase K (TTT params budget tradeoff, ready to run):
- runpod/phase_k_ttt_tradeoff.sh: train SP8192 V2 baseline once on 8xH100
  (~10 min, saves model.bin), then run TTT_EVAL_ONLY=1 for 4 configs
  reusing the saved artifact:
    K0: grad=1 prefix=2000 phases=3 ctx=2048   (V2 baseline)
    K1: grad=2 prefix=2000 phases=3 ctx=2048   (oracle, expected over-budget)
    K2: grad=2 prefix=1500 phases=1 ctx=2048   (cut prefix+phases)
    K3: grad=2 prefix=2000 phases=3 ctx=1024   (cut ctx)
  Auto-picks the lowest-BPB config that fits 600s for Phase L.

Phase L (3-seed combo, parametrized by Phase K winner):
- runpod/phase_l_combo.sh: PR openai#1797 V2 stack + SP10240 + LoRA rank 96 +
  best TTT params from K. Runs 3 seeds (42, 314, 1234), reports Welch
  t-test vs PR openai#1797 (1.06157±0.00066) and the 0.005-nat record bar.

Hypothesis (per user observation): vocab progression 1024→2048→4096→8192
has been monotonically beneficial; no one in the queue has tried sp10240
without PPM-D. PR openai#1814's lowercase-SP10240 single-seed (1.0742) suggests
~ -0.0015 BPB delta from vocab alone vs PR openai#1797's V2 SP8192 baseline
(1.05998 seed-42). Combined with TTT 2-step bump (PR openai#1812 showed 4-epoch
delivered -0.008 BPB on a different stack) and LoRA rank 96, total
expected ~1.045-1.055 BPB if Phase K finds a feasible budget.
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 29, 2026
… new SOTA 1.0608 imminent; PPM-D concerns raised; final day

- Discovered organizer has 2 pending branches staging 14 new leaderboard records
- BOS-fix branch confirms CaseOps LEGAL (PRs openai#1729/openai#1736/openai#1769/openai#1787 included as records)
- New SOTA when merged: 1.0608 (codemath3000, PR openai#1855); new target ≤1.0558
- Tap-In V6 (PR openai#1518) confirmed legal by organizer branch inclusion
- PPM-D: @valerio-oai raised concerns on PR openai#1835 (3M/40.5M partial data + autoregressivity); do not implement
- SmearGate BOS fix required (top entry PR openai#1855 uses it)
- Updated CLAUDE.md competition strategy + added Session 24 lessons learned
- Added Apr 29 daily research log entry

https://claude.ai/code/session_01AAiiKSwWxDtGTexxogAkeZ
cocohearts merged commit 11bae1d into openai:main on Apr 29, 2026
hilbertmeng pushed a commit to hilbertmeng/parameter-golf that referenced this pull request Apr 30, 2026
… — val_bpb 1.06549 (same commit message as the dexhunter push above)
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request May 1, 2026
Audits every CaseOps-lineage record-track PR (merged + unmerged) since
2026-04-18 for whether val docs are also in the training set.

Working set: 34 PRs (31 from chronological seed list + 3 discovered ancestors:
openai#1908, openai#1923, openai#2007). Boundary nodes openai#1493 / openai#1626 (pre-CaseOps).

Verdicts:
  - CLEAN (8): openai#1729, openai#1851, openai#1868, openai#1908, openai#2019, openai#2027, openai#2031, openai#2068
  - LEAK (25): openai#1736 (our research baseline) → openai#1769 → openai#1787 → openai#1797 → openai#1855 → V21 family (openai#1945, openai#1923, openai#1953, openai#1967) → openai#2018 → openai#2118
    (current claimed frontier 1.04350), plus siblings.
  - INHERIT (1): openai#2050 (eval-only on frozen openai#1915)

Code-level evidence (not README claims):
  - Every shipped prepare_caseops_data.py is byte-identical:
    SHARD_TOKENS=10_000_000, default=10_000 for --val-docs
  - NO PR overrides --val-docs (searched all .sh files in all 34 PRs)
  - cached_challenge_fineweb.py downloads from romeerp/parameter-golf-caseops-v1
    HF dataset whose manifest pins docs_val=50000, docs_train=8181945,
    sums match → CLEAN by construction
  - PR openai#2018's DATASET_AUDIT.md is gold-standard explicit leak description
  - PR openai#2118's submission.json admits "--val-docs=10000 train shards + 50k val eval"

Three signposts:
  - Leak introduced: PR openai#1736 by @dexhunter (Apr 19) — first prepare_caseops_data.py
    default invocation
  - Leak fixed: PR openai#1851 by @aquariouseworkman (Apr 27) — switched to HF dataset
  - Leak re-introduced: PR openai#1855 by @codemath3000 (same day) — rebuilt locally

The merged-leaderboard SOTA (openai#1851/openai#1868 at 1.06128/1.06141) is CLEAN.
The unmerged frontier (openai#2118 at 1.04350) is LEAK. The 0.018 bpb gap is
inflated by val memorization; spec 301 was designed to measure how much
remains under clean data.

Files:
  caseops-memory-leakage/README.md       — overview, methodology, takeaways
  caseops-memory-leakage/verdicts.md     — 34-row master table with evidence
  caseops-memory-leakage/family-tree.md  — ASCII trees with [C]/[L] annotations
anmarhindi added a commit to anmarhindi/parameter-golf that referenced this pull request May 2, 2026
…0.979556)

The cond-PPM mixer used SP-piece UTF-8 bytes (incl. CaseOps sentinel
overhead, 164,594,398 per seed) as the BPB denominator instead of the
canonical raw-text sidecar (151,074,309 per seed) used by every other
CaseOps-lineage record per PR openai#1729 convention. Reported by @codemath3000
on PR openai#2138; thank you.

Per-token NLL is invariant under denominator change, so the correction
is algebraic — no re-eval required, original artifact and logs preserved
as forensic record. New per-seed BPB = old × 164594398 / 151074309 =
old × 1.089493:

  seed 42:   0.97949078 -> 1.067148
  seed 1337: 0.97954725 -> 1.067210
  seed 314:  0.97962885 -> 1.067299
  mean:      0.979556   -> 1.067219  (std ~7.6e-05)
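
The rescaling is reproducible in a few lines (byte counts taken from the message above):

```python
# Per-token NLL is unchanged, so BPB rescales by the ratio of the two
# byte denominators: buggy SP-piece bytes vs canonical sidecar bytes.
buggy_bytes, canonical_bytes = 164_594_398, 151_074_309
ratio = buggy_bytes / canonical_bytes          # ~1.089493
for seed, old in [(42, 0.97949078), (1337, 0.97954725), (314, 0.97962885)]:
    print(seed, round(old * ratio, 6))         # matches the corrected table
```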

On the canonical denominator the submission is +0.006 BPB worse than
PR openai#1855 SOTA (1.06108), so this is no longer a SOTA-claim. LBM still
gives a real -0.034 BPB improvement over sliding-window-alone (1.101347)
on the canonical denominator; the C2-correctness story is unchanged.

This commit only patches interpretation:
  - README.md: prepend Errata section, corrected 3-seed table, source-
    line citations, algebraic derivation; reposition writeup as
    not-SOTA. Original technique writeup retained below.
  - submission.json: corrected val_bpb / val_bpb_per_seed / std /
    eval_canonical_byte_count_per_seed / headline_metric_description;
    add errata{} object with summary, original values, inflation ratio,
    credit, fix-branch pointer.

Forensic items deliberately untouched: train_gpt.py (wrapped, contains
buggy denominator), final_model.int6.ptz, train_seed*.log (each shows
both the buggy 'cond_ppm bytes=164594398' line and the canonical-
correct 'quantized_sliding_window val_bpb' line — the sidecar count
151,074,309 is reverse-solvable from the latter).

Fix lives on cond-ppm-stack of github.com/anmarhindi/parameter-golf-a.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>