Record: CaseOps Tokenizer + Tapered WD - val_bpb 1.0678 (3-seed mean) #1729

Merged

cocohearts merged 1 commit into openai:main from romeerp:codex/caseops-pr1626-taper on Apr 29, 2026

Conversation

@romeerp (Contributor) commented Apr 19, 2026

Summary

  • val_bpb: 1.06780 (3-seed mean, std 0.00037) | ~15.94 MB | 8xH100 80GB SXM
  • Builds on the legal multi-phase TTT stack from PR #1626 (Record: VarLen Attention + Fused MLP + Multi-Phase Global SGD TTT — val_bpb 1.07193, 3-seed mean)
  • Replaces the standard sp8192 tokenizer with a lossless CaseOps tokenizer and dataset export, hosted publicly at romeerp/parameter-golf-caseops-v1
  • Adds a mild late Muon WD taper: WD_TAPER_START_FRAC=0.70, WD_TAPER_FINAL_MULT=0.50
  • Keeps phased TTT legal and score-first while improving pretrained, quantized, and post-TTT BPB

Results

| Seed | Steps | Pre-Quant BPB | Quantized BPB | Post-TTT BPB | Artifact (bytes) |
|------|-------|---------------|---------------|--------------|------------------|
| 0    | 4,921 | 1.07032992    | 1.08152131    | 1.06805820   | 15,932,307       |
| 42   | 4,866 | 1.07065549    | 1.08171495    | 1.06806595   | 15,935,802       |
| 1234 | 4,870 | 1.06971629    | 1.08036614    | 1.06727867   | 15,943,106       |
| Mean |       | 1.07023390    | 1.08120080    | 1.06780094   | 15,937,072       |

All 3 seeds are under the 600s train budget, under the 600s eval budget, and under the 16 MB artifact cap.

Method

This submission combines two ideas:

  1. Lossless CaseOps tokenizer

    • Uses the reversible lossless_caps_caseops_v1 transform.
    • Factorizes text into a lowercase lexical stream plus a tiny capitalization side channel (TITLE, ALLCAPS, CAPNEXT, ESC).
    • Original text is reconstructed exactly by replaying those capitalization operators over the lowercase stream, so no information is discarded (see the encode/decode sketch after this list).
    • This reduces redundant case fragmentation in the main token stream while preserving exact recoverability.
    • Validation BPB is still charged against exact original UTF-8 bytes using exported validation byte sidecars.
  2. Mild tapered weight decay

    • Keeps full Muon WD early in training when regularization/compressibility pressure matters most.
    • Tapers linearly to half the base WD late in training, where weights are more settled and the optimization benefit appears to outweigh the regularization benefit (a sketch of the schedule follows the next paragraph).
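
Illustrative sketch of the case-factoring idea (not the shipped lossless_caps.py): the operator code points follow the U+E001..U+E004 reservation mentioned in the tokenizer spec, but their exact assignment and the casing heuristics below are assumptions.

```python
import re

# Hypothetical assignment of the four reserved operator code points.
TITLE, ALLCAPS, CAPNEXT, ESC = "\ue001", "\ue002", "\ue003", "\ue004"
OPS = {TITLE, ALLCAPS, CAPNEXT, ESC}

def encode(text: str) -> str:
    out = []
    for tok in re.split(r"(\s+)", text):          # keep whitespace runs
        if tok == "" or tok.isspace():
            out.append(tok)
            continue
        esc = "".join(ESC + c if c in OPS else c for c in tok)
        if len(esc) > 1 and esc.isupper():
            cand = ALLCAPS + esc.lower()          # WORD -> <ALLCAPS>word
        elif esc[:1].isupper() and esc[1:].islower():
            cand = TITLE + esc.lower()            # Word -> <TITLE>word
        else:                                     # mixed case: per-char op
            cand = "".join(CAPNEXT + c.lower() if c.isupper() else c
                           for c in esc)
        # Unicode casing is not always a clean round trip; fall back to
        # escaping every character when replay would not be exact.
        out.append(cand if decode(cand) == tok
                   else "".join(ESC + c for c in tok))
    return "".join(out)

def decode(s: str) -> str:
    out, i, allcaps = [], 0, False
    while i < len(s):
        c = s[i]
        if c == ESC:
            out.append(s[i + 1]); i += 2          # literal next char
        elif c == CAPNEXT:
            out.append(s[i + 1].upper()); i += 2  # uppercase next char
        elif c == TITLE:
            out.append(s[i + 1].upper()); i += 2  # uppercase word head
        elif c == ALLCAPS:
            allcaps = True; i += 1                # uppercase until whitespace
        else:
            if c.isspace():
                allcaps = False
            out.append(c.upper() if allcaps else c)
            i += 1
    return "".join(out)

assert decode(encode("NASA Said: visit McDonald's")) == "NASA Said: visit McDonald's"
```

The lowercase stream is what the BPE vocabulary sees, so "The"/"the"/"THE" no longer fragment into distinct pieces; the operators carry the casing bits in-band and the round-trip check guarantees losslessness.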

The base architecture and the legal phased-TTT evaluation flow come from PR #1626; this submission changes only the tokenizer/data path and the late WD schedule (sketched below).
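
A minimal sketch of the taper, assuming a step-indexed multiplier applied to the Muon parameter groups' weight decay (per the port notes further down, Adam/embedding WD is untouched; the exact train_gpt.py wiring is not shown):

```python
import os

# Env-gated knobs; the defaults reproduce this submission's setting.
WD_TAPER_START_FRAC = float(os.environ.get("WD_TAPER_START_FRAC", "0.70"))
WD_TAPER_FINAL_MULT = float(os.environ.get("WD_TAPER_FINAL_MULT", "0.50"))

def wd_multiplier(step: int, total_steps: int) -> float:
    """1.0 until 70% of training, then linear decay to 0.5 at the last step."""
    start = WD_TAPER_START_FRAC * total_steps
    if step <= start:
        return 1.0
    frac = (step - start) / max(total_steps - start, 1e-9)
    return 1.0 + frac * (WD_TAPER_FINAL_MULT - 1.0)

# Hypothetical usage inside the training loop, before optimizer.step():
# for group in muon_optimizer.param_groups:
#     group["weight_decay"] = base_wd * wd_multiplier(step, num_iterations)
```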

Why the tokenizer is still metric-correct

The exporter writes:

  • fineweb_val_000000.bin
  • fineweb_val_bytes_000000.bin

The trainer loads the byte sidecar directly and logs:

  • val_bpb:byte_sidecar:enabled

So BPB is computed against original raw-byte counts, not the transformed token stream length.
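
In sketch form (dtype and alignment are assumptions; the real loader and eval paths live in train_gpt.py):

```python
import math
import numpy as np
import torch.nn.functional as F

def val_bpb(logits, targets, sidecar_path="fineweb_val_bytes_000000.bin"):
    # Assumed sidecar layout: one integer per validation token giving the
    # UTF-8 byte count of the ORIGINAL (pre-CaseOps) text it covers.
    token_bytes = np.fromfile(sidecar_path, dtype=np.uint16)
    nll_nats = F.cross_entropy(logits, targets, reduction="sum").item()
    # Bits per byte: charge total cross-entropy against original raw bytes,
    # not against the (shorter) transformed token stream.
    return nll_nats / (math.log(2) * token_bytes[: targets.numel()].sum())
```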

Legality

  • Attention remains causal.
  • Scoring uses standard normalized cross-entropy.
  • Phased TTT remains score-first in the sense of PR #1626 (Record: VarLen Attention + Fused MLP + Multi-Phase Global SGD TTT).
  • No validation token is used for adaptation before its score is counted.
  • The tokenizer change is legality-preserving because it is a fully reversible preprocessing transform applied uniformly before tokenization.
  • No information is discarded: the original text can be reconstructed exactly from the lowercase stream plus the capitalization operators.
  • BPB is still charged against the original raw UTF-8 bytes through the exported validation byte sidecar, not against transformed text length.

Reproducibility

Public artifacts:

  • Dataset + tokenizer: romeerp/parameter-golf-caseops-v1

The record folder includes:

  • train_gpt.py
  • README.md
  • submission.json
  • requirements.txt
  • cached_challenge_fineweb.py
  • download_hf_docs_and_tokenize.py
  • lossless_caps.py
  • tokenizer_specs_export_caseops_v1_reserved_only.json
  • all 3 seed logs

Run instructions

From the record directory:

cd records/track_10min_16mb/2026-04-18_PR1626_CaseOps_Taper

Prepare the public HF tokenizer + dataset:

MATCHED_FINEWEB_REPO_ID=romeerp/parameter-golf-caseops-v1 \
MATCHED_FINEWEB_REMOTE_ROOT_PREFIX=datasets \
python3 cached_challenge_fineweb.py \
  --variant sp8192_lossless_caps_caseops_v1_reserved \
  --train-shards 80

Train + quantize + phased eval for one seed:

NCCL_NET=Socket \
SEED=0 \
TOKENIZER_PATH=./tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model \
DATASETS_DIR=./datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved \
torchrun --standalone --nproc_per_node=8 train_gpt.py \
  > train_seed0.log 2>&1

@romeerp force-pushed the codex/caseops-pr1626-taper branch from dc1b643 to 59b55a5 on April 19, 2026 02:15
@romeerp romeerp marked this pull request as ready for review April 19, 2026 02:26
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 19, 2026
Round35 W99 was replaying standard sp8192 because the worker evaluator hard-coded the generic SP8192 downloader and tokenizer path. The worker now detects the CaseOps spec, downloads the romeerp HF export, enables validation byte sidecars, and points train_gpt at the lossless CaseOps dataset/tokenizer surface.

Constraint: W99 must match PR openai#1729's public CaseOps dataset/tokenizer path closely enough to make the replay meaningful

Rejected: Keep relaunching the generic SP8192 surface | it cannot validate the CaseOps claim

Confidence: high

Scope-risk: narrow

Directive: If a future tokenizer lane ships a spec file plus custom downloader surface, patch evaluator data setup before trusting any replay

Tested: python3 -m py_compile evaluate.py train_gpt.py data/cached_challenge_fineweb.py

Not-tested: End-to-end HF CaseOps download on Lepton
dexhunter added a commit to dexhunter/parameter-golf that referenced this pull request Apr 19, 2026
… — val_bpb 1.06549

3-seed mean 1.06549 (std 0.00070) on 8×H100 SXM, all gates green:
- artifact 15,975,120 bytes mean (≤16,000,000 DECIMAL)
- train_time 596.14s mean (≤600s)
- total_eval_time 397.23s mean (≤600s)

Builds on PR openai#1530 SP8192 stack. Adopts CaseOps (lossless_caps_caseops_v1)
bijective case preprocessing from PR openai#1729 with a per-token byte sidecar
so BPB is scored on original pre-transform UTF-8 bytes. Adds a learned
attention out-gate (init_std=0.005) + quant-gate scaling that recovers
the ~40 KB of overhead introduced by the new control tokens, keeping
every seed under the 16 MB decimal cap.

Seeds: 42 (1.06610), 0 (1.06473), 1234 (1.06563).
alertcat added a commit to alertcat/parameter-golf that referenced this pull request Apr 19, 2026
Adds byte sidecar loading to enable CaseOps lossless-case tokenizer (PR openai#1729).
Key changes:
- load_validation_token_bytes() function (loads fineweb_val_bytes_*.bin)
- ValidationData.val_token_bytes field with sidecar fallback to LUT
- eval_val/eval_val_sliding/eval_val_ttt prefer sidecar when available
- TTT_EMA_ENABLED default 1 -> 0 (V14 lesson: EMA hurts monotonic-decrease TTT)

V14 EMA result: 1.0427 (worse than baseline due to monotonic TTT loss).
V15 hypothesis: CaseOps gives -0.005 to -0.012 BPB by saving bits via case dedup,
landing in 1.030-1.038 range (50% chance of breaking record at 1.0357).
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 19, 2026
…ad, MP-SGD TTT 4-phase

- PR openai#1698 (GDN FLA, claimed 1.00995): BPB bug confirmed by dexhunter
  (~1.189 actual) + artifact size violation; effectively dead
- New technique: CaseOps bijective tokenizer (PR openai#1729/openai#1736/openai#1738) —
  reversible case-factoring with byte sidecar; stronger legality than
  casefold; await Issue openai#1604 ruling
- PR openai#1735 (pre-quant TTT 21ep) flagged illegal by dexhunter; PR openai#1738
  builds on it, both likely void
- PR openai#1727 (MP-SGD TTT 4 phases, 1.07217): appears legal, stackable
- Merged SOTA 1.0810 Day 10 plateau; 11 days to deadline

https://claude.ai/code/session_012mo6412sGQRVjF7TDmfx31
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
…approach

Read openai#1695's diff. Their approach is fundamentally different from
the static-weight-rotation + folds design I had in mind for 'full'
mode. They do ONLINE activation rotation: 4 global Hadamard rotations
inserted as x @ R matmuls at 4 forward-pass sites (residual->qkv,
attn-out->proj, residual->fc, hidden->proj). GPTQ then quantizes in
the rotated basis; rotated Hessians keep the quant-side accounting
honest. Rotations OFF during training, ON after deserialize for
eval+TTT.

Why this matters: their scheme sidesteps BOTH blockers that made the
full mode complicated:
- LeakyReLU non-equivariance: R_mlp_proj_in is applied AFTER the
  LeakyReLU-square, not across it.
- resid_mix: rotations are per-linear-input, never touch the
  residual stream. All per-channel multipliers (attn_scale,
  mlp_scale, resid_mix, skip_weights) operate in unchanged basis.

No float invariance — the model IS different post-rotation. The bet
is that the rotated-basis GPTQ delivers lower quant error and that
the perturbation is smaller than the savings.

Implication: deprecate the 'full' static-rotation-with-folds plan in
favor of a future 'port_1695' spec that ports their online scheme.
Internal_only mode from spec 009 remains useful as an independent
data point (R_a only, fp-invariant).

Spec 010 (tapered WD) drafted as an independent parallel track:
- Ports PR openai#1729's WD_TAPER_START_FRAC=0.70, WD_TAPER_FINAL_MULT=0.50
- Muon-WD-only taper on top of openai#1736's existing schedule
- Full retrain on 8xH100, single seed, ~$20
- Independent of spec 009 (different pod, no shared state)
- Can run in parallel with 009's eval-only sweep

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
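
A toy numpy illustration of the rotated-basis bet described above, with round-to-nearest standing in for GPTQ (everything here is an assumed simplification, not code from openai#1695):

```python
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(0)
d = 64
R = hadamard(d) / np.sqrt(d)                  # orthonormal Hadamard rotation

def rtn_quantize(w, n_bits=4):
    # Crude symmetric round-to-nearest; GPTQ would do much better, but the
    # basis effect shows up even here.
    s = np.abs(w).max() / (2 ** (n_bits - 1) - 1)
    return np.round(w / s) * s

W = rng.standard_normal((d, d))
W[rng.random((d, d)) < 0.01] *= 8             # inject a few outlier weights
x = rng.standard_normal((4096, d))

y_ref   = x @ W                               # float reference
y_plain = x @ rtn_quantize(W)                 # quantize in the original basis
y_rot   = (x @ R) @ rtn_quantize(R.T @ W)     # rotate input, quantize R^T W

print("plain basis err:  ", np.abs(y_plain - y_ref).mean())
print("rotated basis err:", np.abs(y_rot - y_ref).mean())
```

In exact arithmetic (x @ R) @ (R.T @ W) equals x @ W, so any gap comes from the quantizer; the rotation spreads outlier weights across channels, giving the max-based scale less to absorb.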
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
User flagged that port_1695 should be the next spec (higher-impact,
natural follow-up to 009) rather than tapered WD. Reshuffled:

- 010-port-1695-online-rotation.md (NEW) — port openai#1695's online
  Hadamard rotation with rotated-basis GPTQ. Hotstart off spec 008
  pre_gptq.pt. Expected Delta -0.003 to -0.005 bpb vs spec 009
  baseline. ~$10, 8xH100.

- 011-tapered-wd.md (renumbered from 010) — Muon WD taper from openai#1729.
  Full retrain, ~$20. Independent of specs 009/010, can run in
  parallel.

Spec 010 inherits the design analysis from research/ideas/
spinquant-integration-notes.md (addendum section). Depends on spec
009 baseline measurement for apples-to-apples Delta.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
Bundles three orthogonal training-time levers into one retrain:
- tapered Muon WD (port openai#1729, originally spec 011)
- GradPower p=0.9 (port openai#1682)
- softer QK_GAIN init 5.0 → 2.5 (port openai#1648, simplified from per-layer
  convergence)

Code patch at exp/training-bundle (commit 8d54854). All env-gated with
no-op defaults.

Supersedes spec 011 which is kept as a design-doc reference.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
Four new env-gated hyperparameters, all default to no-op so spec 008 is
byte-identical when the vars are unset:

- WD_TAPER_START_FRAC / WD_TAPER_FINAL_MULT (port openai#1729): linear Muon WD
  taper from 1.0 at start_step to final_mult at h.iterations. Applied in
  step_fn before optimizers.step. Adam/embed WD untouched per openai#1729.
- MUON_GRAD_POWER (port openai#1682): g = sign(g) * |g|^p, applied to Muon
  gradients just before the momentum buffer update. Covers both sharded
  (shard path) and non-sharded paths.
- QK_GAIN_INIT (existing): already present; the default is unchanged. Setting
  QK_GAIN_INIT=2.5 at runtime gives uniformly softer attention per
  openai#1648's convergence finding.
- QK_GAIN_PER_LAYER (new): comma-sep list, overrides each block's
  attn.q_gain after block construction. Validated to match num_layers.

Also: one startup log line echoing the four values for post-hoc verification.

Spec: research/specs/012-training-bundle.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
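
The GradPower lever is compact enough to sketch exactly as described, assuming a per-tensor Muon gradient g (hook placement per the note above; names hypothetical):

```python
import os
import torch

MUON_GRAD_POWER = float(os.environ.get("MUON_GRAD_POWER", "1.0"))  # 1.0 = no-op

def grad_power(g: torch.Tensor, p: float = MUON_GRAD_POWER) -> torch.Tensor:
    """g -> sign(g) * |g|^p, applied just before the momentum buffer update."""
    if p == 1.0:
        return g
    return torch.sign(g) * g.abs().pow(p)
```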
dexhunter added a commit to dexhunter/parameter-golf that referenced this pull request Apr 24, 2026
…_data.py

The shipped `_token_original_byte_counts` used a try/except surface-walk
that attributed 3 UTF-8 bytes to bare U+E001..U+E004 operator pieces AND
failed to advance `cursor_o`, over-counting validation bytes by ~8.37%
on FineWeb. The training sidecar actually used (built from a different
internal path via `surface_piece_original_byte_counts`) is correct, so
the submitted 1.06157 / 1.06549 metrics are unaffected — but the shipped
prep script could not reproduce the sidecar from a cold checkout.

Swap the buggy inline walker for a direct delegation to
`surface_piece_original_byte_counts` from `lossless_caps.py` (the same
canonical exporter used by PR openai#1729 / the HF-hosted dataset). Verified
on 500 FineWeb val docs: patched output matches the shipped sidecar
token-for-token (0 mismatches) and byte-sum matches true UTF-8 exactly.

Also clean up README prose for the 04-24 record: SmearGate is a gate
on the first GATE_WINDOW=12 feature dims of x_t adding a 1-token
causal lookback (not a 12-token residual window); LQER asymmetric
stores A as INT2 per-matrix and B as INT4 per-group-64 and selects
K=3 whole tensors globally (not per-row output columns).
aquariouseworkman added a commit to aquariouseworkman/parameter-golf that referenced this pull request Apr 27, 2026
…symmetric + Phased TTT

val_bpb = 1.06128 | ~15.95 MB | 8xH100 SXM

Key Change: SmearGate BOS Document Boundary Fix
Builds on PR openai#1797 stack (PR openai#1787 base + SmearGate + LQER Asymmetric) but fixes the SmearGate cross-document leakage bug identified by @cocohearts in PR openai#1797 audit.

The bug: SmearGate 1-token causal lookback does not mask BOS positions, so the final token of document N smears into BOS of document N+1.

Credits
@nprime06 -- PR openai#1787 base stack
@romeerp -- CaseOps transform (PR openai#1729)
@dexhunter -- SmearGate + LQER (PR openai#1797)
@cocohearts -- Identifying SmearGate BOS bug
@abaybektursun -- Score-first TTT (PR openai#549)
@clarkkev -- GPTQ SDClip + SP8192 (PR openai#1394)
Fija pushed a commit to Fija/parameter-golf that referenced this pull request Apr 28, 2026
Phase J (one-time data prep, done):
- train_sp10240_caseops.py: train SentencePiece BPE at vocab=10240 over
  CaseOps-transformed FineWeb. Reserves U+E001..U+E005 as user-defined
  symbols (matches PR openai#1729 / SP8192 reservation set). 96-worker, ~25 min.
- prepare_caseops_data_parallel.py with --sp pointing at the new model
  produces SP10240 caseops shards (~27 GB). Uploaded to private HF
  dataset hf://FijaEE/parameter-golf-sp10240-caseops (1434 train + 5 val
  + 5 val_bytes shards).
- Tokenizer model + vocab file committed under tokenizers/ for git clone.

Phase K (TTT params budget tradeoff, ready to run):
- runpod/phase_k_ttt_tradeoff.sh: train SP8192 V2 baseline once on 8xH100
  (~10 min, saves model.bin), then run TTT_EVAL_ONLY=1 for 4 configs
  reusing the saved artifact:
    K0: grad=1 prefix=2000 phases=3 ctx=2048   (V2 baseline)
    K1: grad=2 prefix=2000 phases=3 ctx=2048   (oracle, expected over-budget)
    K2: grad=2 prefix=1500 phases=1 ctx=2048   (cut prefix+phases)
    K3: grad=2 prefix=2000 phases=3 ctx=1024   (cut ctx)
  Auto-picks the lowest-BPB config that fits 600s for Phase L.

Phase L (3-seed combo, parametrized by Phase K winner):
- runpod/phase_l_combo.sh: PR openai#1797 V2 stack + SP10240 + LoRA rank 96 +
  best TTT params from K. Runs 3 seeds (42, 314, 1234), reports Welch
  t-test vs PR openai#1797 (1.06157±0.00066) and the 0.005-nat record bar.

Hypothesis (per user observation): vocab progression 1024→2048→4096→8192
has been monotonically beneficial; no one in the queue has tried sp10240
without PPM-D. PR openai#1814's lowercase-SP10240 single-seed (1.0742) suggests
~ -0.0015 BPB delta from vocab alone vs PR openai#1797's V2 SP8192 baseline
(1.05998 seed-42). Combined with TTT 2-step bump (PR openai#1812 showed 4-epoch
delivered -0.008 BPB on a different stack) and LoRA rank 96, total
expected ~1.045-1.055 BPB if Phase K finds a feasible budget.
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 29, 2026
… new SOTA 1.0608 imminent; PPM-D concerns raised; final day

- Discovered organizer has 2 pending branches staging 14 new leaderboard records
- BOS-fix branch confirms CaseOps LEGAL (PRs openai#1729/openai#1736/openai#1769/openai#1787 included as records)
- New SOTA when merged: 1.0608 (codemath3000, PR openai#1855); new target ≤1.0558
- Tap-In V6 (PR openai#1518) confirmed legal by organizer branch inclusion
- PPM-D: @valerio-oai raised concerns on PR openai#1835 (3M/40.5M partial data + autoregressivity); do not implement
- SmearGate BOS fix required (top entry PR openai#1855 uses it)
- Updated CLAUDE.md competition strategy + added Session 24 lessons learned
- Added Apr 29 daily research log entry

https://claude.ai/code/session_01AAiiKSwWxDtGTexxogAkeZ
cocohearts merged commit 11bae1d into openai:main on Apr 29, 2026
hilbertmeng pushed a commit to hilbertmeng/parameter-golf that referenced this pull request Apr 30, 2026
… — val_bpb 1.06549 (same commit message as the dexhunter push above)
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request May 1, 2026
Audits every CaseOps-lineage record-track PR (merged + unmerged) since
2026-04-18 for whether val docs are also in the training set.

Working set: 34 PRs (31 from chronological seed list + 3 discovered ancestors:
openai#1908, openai#1923, openai#2007). Boundary nodes openai#1493 / openai#1626 (pre-CaseOps).

Verdicts:
  - CLEAN (8): openai#1729, openai#1851, openai#1868, openai#1908, openai#2019, openai#2027, openai#2031, openai#2068
  - LEAK (25): openai#1736 (our research baseline) → openai#1769 → openai#1787 → openai#1797 → openai#1855 → V21 family (openai#1945, openai#1923, openai#1953, openai#1967) → openai#2018 → openai#2118
    (current claimed frontier 1.04350), plus siblings.
  - INHERIT (1): openai#2050 (eval-only on frozen openai#1915)

Code-level evidence (not README claims):
  - Every shipped prepare_caseops_data.py is byte-identical:
    SHARD_TOKENS=10_000_000, default=10_000 for --val-docs
  - NO PR overrides --val-docs (searched all .sh files in all 34 PRs)
  - cached_challenge_fineweb.py downloads from romeerp/parameter-golf-caseops-v1
    HF dataset whose manifest pins docs_val=50000, docs_train=8181945,
    sums match → CLEAN by construction
  - PR openai#2018's DATASET_AUDIT.md is gold-standard explicit leak description
  - PR openai#2118's submission.json admits "--val-docs=10000 train shards + 50k val eval"

Three signposts:
  - Leak introduced: PR openai#1736 by @dexhunter (Apr 19) — first prepare_caseops_data.py
    default invocation
  - Leak fixed: PR openai#1851 by @aquariouseworkman (Apr 27) — switched to HF dataset
  - Leak re-introduced: PR openai#1855 by @codemath3000 (same day) — rebuilt locally

The merged-leaderboard SOTA (openai#1851/openai#1868 at 1.06128/1.06141) is CLEAN.
The unmerged frontier (openai#2118 at 1.04350) is LEAK. The 0.018 bpb gap is
inflated by val memorization; spec 301 was designed to measure how much
remains under clean data.

Files:
  caseops-memory-leakage/README.md       — overview, methodology, takeaways
  caseops-memory-leakage/verdicts.md     — 34-row master table with evidence
  caseops-memory-leakage/family-tree.md  — ASCII trees with [C]/[L] annotations
anmarhindi added a commit to anmarhindi/parameter-golf that referenced this pull request May 2, 2026
…0.979556)

The cond-PPM mixer used SP-piece UTF-8 bytes (incl. CaseOps sentinel
overhead, 164,594,398 per seed) as the BPB denominator instead of the
canonical raw-text sidecar (151,074,309 per seed) used by every other
CaseOps-lineage record per PR openai#1729 convention. Reported by @codemath3000
on PR openai#2138; thank you.

Per-token NLL is invariant under denominator change, so the correction
is algebraic — no re-eval required, original artifact and logs preserved
as forensic record. New per-seed BPB = old × 164594398 / 151074309 =
old × 1.089493:

  seed 42:   0.97949078 -> 1.067148
  seed 1337: 0.97954725 -> 1.067210
  seed 314:  0.97962885 -> 1.067299
  mean:      0.979556   -> 1.067219  (std ~7.6e-05)
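
The rescaling is reproducible in a few lines (byte counts taken from the message above):

```python
# Per-token NLL is unchanged, so BPB rescales by the ratio of the two
# byte denominators: buggy SP-piece bytes vs canonical sidecar bytes.
buggy_bytes, canonical_bytes = 164_594_398, 151_074_309
ratio = buggy_bytes / canonical_bytes          # ~1.089493
for seed, old in [(42, 0.97949078), (1337, 0.97954725), (314, 0.97962885)]:
    print(seed, round(old * ratio, 6))         # matches the corrected table
```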

On the canonical denominator the submission is +0.006 BPB worse than
PR openai#1855 SOTA (1.06108), so this is no longer a SOTA-claim. LBM still
gives a real -0.034 BPB improvement over sliding-window-alone (1.101347)
on the canonical denominator; the C2-correctness story is unchanged.

This commit only patches interpretation:
  - README.md: prepend Errata section, corrected 3-seed table, source-
    line citations, algebraic derivation; reposition writeup as
    not-SOTA. Original technique writeup retained below.
  - submission.json: corrected val_bpb / val_bpb_per_seed / std /
    eval_canonical_byte_count_per_seed / headline_metric_description;
    add errata{} object with summary, original values, inflation ratio,
    credit, fix-branch pointer.

Forensic items deliberately untouched: train_gpt.py (wrapped, contains
buggy denominator), final_model.int6.ptz, train_seed*.log (each shows
both the buggy 'cond_ppm bytes=164594398' line and the canonical-
correct 'quantized_sliding_window val_bpb' line — the sidecar count
151,074,309 is reverse-solvable from the latter).

Fix lives on cond-ppm-stack of github.com/anmarhindi/parameter-golf-a.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>