Record: SP8192 + CaseOps + GatedAttn + QuantGate + Loop45 + PhasedTTT — val_bpb 1.06549#1736
Conversation
… — val_bpb 1.06549 3-seed mean 1.06549 (std 0.00070) on 8×H100 SXM, all gates green:
- artifact 15,975,120 bytes mean (≤16,000,000 DECIMAL)
- train_time 596.14s mean (≤600s)
- total_eval_time 397.23s mean (≤600s)

Builds on PR openai#1530's SP8192 stack. Adopts CaseOps (lossless_caps_caseops_v1) bijective case preprocessing from PR openai#1729, with a per-token byte sidecar so BPB is scored on the original pre-transform UTF-8 bytes. Adds a learned attention out-gate (init_std=0.005) plus quant-gate scaling that recovers the ~40 KB of overhead introduced by the new control tokens, keeping every seed under the 16 MB decimal cap.

Seeds: 42 (1.06610), 0 (1.06473), 1234 (1.06563).
…ad, MP-SGD TTT 4-phase
- PR openai#1698 (GDN FLA, claimed 1.00995): BPB bug confirmed by dexhunter (~1.189 actual) + artifact size violation; effectively dead
- New technique: CaseOps bijective tokenizer (PR openai#1729/openai#1736/openai#1738) — reversible case-factoring with byte sidecar; stronger legality than casefold; awaits Issue openai#1604 ruling
- PR openai#1735 (pre-quant TTT 21ep) flagged illegal by dexhunter; PR openai#1738 builds on it, both likely void
- PR openai#1727 (MP-SGD TTT 4 phases, 1.07217): appears legal, stackable
- Merged SOTA 1.0810, Day-10 plateau; 11 days to deadline

https://claude.ai/code/session_012mo6412sGQRVjF7TDmfx31
Bulk import of dexhunter's openai#1736 unmerged submission (openai#1736, commit e100586) for reproduction as our new research baseline. Source: records/track_10min_16mb/2026-04-19_SP8192_CaseOps_GatedAttn_QuantGate_Loop45_PhasedTTT/. 9 files, ~6856 lines:
- train_gpt.py (training script)
- lossless_caps.py (bijective CaseOps transform)
- prepare_caseops_data.py (data retokenization script)
- fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model (SP tokenizer)
- README.md, submission.json, 3 per-seed training logs

No modifications to repo-root files. Spec: research/specs/008-1736-reproduction.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
After the 2026-04-19 frontier scan, rebasing the research baseline from merged SOTA openai#1493 (1.0810) to unmerged PR openai#1736 (dexhunter, claimed 1.06549). Rationale: the credible frontier moved ~0.015 bpb past merged SOTA in 10 days via witnessed, legal levers (CaseOps tokenizer, attn-out gate, phased TTT). Continuing off spec-000 leaves us behind before we try anything.

- CLAUDE.md: baseline declared; baseline-migration specs land on research directly (an exception to the exp/<slug> convention).
- research/frontier-map.md: credibility filter + dependency map.
- diary/2026-04-19-frontier-{scan,map}.md: per-PR evidence base.
- research/ideas/1736-improvement.md: three-spec migration plan.
- research/specs/008-1736-reproduction.md: spec for the reproduction run, pinned to commit 154c9b8 (openai#1736 import at e100586).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Collapse spec 008 to seed 42 only and add a one-line pre-GPTQ FP checkpoint save at runs/008-1736-reproduction/seed_42/pre_gptq.pt (env-var gated via SAVE_PRE_GPTQ=1, so the reproduction itself is unaffected when the flag is off).

Rationale: SpinQuant and subsequent quant-family experiments are purely post-training transforms, so hotstarting off a single pre-GPTQ FP checkpoint is far cheaper than retraining per spec. A single-seed comparison against openai#1736's seed-42 (1.06610, ±0.003) is apples-to-apples for screening. Cost drops ~$40 -> ~$17 for this spec and ~$10 -> ~$1-2 per downstream quant experiment.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
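For reference, a minimal sketch of what that env-var gate could look like (the SAVE_PRE_GPTQ flag name is from the spec; the function name and call site are assumptions):

```python
# Hypothetical sketch of the env-gated pre-GPTQ checkpoint save.
import os
import torch
import torch.nn as nn

def maybe_save_pre_gptq(model: nn.Module,
                        path: str = "runs/008-1736-reproduction/seed_42/pre_gptq.pt"):
    # Default-off: the reproduction is byte-identical unless SAVE_PRE_GPTQ=1
    # is exported before launch.
    if os.environ.get("SAVE_PRE_GPTQ", "0") == "1":
        os.makedirs(os.path.dirname(path), exist_ok=True)
        torch.save(model.state_dict(), path)
```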
First lever layered on the new openai#1736 baseline: Hadamard rotation of weight matrices before GPTQ quantization, hotstarted off spec 008's pre_gptq.pt FP checkpoint. No retraining. Witnessed at a claimed -0.005 bpb on PR openai#1695 (X-Abhishek-X) on a openai#1529-adjacent base; expected to compose cleanly with openai#1736 since the quant stage is orthogonal to CaseOps / attention gates / phased TTT.

Rotation is a post-training transform with three classes (residual-stream, per-layer attn, per-layer MLP); the FP forward pass is invariant by construction, only the quantization error drops. Cost ~$6 (hotstart off the spec 008 checkpoint) vs ~$30 for a full retrain. The same hotstart checkpoint is reused by future quant experiments (per-group bit, AR-selfgen calib, AWQ).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
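A toy round-to-nearest demonstration of why rotating before quantization can help (the real pipeline uses GPTQ, not RTN; this only illustrates the outlier-spreading effect):

```python
# Toy demo: an orthogonal Hadamard spreads outlier channels across the
# whole basis, shrinking the per-tensor quant scale and hence the error.
import torch

def hadamard(n):                        # Sylvester construction, n = 2^k
    H = torch.ones(1, 1)
    while H.shape[0] < n:
        H = torch.cat([torch.cat([H, H], 1), torch.cat([H, -H], 1)], 0)
    return H / H.shape[0] ** 0.5        # orthonormal: H @ H.T == I

def rtn(W, bits=4):                     # per-tensor round-to-nearest
    scale = W.abs().max() / (2 ** (bits - 1) - 1)
    return (W / scale).round().clamp(-2 ** (bits - 1), 2 ** (bits - 1) - 1) * scale

d = 256
W = torch.randn(d, d)
W[:, :4] *= 20                          # a few outlier input channels
R = hadamard(d)

err_plain = (rtn(W) - W).norm()
err_rot = (rtn(W @ R) @ R.T - W).norm() # quantize in rotated basis, rotate back
print(err_plain, err_rot)               # rotated error is typically far lower
```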
Based on reading train_gpt.py at commit 154c9b8:

Good: RMSNorm is gamma-free (line 529), so the usual gamma-fold step doesn't apply; RMSNorm is rotation-equivariant directly.

Bad: openai#1736 has five OTHER per-channel multipliers on residual flow (attn_scale, mlp_scale, resid_mix, skip_weights, skip_gates). These are the real fold targets, not RMSNorm. resid_mix is pre-norm and cannot be cleanly folded.

Split into three SpinQuant modes selectable by SPINQUANT_MODE:
- internal_only (R_a, R_m per layer; no residual rotation)
- full (internal + R0, with attn_scale/mlp_scale/skip folds and a resid_mix freeze-to-mean compromise)
- port_1695 (conditional on openai#1695's diff being meaningfully different)

All three run back-to-back on one pod hotstarted off spec 008's final_model.pt. ~30 min total GPU, ~$17-22 budget, one eval. research/ideas/spinquant-integration-notes.md captures the full design analysis (per-multiplier fold feasibility, three-option tradeoff, shared-code plan).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
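For the fold targets named above, the generic per-channel fold pattern looks like this (a sketch only; it assumes the multiplier post-scales a linear's output, which is exactly what pre-norm resid_mix violates):

```python
# A per-channel output scale s can be absorbed once into the weight as
# diag(s) @ W, removing the runtime multiply.
import torch

d = 64
W = torch.randn(d, d)
s = torch.rand(d) + 0.5            # per-channel multiplier, e.g. attn_scale
x = torch.randn(8, d)

y = (x @ W.T) * s                  # multiplier applied at runtime...
W_folded = s[:, None] * W          # ...or folded into the weight once
assert torch.allclose(y, x @ W_folded.T, rtol=1e-4, atol=1e-5)
```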
Added SPINQUANT_MODE=baseline as a fourth variant that applies no rotation — it just loads final_model.pt and runs serialize/deserialize/eval/TTT on it. Two purposes:
1. Closes the loop on spec 008's missed post-TTT number (the watcher stopped the pod before the TTT eval ran). No separate $3 eval-only rerun needed.
2. Provides the apples-to-apples local reference for measuring the three SpinQuant variants' deltas — removes any cross-pod bf16 drift from the comparison.

Order: baseline -> internal_only -> full -> port_1695, sequential on one pod. Gate: if baseline lands outside openai#1736's 1.06610 ± 0.003, halt before running rotations (it would mean the spec 008 reproduction is off). Total cost ~$27 (was $22); it absorbs ~$3 of an otherwise-separate eval rerun, so the net increment is ~$2 for four measured numbers.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Two new files in the openai#1736 submission dir:

spinquant_hotstart.py (~360 LOC):
- Imports Hyperparameters/GPT/serialize/deserialize/eval_val/eval_val_ttt_phased/BatchedTTTLoRA/etc. from train_gpt.py.
- Modes: baseline, internal_only (R_a only, per-layer per-KV-group, d_head rotation on V-output and O-input). full and port_1695 are stubs that raise NotImplementedError with an explanation.
- Pipeline: load FP state_dict from HOTSTART_FP_CKPT -> apply rotations in-place on banked qo_bank/kv_bank -> optional pre-quant diagnostic eval -> call serialize() (GPTQ+compress) -> deserialize() -> quantized eval -> phased TTT eval -> write final.json.
- Reproduces the TTT eval block from train_and_eval (lines 2997-3075) in _run_ttt_eval() rather than refactoring the source file.

test_rotation_invariance.py (~250 LOC):
- CPU-only, standalone (no train_gpt.py import due to flash_attn_3/triton module-level deps).
- Self-contained minimal attention forward: Q/K/V projection from the banked tensors, RMSNorm on Q and K (matches the real model's bound on attention logits; without this, trained weights saturate softmax and float noise in V amplifies catastrophically).
- Tests baseline (bit-exact identity) and internal_only (rel tolerance 1e-4) against either synthetic random weights or spec 008's final_model.pt. Both pass cleanly (rel_max ~1e-6 on the real checkpoint).
- Can load either banked (qo_bank/kv_bank) or unbanked (blocks.N.attn.*.weight) state_dict format.

Spec 009 updated: scope reduced to 2 modes (baseline, internal_only) for this session; full and port_1695 deferred. Rationale in the spec: MLP LeakyReLU-squared breaks R_m float-invariance, and resid_mix can't be cleanly folded through RMSNorm; both need design before implementation.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
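The property the internal_only test asserts, as a standalone sketch (synthetic weights, a single head, no RMSNorm or softmax — the real test adds both):

```python
# A per-head rotation applied to V's output channels and O's input channels
# cancels in the attention output (fp64 for tight tolerance).
import torch

torch.manual_seed(0)
d = 64
W_v = torch.randn(d, d, dtype=torch.float64)   # stand-in V projection
W_o = torch.randn(d, d, dtype=torch.float64)   # stand-in O projection
x = torch.randn(8, d, dtype=torch.float64)

R = torch.linalg.qr(torch.randn(d, d, dtype=torch.float64)).Q  # orthogonal

W_v_rot = R.T @ W_v     # v_rot = x @ W_v_rot.T = (x @ W_v.T) @ R
W_o_rot = W_o @ R       # O consumes the rotated basis

y = (x @ W_v.T) @ W_o.T
y_rot = (x @ W_v_rot.T) @ W_o_rot.T
assert torch.allclose(y, y_rot)   # R @ R.T == I cancels
```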
Cleanup pass to resolve inconsistencies between the spec and what's actually in spinquant_hotstart.py + test_rotation_invariance.py:
- Title + scope: 2-mode sweep (baseline, internal_only); full and port_1695 explicitly deferred to a follow-up spec.
- Checkpoint path: pre_gptq.pt (what execution's spec-008 patch produced, after _unbank_state_dict), not final_model.pt.
- Accept criteria: preflight via test_rotation_invariance.py (ALL TESTS PASS), then per-mode on pod.
- Rotation structure: trimmed to just the implemented R_a class with exact banked-tensor indexing. R_0 / R_m / skip-stream / RMSNorm-fold sections moved to 'not implemented (deferred)'.
- RMSNorm-fold section removed entirely: openai#1736's RMSNorm is gamma-free (F.rms_norm with no weight arg), so no fold is needed.
- Code-changes section: points at the files on disk instead of TODO pseudocode.
- Execution protocol: 2 modes back-to-back on 8xH100, explicit preflight step.
- Hardware ladder: 8xH100 required (phased TTT is 8-rank DDP).
- Cost estimate: ~$15 total for 2 modes.
- Open questions: reframed around unbanked-checkpoint load, bf16 drift, GPTQ interaction, and phased-TTT compatibility.
- What this spec does NOT do: clarified that residual rotation, R_m, resid_mix, and port_1695 are all deferred.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…approach

Read openai#1695's diff. Their approach is fundamentally different from the static-weight-rotation + folds design I had in mind for 'full' mode. They do ONLINE activation rotation: 4 global Hadamard rotations inserted as x @ R matmuls at 4 forward-pass sites (residual->qkv, attn-out->proj, residual->fc, hidden->proj). GPTQ then quantizes in the rotated basis; rotated Hessians keep the quant-side accounting honest. Rotations are OFF during training, ON after deserialize for eval+TTT.

Why this matters: their scheme sidesteps BOTH blockers that made the full mode complicated:
- LeakyReLU non-equivariance: R_mlp_proj_in is applied AFTER the LeakyReLU-square, not across it.
- resid_mix: rotations are per-linear-input and never touch the residual stream. All per-channel multipliers (attn_scale, mlp_scale, resid_mix, skip_weights) operate in an unchanged basis.

No float invariance — the model IS different post-rotation. The bet is that rotated-basis GPTQ delivers lower quant error and that the perturbation is smaller than the savings.

Implication: deprecate the 'full' static-rotation-with-folds plan in favor of a future 'port_1695' spec that ports their online scheme. internal_only mode from spec 009 remains useful as an independent data point (R_a only, fp-invariant).

Spec 010 (tapered WD) drafted as an independent parallel track:
- Ports PR openai#1729's WD_TAPER_START_FRAC=0.70, WD_TAPER_FINAL_MULT=0.50
- Muon-WD-only taper on top of openai#1736's existing schedule
- Full retrain on 8xH100, single seed, ~$20
- Independent of spec 009 (different pod, no shared state)
- Can run in parallel with 009's eval-only sweep

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…n sprint

Session-narrative entry covering today's work:
- Frontier filter + baseline migration from merged SOTA openai#1493 (1.0810) to unmerged openai#1736 (1.0655): rationale, CLAUDE.md update.
- Spec 008 run partial result (training reproduced openai#1736 within +0.00016 at pre-quant; the post-TTT gate number was not captured due to a watcher bug; projected pass ~1.06626).
- Spec 009 design evolution through three scope cuts: 4 modes -> unified sweep -> +baseline mode -> cut to 2 modes after discovering real architectural blockers (MLP LeakyReLU breaks R_m, resid_mix doesn't fold cleanly).
- openai#1695 diff discovery: they do online activation rotation, not static weight rotation. Sidesteps both LeakyReLU and resid_mix. Reframes 'full' mode -> port_1695 mode as the next quant-side spec.
- Specs 010 (port_1695, design only) and 011 (tapered WD, design only) drafted. Only spec 009 is truly runnable right now.

Closes with a state-of-play table, modal plan, lessons learned, and open questions for the next session.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Implements the port_1695 SpinQuant variant from PR openai#1695 onto the openai#1736 stack. All changes are env-var-gated (SPINQUANT_ENABLED=0 default) so spec 008 and spec 009's baseline/internal_only modes are unaffected bit-for-bit.

train_gpt.py changes (+247 lines):
- import hashlib
- Hyperparameters.spinquant_enabled, spinquant_seed
- CastedLinear._sq_active class flag (default False)
- Utility block: _stable_seed, _hadamard_rotation, install_spinquant_rotations, _SQ_KEY_TO_TAG, _spinquant_rotate_sd_and_H
- 4 forward-path hook sites (2 each in CausalSelfAttention, MLP, _block_with_lora, _parallel_block_with_lora):
  - pre-QKV: x_qkv = x @ R_attn_in
  - pre-attn-proj: y @ R_attn_proj_in
  - pre-fc: x @ R_mlp_in
  - post-activation pre-proj: hidden @ R_mlp_proj_in
- serialize(): call _spinquant_rotate_sd_and_H after Hessian collection and before GPTQ. Rotates weights (W @ R) and Hessians (R.T @ H @ R).
- deserialize(): install_spinquant_rotations + set _sq_active=True after loading rotated weights.
- MLP.forward: disable fused kernel when SpinQuant is active.
- LoRA (TTT path) uses unrotated n; the base path uses rotated n_qkv.

spinquant_hotstart.py changes:
- port_1695 mode no longer raises NotImplementedError. Sets h.spinquant_enabled=True and h.spinquant_seed; train_gpt.py's machinery does the rest.

Math: orthogonal R means R @ R.T == I, so x_rot @ W_rot.T = (x @ R) @ (W @ R).T = x @ R @ R.T @ W.T = x @ W.T. Pre-quant forward is bit-identical to unrotated; GPTQ sees a rotated basis where outliers are spread more evenly and quantization error drops.

Spec 010 doc updated to reflect the implementation state. Execution runs via SPINQUANT_MODE=port_1695 on spinquant_hotstart.py. Not tested on GPU — flash_attn_3 is not available on the dev box. Syntax clean; the first pod run will verify end-to-end behavior.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
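The identity in the Math paragraph, checked numerically (fp64; illustration only, not the repo's code):

```python
# With orthogonal R, rotating a linear's input activation and its weight
# leaves the output unchanged (exact in exact arithmetic).
import torch

torch.manual_seed(0)
d_in, d_out = 128, 64
x = torch.randn(8, d_in, dtype=torch.float64)
W = torch.randn(d_out, d_in, dtype=torch.float64)  # nn.Linear convention: y = x @ W.T
R = torch.linalg.qr(torch.randn(d_in, d_in, dtype=torch.float64)).Q

y_plain = x @ W.T
y_rot = (x @ R) @ (W @ R).T    # rotated activation, rotated weight
assert torch.allclose(y_plain, y_rot)   # R @ R.T == I cancels
```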
Continuation of the morning diary. Covers:
- Spec 009 baseline closed spec 008's gate at 1.06728 (matches openai#1736's 1.06610 within bf16 noise). internal_only null (+0.00003).
- Spec 010 port_1695 also null in aggregate (-0.00005), BUT per-batch analysis revealed a striking regime-dependent effect: rotation helps long-context docs (-0.0064 bpb on dl>1000) and hurts short-context docs (+0.0146 on dl<300). The null is a cancellation, not an absence of effect.
- 'TTT substitutes for rotation' hypothesis revised — the rotation delta is ~0 at both pre-TTT and post-TTT stages. What rotation actually does is shift where in the doc-length distribution the model is strong, without changing the aggregate.
- Designed + implemented spec 010b (SPINQUANT_SITES env var) to isolate which sites (attn vs MLP) carry the help vs hurt. Ready for execution, ~$25.
- Lessons: look at per-batch trajectory data before concluding a null is null. Length-sorted running averages are systematically biased. Don't pivot prematurely from a signal you haven't fully interrogated.

Still $163 under project budget.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Closes the SpinQuant investigation arc with spec 010b's results and an honest retrospective on the false-signal episode. Key findings:
- All 5 SpinQuant variants (baseline, internal_only, port_1695, attn_only, mlp_only) land within 0.00009 bpb at final val_bpb. Pure null. openai#1736 has seed std ~0.00070; we are 10x below that.
- Pt. 2's "regime-dependence is exploitable" hypothesis refuted. attn_only ≈ baseline on rank 0 (attention rotation does nothing); mlp_only has the inverse regime from port_1695 (hurts long, helps short); neither subset comes close to port_1695's emergent rank-0 trajectory lead.
- Rank-0 rb spread across variants: 0.0075 bpb. Final val_bpb spread across variants: 0.000085 bpb. 80x compression from 8-rank aggregation + uniform TTT LoRA absorption.

Mistake I owned up to: read rank-0 rb 1.0657 for mlp_only at batch 780 and suggested "mlp_only might actually net positive." final.json came out +0.000005 above baseline. Rank-0 rb is rank 0's 1/8 slice, not a preview of the submission number.

Methodology corrections for future runs:
- Always check final.json before any trend interpretation.
- Rank-0 rb is a progress indicator, not a metric preview.
- When the pre-TTT diagnostic_quantized spread is < 0.001, post-TTT will be near-identical (TTT LoRA dominates).

Budget: spent ~$52 of $200 total. 10 days left. Next: spec 011 (tapered Muon WD retrain) — upstream of TTT, might unlock something TTT can't absorb. Patch still unwritten.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Scanned 200 PRs from 2026-04-11..20. After exclusion filters, 3 candidates beat spec 011's expected Δ: openai#1682 (GradPower Muon p=0.9), openai#1648 (xIELU + per-layer QK gain), openai#1555 (Tap-In eval cache). Full artifacts at ~/competition-pr/pr-scan-2026-04-20/.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…1716)

Two orthogonal training-time levers queued behind spec 011:
- bpb-weighted-loss.md (port openai#1519): weight CE by UTF-8 bytes per token. Aligns the training objective with the eval metric. Risk: SP8192 vocab destabilization (the author warns about large vocabs) + CaseOps byte-LUT accounting (~1 hr of careful code).
- bigram-hash-embed.md (port openai#1716): 16384×32 hash-table bigram embed added to the token embedding pre-block-0. ~540K params / ~400KB artifact. openai#1736 genuinely lacks this despite its prevalence in competitive lineages.

Recommended sequencing: 011 → 012 (QK) → 013 (BigramHash, lower risk) → 014 (BPB-weighted, higher risk).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
110 LOC pure addition to train_gpt.py, fully env-gated by BIGRAM_HASH_ENABLED=0/1. Default-off invariant: with the env unset, the forward pass, state_dict, and optimizer param list are byte-identical to baseline.

Components:
- BigramHashEmbedding(nn.Module): embed(buckets, dim) + CastedLinear proj(dim, model_dim). proj._zero_init=True -> identity at step 0. Hash: ((prime_a * curr) ^ (prime_b * prev)) % buckets. Position-0 fallback: prev = curr (self-bigram). Cross-doc leakage is not special-cased, matching openai#1736's SmearGate convention.
- GPT.__init__: creates self.bigram_embed when enabled, else None.
- forward_logits + forward_ttt: additive merge of bigram(input_ids) into tok_emb(input_ids) before SmearGate; attr-guarded.
- Optimizers: embed.weight -> AdamW optimizer_tok (embed_wd); proj.weight -> Muon matrix_params.
- GPTQ hessian hooks: bigram_embed.embed output -> (dim,dim) hessian; bigram_embed.proj input -> (dim,dim) hessian (proj is <=65536 numel so fp16 passthrough; harmless hook).
- Startup log line echoing the config.

Sizing: 16384*32 int6 embed ~= 393KB; 512*32 fp16 proj = 32KB. Total ~425KB added to the artifact; a budget dry-run is needed before launch.

Env vars (defaults): BIGRAM_HASH_ENABLED=0, BIGRAM_HASH_BUCKETS=16384, BIGRAM_HASH_DIM=32, BIGRAM_HASH_PRIME_A=36313, BIGRAM_HASH_PRIME_B=27191.

Bug lesson learned from exp/training-bundle commit 8d54854: when Edit's old_string only captures part of a for-loop body, trailing loop statements get pushed outside the loop and may be absorbed by nearby conditional blocks. This patch is pure prepend/append style (no splits of existing blocks), so that failure mode is avoided.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
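A condensed sketch of the module as described above (CastedLinear is stubbed with nn.Linear, and explicit zero-init stands in for proj._zero_init):

```python
# Bigram hash embedding: hash (prev, curr) token pairs into a small table,
# project up to model_dim, and add to the token embedding upstream.
import torch
import torch.nn as nn

class BigramHashEmbedding(nn.Module):
    def __init__(self, buckets=16384, dim=32, model_dim=512,
                 prime_a=36313, prime_b=27191):
        super().__init__()
        self.buckets, self.prime_a, self.prime_b = buckets, prime_a, prime_b
        self.embed = nn.Embedding(buckets, dim)
        self.proj = nn.Linear(dim, model_dim, bias=False)
        nn.init.zeros_(self.proj.weight)   # identity contribution at step 0

    def forward(self, input_ids):          # [B, T] int64 token ids
        prev = torch.roll(input_ids, 1, dims=-1)
        prev[..., 0] = input_ids[..., 0]   # position-0 fallback: self-bigram
        h = ((self.prime_a * input_ids) ^ (self.prime_b * prev)) % self.buckets
        return self.proj(self.embed(h))    # added to tok_emb upstream
```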
Compiled a reference list for the architecture-side research thread, including:
- XSA identified as Exclusive Self-Attention (Apple, arXiv 2603.09078). Matches openai#1736's _xsa_efficient exactly.
- Universal Transformer (Dehghani 2018) and ACT (Graves 2016) as foundational recurrence references.
- Key 2025 finding from the ILR paper (arXiv 2505.01855): allocating more iterations to EARLIER layers yields optimal results. openai#1736's Loop45 (middle layers) may be sub-optimally positioned.
- Parallel-residuals literature: GPT-J / PaLM well-studied; multi-lane variants (Branchformer etc.) mostly in vision, thin in NLP.
- Synthesis of candidate variants prioritized by novelty × EV × cost.
- Proposed next step: instrument openai#1736 to log cross-pass cosine similarity during training. If high → cross-pass XSA is worth trying. If already low → a different variant is needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Added a section on 'when to activate recurrence' research. Key findings:
- ProRes, SGT, and Staged Training all recommend progressive/curriculum activation over hard switches.
- The literature has conflicting claims about WHERE convergence happens first (shallow vs deep layers).
- Consistent claim: progressive beats hard switch for stability.
- openai#1736's enable_looping_at=0.35 is a hard switch — suboptimal per the literature.

Candidate variants identified, ranked by implementation cost: env-var sweeps (1, 2) vs code-change ramps (3, 4).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
… candidates

User shared a deep timeline of all recurrence experiments in the PG competition (openai#8 through openai#1739). Several of my previously proposed experiments have ALREADY BEEN TESTED ON THIS STACK and shown to fail.

KILLED:
- Earlier timing sweep: openai#1726 showed 0.15 is +0.050 worse; openai#1739 showed step-0 is catastrophic (1.3936 bpb).
- Progressive ramp: openai#1663 showed hard-onset = smooth, no difference.
- Position shift: openai#1726 showed layers 2-7 +0.163 worse and a layer 5-6 shift +0.006 worse — layers 3-5 ARE the empirical sweet spot.

Also corrected the baseline config: openai#1736 uses LOOP_START=3 LOOP_END=5 (three layers: 3, 4, 5 — "Loop345"), not Loop45 as the directory name suggests. 3 layers × 3 passes = 17 virtual layers.

VIABLE candidates:
- Recur-Alpha (openai#1714, Anakintano): learnable scalar per looped block, init 0 → identity. 6 params. The author's grant ran out before the TTT eval, so composition with openai#1736's phased TTT is genuinely open. NEW TOP PICK.
- Cross-pass XSA: still novel, untested in any PR.
- Loop3-6 variant (openai#1678): tashapais is running it; might wait for the result.

Recommendation updated: port Recur-Alpha onto openai#1736 as spec 015. ~$25, identity-at-init (safe), 30 LOC, a direct recurrence question.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
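A sketch of the Recur-Alpha wrapper as summarized above (the x + α·block(x) blend form is an assumption; openai#1714's exact wiring may differ):

```python
# One learnable scalar per looped block, zero-init so each extra
# recurrence pass starts as the identity and is learned from there.
import torch
import torch.nn as nn

class RecurAlpha(nn.Module):
    def __init__(self, block: nn.Module):
        super().__init__()
        self.block = block
        self.alpha = nn.Parameter(torch.zeros(()))   # init 0 -> identity

    def forward(self, x):
        return x + self.alpha * self.block(x)
```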
Shelving actions:
- Wrote research/evaluations/014-bpb-weighted-loss.md with full rationale and revisit criteria (post-deadline only).
- Added a SHELVED status banner to the top of the spec file.
- Added an experiments.md row marking 014 as 🗄️ SHELVED (permanently).

Decision: do NOT retune. The magnitude is too large (+0.0619 = 62× the shelve threshold) to be recoverable via an LR sweep. The three-null pattern (011, 013, 014) confirms that incremental ports from different-stack authors do not transfer to openai#1736. Moving budget to spec 015 (Recur-Alpha).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Replica of the spec-000-era lr_schedule.py for openai#1736/spec-015's stack. Shows all four training-time schedules on one figure:
1. lr_mul (warmdown) — wallclock-based, starts at step 1207
2. effective LR — MATRIX_LR × lr_mul, concrete numbers
3. Muon momentum — step-based warmup, plateau at step 1500
4. looping_active — hard switch at step 1690 (wallclock 35%)

Key non-obvious finding: warmdown (step 1207) begins BEFORE looping activates (step 1690). When recurrence kicks in, the LR is already ~17% decayed. This sequencing is baked into openai#1736's defaults.

Five distinct training regimes:
- [0, 1207]: muon momentum warming, nothing else changing
- [1207, 1500]: warmdown begins, muon still warming
- [1500, 1690]: warmdown continues, muon plateau, looping still off
- [1690]: looping activates (architectural change)
- [1690, 4828]: all settled, just linear LR decay

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
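A rough reconstruction of the overlay script (step thresholds are from the text; the functional shapes — especially the wallclock-based warmdown rendered as linear-in-steps — are assumptions; effective LR is omitted since it shares lr_mul's shape):

```python
# Overlay of the schedules so the warmdown-before-looping sequencing is visible.
import numpy as np
import matplotlib.pyplot as plt

TOTAL = 4828
steps = np.arange(TOTAL)

lr_mul = np.where(steps < 1207, 1.0, 1.0 - (steps - 1207) / (TOTAL - 1207))  # shape assumed
momentum = np.clip(steps / 1500, 0.0, 1.0)        # warmup shape assumed, plateau at 1500
looping = (steps >= 1690).astype(float)           # hard switch at 35% wallclock

fig, ax = plt.subplots()
ax.plot(steps, lr_mul, label="lr_mul (warmdown)")
ax.plot(steps, momentum, label="muon momentum")
ax.plot(steps, looping, label="looping_active")
for s in (1207, 1500, 1690):
    ax.axvline(s, linestyle=":", linewidth=0.8)   # regime boundaries
ax.set_xlabel("step")
ax.legend()
fig.savefig("lr_schedule.png")
```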
- diary/2026-04-21-recur-alpha-findings.md — the full story of the spec 015/016 single-seed screens: α trajectories side-by-side, 5 findings (α>1 on pass-2, <1 on pass-3 at depth, depth-monotonicity inverts between passes, the plateau is path-dependent, late-training rate unchanged), a full caveats section, ranked next steps.
- research/ideas/beating-1736-note.md — four-run throughput + pipeline comparison (008/015/016/openai#1736). Works backward from the 1.06610 target to a 0.00183 gap on pre-quant post-EMA; matched throughput alone gives 3.3× margin over the gap. The risk ranking has TTT composition as the one unknown (GPTQ cost is validated at +0.00947 parity). Concludes: a single matched-clock NA run with the bug-fixed TTT pipeline (~$10-15) settles the whole story.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Primary submission-candidate run for the recur-alpha family. Same commit as 016 (4dd2d63); NA 8xH100 to eliminate JP throughput variance; full training + GPTQ + phased-TTT pipeline end-to-end (no EVAL_ONLY_CHECKPOINT bypass, which OOM'd in 016 post-hoc). Goal: post-TTT val_bpb <= 1.06550 (beat openai#1736's 1.06610 by >= 0.0005).

Runs regardless of 016b's throughput-tax outcome:
- If no tax: high-confidence attempt at an openai#1736 beat
- If tax: diagnostic for TTT × recur-alpha composition
- Either way we capture the post-TTT number that 016 post-hoc missed

Single seed 42 first, 3-seed conditional on a clear-promote bucket. Costs ~$10 single-seed, ~$30-34 with 3-seed confirmation. Includes a conditional decision tree on 016b branches and tok/s-logging requirements for direct throughput comparison with 016b's 2xH100 data.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…016 full pipeline"

NA-1 has no 8xH100 capacity today. Reframe spec 017 as: run spec 016's commit (4dd2d63) with the full training + GPTQ + phased-TTT pipeline end-to-end on whichever region has capacity (JP is fine). The primary purpose is capturing the post-TTT val_bpb that 016's screen (killed early) and 016's post-hoc TTT eval (OOM'd) both missed. On JP the expected post-TTT is ~1.0679-1.0682 — close to, but probably not beating, openai#1736's 1.06610. Still worth it: a real composition measurement replaces the projection chain.

Path fixes: JP volume jlxvxeiol4 mounts at /runpod (not /workspace); the example launch command is rewritten accordingly. Memory entry added to the cross-session reference.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Submission-quality test of constant-α (017 endpoint values) with the full training + GPTQ + phased-TTT pipeline. Pins commit 2895db3 on exp/recur-alpha-constant-full, which extends 018c's constant-α wiring to the TTT forward path. Target: beat openai#1736's 1.06610 post-TTT. Expected range 1.0650-1.0675 based on 018c's 92% throughput recovery + the TTT bug fix. Single seed 42 first, 3-seed conditional on a clear promote.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
- 4,697 steps (vs 4,828 for 008) due to a slow JP node, not constant-α overhead
- Per-step quality strictly better than 008/017 at matched steps
- Linear extrapolation to step 4828 → post-TTT ~1.0606 (beats openai#1736)
- Recommendation: rerun on an NA-1 pod

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
…ai#1736 by 0.00018 Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Reverts the frozen-α container (buffer or Parameter(requires_grad=False)) back to the learnable Parameter form of 017. Combines 017's recipe with the 021e stack's TTT α fix (931bd7c) and algebraic blend form (d761a22) — both of which 017 was missing.

Motivation:
- 017's pre-quant post-EMA (1.06861) was the BEST of any 8H run this session. All frozen-α variants (019b at 1.06951, 021e at 1.06944, etc.) land ~0.0008-0.001 worse.
- 017's post-TTT (1.06733) was held back by the TTT α bug (α not applied in forward_ttt). Fixing this should recover ~0.002 of TTT delta.
- The algebraic blend form (matches 019b-original's kernel pattern) adds another potential 0.001-0.003 improvement.
- Combined projected post-TTT: 1.07781 - 0.01249 = 1.06532 → a decisive beat of openai#1736 by 0.00078 and of 019b by 0.00096.

Implementation: 3-line change on top of 021f (0ad5269). Remove the register_buffer + endpoint-tensor construction, replace with nn.Parameter(torch.ones(...), requires_grad=True). The optimizer guard already handles requires_grad=True correctly (α re-enters scalar_params). dtype=bfloat16 retained from the 021e stack (vs 017's original fp32) for blend-kernel consistency; no cast needed at blend time.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@codemath3000 thanks for the reproducer — confirmed and patched. This is a prep-script bug only; training and the submitted metric are unaffected.

**Root cause.** Line 157 of `prepare_caseops_data.py` writes each doc's tokens without prepending the `<s>` BOS token (ID 1). The SP tokenizer reserves IDs 0-7, so `sp.encode` can never emit ID 1 naturally, and `train_gpt.py`'s `_find_docs` requires BOS markers with no fallback — hence the ZeroDivisionError in the phased TTT eval.

**Scope.** The submitted 1.06549 is on valid data — our seed runs used shards produced by a different internal prep path that already prepends BOS.

**Fix.** Prepend a BOS to every doc:

```python
# near module top, with other constants
BOS_ID = 1

# inside the per-doc loop
for text in _iter_docs(args.docs):
    transformed = encode_lossless_caps_v2(text)
    token_ids = [BOS_ID] + sp.encode(transformed, out_type=int)
    if n_docs < args.val_docs:
        byte_counts = _token_original_byte_counts(sp, text, transformed)
        val_buf_tokens.extend(token_ids)
        val_buf_bytes.append(0)  # BOS = 0 original bytes
        val_buf_bytes.extend(int(b) for b in byte_counts)
    else:
        train_buf.extend(token_ids)
```

Matches the canonical pattern in `data/download_hf_docs_and_tokenize.py`. Pushed in commit d7263a3 on this branch (and fe7c309 on PR #1769, which ships the same prep script). README now includes a Reproduction sanity check section that asserts bos_count > 0 on the first val shard.
External reproductions of this submission failed with ZeroDivisionError in the phased TTT eval because the shipped prep script did not prepend the <s> control token (ID 1) to each doc. The SP tokenizer reserves IDs 0-7 (pad/<s>/</s>/unk + 4 CaseOps operators), so sp.encode cannot emit ID 1 naturally, and train_gpt.py:_find_docs (line 2209) requires BOS markers with no fallback. Training ran because _init_shard:408-409 falls back to bos_idx=[0] when no BOS is found; the phased TTT eval has no equivalent fallback.

Fix: add a BOS_ID=1 constant, prepend it to each doc's tokens, and append 0 to the byte sidecar (BOS = 0 original bytes). Matches the canonical pattern in data/download_hf_docs_and_tokenize.py:364-366.

The submitted 1.06549 metric is unaffected — val_bpb reduces to loss_sum/ln(2)/byte_sum (token counts cancel) and byte_sum is unchanged with BOS prepended. Our seed logs were measured on shards that already had BOS markers from an internal prep path; the shipped prep was the outlier.

Also adds a Reproduction sanity check section to README.md that asserts bos_count > 0 on the first val shard. Reported by @codemath3000 in PR openai#1736 comment 4285805497.
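The cancellation argument as a two-line sketch (the reduction is quoted from the message above; the function name is illustrative):

```python
# val_bpb = loss_sum / (ln 2 * byte_sum): per-token counts never enter,
# and BOS rows contribute 0 to byte_sum, so prepending BOS leaves it fixed.
import math

def val_bpb(token_losses_nats, token_byte_counts):
    # mean-loss-per-token * n_tokens / (ln2 * total_bytes) reduces to:
    return sum(token_losses_nats) / (math.log(2) * sum(token_byte_counts))
```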
Seed logs (train_seed{0,42,1234}.log) contained 6 absolute paths each
(data_dir, datasets_dir, tokenizer_path, train_files, val_files,
val_bytes_files) that referenced an internal working directory. Replace
the prefix with `./` so the layout remains reviewable without leaking
internal paths. Code size unchanged across all 3 logs (131,887 bytes).
Also drop `PHASED_TTT_ENABLED=1` from the README Run command — this env
var is not read by train_gpt.py. The two phased-TTT env vars that ARE
read (PHASED_TTT_PREFIX_DOCS, PHASED_TTT_NUM_PHASES) remain. Phased TTT
is gated by the top-level TTT_ENABLED=1 which defaults to on.
Summary
3-seed results (8×H100 80GB SXM, 10-min train / 10-min eval budgets):

| seed | val_bpb |
|------|---------|
| 42 | 1.06610 |
| 0 | 1.06473 |
| 1234 | 1.06563 |
| mean | 1.06549 (std 0.00070) |

Gate means: artifact 15,975,120 bytes (≤ 16,000,000 decimal), train_time 596.14 s (≤ 600 s), total_eval_time 397.23 s (≤ 600 s).
All three seeds clear size, train-time, and eval-time budgets with substantial headroom. 3-seed std is 0.00070 BPB — well inside the 0.005 significance floor.
Key innovation — CaseOps tokenizer + byte sidecar
CaseOps is a bijective, character-level text transform that removes English capitalization from the body of the text and records it as four operator tokens (`TITLE`, `ALLCAPS`, `CAPNEXT`, `ESC`) that become SentencePiece `user_defined_symbols`. Because the transform is fully invertible (decode(encode(s)) == s), no information is lost and BPE merges allocate vocabulary around content instead of around case variants. Ships with a per-token byte sidecar (fineweb_val_bytes_*.bin, uint16, parallel to the val shards) so BPB is computed on the ORIGINAL pre-transform UTF-8 bytes, not on the transformed representation — the score is on the same FineWeb text, just with a different tokenization front end.
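A toy, word-level sketch of the encode direction (operator spellings, word-level granularity, and escaping are simplifications; the real character-level transform with full `ESC` handling is in lossless_caps.py):

```python
# Toy CaseOps encode: factor capitalization out into operator markers so
# the body text is uniformly lowercase.
def encode_caseops_toy(text: str) -> str:
    out = []
    for word in text.split(" "):
        if len(word) > 1 and word.isupper():
            out.append("<ALLCAPS>" + word.lower())
        elif word[:1].isupper() and word[1:].islower():
            out.append("<CAPNEXT>" + word.lower())
        else:
            out.append(word)  # bijectivity needs <ESC> for literal operators
    return " ".join(out)

print(encode_caseops_toy("The NASA crew left"))
# -> "<CAPNEXT>the <ALLCAPS>nasa crew left"
```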
Rule compliance

- … `fineweb_train_*.bin` shards.
- `prepare_caseops_data.py` is deterministic given the input FineWeb doc stream.
Test plan

- … (`train_gpt.py`, `prepare_caseops_data.py`, tokenizer `.model`, 3 seed logs, `submission.json`, `README.md`, `lossless_caps.py`).
- Run `prepare_caseops_data.py` to generate the CaseOps shards + val byte sidecar.
- `SEED=42 CASEOPS_ENABLED=1 GATED_ATTN_QUANT_GATE=1 ... torchrun --standalone --nproc_per_node=8 train_gpt.py` (full env in README).
- Confirm `quantized_ttt_phased` `val_bpb` matches the logged 1.06610 (±0.0007) within seed noise.

Lineage