Record: SP8192 + CaseOps + GatedAttn + QuantGate + Loop45 + PhasedTTT — val_bpb 1.06549#1736
Conversation
… — val_bpb 1.06549 3-seed mean 1.06549 (std 0.00070) on 8×H100 SXM, all gates green:
- artifact 15,975,120 bytes mean (≤16,000,000 DECIMAL)
- train_time 596.14s mean (≤600s)
- total_eval_time 397.23s mean (≤600s)

Builds on PR openai#1530's SP8192 stack. Adopts CaseOps (lossless_caps_caseops_v1) bijective case preprocessing from PR openai#1729, with a per-token byte sidecar so BPB is scored on the original pre-transform UTF-8 bytes. Adds a learned attention out-gate (init_std=0.005) plus quant-gate scaling that recovers the ~40 KB of overhead introduced by the new control tokens, keeping every seed under the 16 MB decimal cap.

Seeds: 42 (1.06610), 0 (1.06473), 1234 (1.06563).
…ad, MP-SGD TTT 4-phase
- PR openai#1698 (GDN FLA, claimed 1.00995): BPB bug confirmed by dexhunter (~1.189 actual) + artifact size violation; effectively dead
- New technique: CaseOps bijective tokenizer (PR openai#1729/openai#1736/openai#1738) — reversible case-factoring with byte sidecar; stronger legality than casefold; awaits Issue openai#1604 ruling
- PR openai#1735 (pre-quant TTT 21ep) flagged illegal by dexhunter; PR openai#1738 builds on it, both likely void
- PR openai#1727 (MP-SGD TTT 4 phases, 1.07217): appears legal, stackable
- Merged SOTA 1.0810, Day-10 plateau; 11 days to deadline

https://claude.ai/code/session_012mo6412sGQRVjF7TDmfx31
Bulk import of dexhunter's openai#1736 unmerged submission (openai#1736, commit e100586) for reproduction as our new research baseline. Source: records/track_10min_16mb/2026-04-19_SP8192_CaseOps_GatedAttn_QuantGate_Loop45_PhasedTTT/. 9 files, ~6856 lines:
- train_gpt.py (training script)
- lossless_caps.py (bijective CaseOps transform)
- prepare_caseops_data.py (data retokenization script)
- fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model (SP tokenizer)
- README.md, submission.json, 3 per-seed training logs

No modifications to repo-root files. Spec: research/specs/008-1736-reproduction.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
After the 2026-04-19 frontier scan, rebasing the research baseline from merged SOTA openai#1493 (1.0810) to unmerged PR openai#1736 (dexhunter, claimed 1.06549). Rationale: the credible frontier moved ~0.015 bpb past merged SOTA in 10 days via witnessed, legal levers (CaseOps tokenizer, attn-out gate, phased TTT). Continuing off spec-000 leaves us behind before we try anything.

- CLAUDE.md: baseline declared; baseline-migration specs land on research directly (an exception to the exp/<slug> convention).
- research/frontier-map.md: credibility filter + dependency map.
- diary/2026-04-19-frontier-{scan,map}.md: per-PR evidence base.
- research/ideas/1736-improvement.md: three-spec migration plan.
- research/specs/008-1736-reproduction.md: spec for the reproduction run, pinned to commit 154c9b8 (openai#1736 import at e100586).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Collapse spec 008 to seed 42 only and add a one-line pre-GPTQ FP checkpoint save at runs/008-1736-reproduction/seed_42/pre_gptq.pt (env-var gated via SAVE_PRE_GPTQ=1, so the reproduction itself is unaffected when the flag is off).

Rationale: SpinQuant and subsequent quant-family experiments are purely post-training transforms, so hotstarting off a single pre-GPTQ FP checkpoint is far cheaper than retraining per spec. A single-seed comparison against openai#1736's seed-42 (1.06610, ±0.003) is apples-to-apples for screening. Cost drops ~$40 -> ~$17 for this spec and ~$10 -> ~$1-2 per downstream quant experiment.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
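For reference, a minimal sketch of what that env-var gate could look like (the SAVE_PRE_GPTQ flag name is from the spec; the function name and call site are assumptions):

```python
# Hypothetical sketch of the env-gated pre-GPTQ checkpoint save.
import os
import torch
import torch.nn as nn

def maybe_save_pre_gptq(model: nn.Module,
                        path: str = "runs/008-1736-reproduction/seed_42/pre_gptq.pt"):
    # Default-off: the reproduction is byte-identical unless SAVE_PRE_GPTQ=1
    # is exported before launch.
    if os.environ.get("SAVE_PRE_GPTQ", "0") == "1":
        os.makedirs(os.path.dirname(path), exist_ok=True)
        torch.save(model.state_dict(), path)
```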
First lever layered on the new openai#1736 baseline: Hadamard rotation of weight matrices before GPTQ quantization, hotstarted off spec 008's pre_gptq.pt FP checkpoint. No retraining. Witnessed at a claimed -0.005 bpb on PR openai#1695 (X-Abhishek-X) on a openai#1529-adjacent base; expected to compose cleanly with openai#1736 since the quant stage is orthogonal to CaseOps / attention gates / phased TTT.

Rotation is a post-training transform with three classes (residual-stream, per-layer attn, per-layer MLP); the FP forward pass is invariant by construction, only the quantization error drops. Cost ~$6 (hotstart off the spec 008 checkpoint) vs ~$30 for a full retrain. The same hotstart checkpoint is reused by future quant experiments (per-group bit, AR-selfgen calib, AWQ).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
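A toy round-to-nearest demonstration of why rotating before quantization can help (the real pipeline uses GPTQ, not RTN; this only illustrates the outlier-spreading effect):

```python
# Toy demo: an orthogonal Hadamard spreads outlier channels across the
# whole basis, shrinking the per-tensor quant scale and hence the error.
import torch

def hadamard(n):                        # Sylvester construction, n = 2^k
    H = torch.ones(1, 1)
    while H.shape[0] < n:
        H = torch.cat([torch.cat([H, H], 1), torch.cat([H, -H], 1)], 0)
    return H / H.shape[0] ** 0.5        # orthonormal: H @ H.T == I

def rtn(W, bits=4):                     # per-tensor round-to-nearest
    scale = W.abs().max() / (2 ** (bits - 1) - 1)
    return (W / scale).round().clamp(-2 ** (bits - 1), 2 ** (bits - 1) - 1) * scale

d = 256
W = torch.randn(d, d)
W[:, :4] *= 20                          # a few outlier input channels
R = hadamard(d)

err_plain = (rtn(W) - W).norm()
err_rot = (rtn(W @ R) @ R.T - W).norm() # quantize in rotated basis, rotate back
print(err_plain, err_rot)               # rotated error is typically far lower
```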
Based on reading train_gpt.py at commit 154c9b8:

Good: RMSNorm is gamma-free (line 529), so the usual gamma-fold step doesn't apply; RMSNorm is rotation-equivariant directly.

Bad: openai#1736 has five OTHER per-channel multipliers on residual flow (attn_scale, mlp_scale, resid_mix, skip_weights, skip_gates). These are the real fold targets, not RMSNorm. resid_mix is pre-norm and cannot be cleanly folded.

Split into three SpinQuant modes selectable by SPINQUANT_MODE:
- internal_only (R_a, R_m per layer; no residual rotation)
- full (internal + R0, with attn_scale/mlp_scale/skip folds and a resid_mix freeze-to-mean compromise)
- port_1695 (conditional on openai#1695's diff being meaningfully different)

All three run back-to-back on one pod hotstarted off spec 008's final_model.pt. ~30 min total GPU, ~$17-22 budget, one eval. research/ideas/spinquant-integration-notes.md captures the full design analysis (per-multiplier fold feasibility, three-option tradeoff, shared-code plan).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
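For the fold targets named above, the generic per-channel fold pattern looks like this (a sketch only; it assumes the multiplier post-scales a linear's output, which is exactly what pre-norm resid_mix violates):

```python
# A per-channel output scale s can be absorbed once into the weight as
# diag(s) @ W, removing the runtime multiply.
import torch

d = 64
W = torch.randn(d, d)
s = torch.rand(d) + 0.5            # per-channel multiplier, e.g. attn_scale
x = torch.randn(8, d)

y = (x @ W.T) * s                  # multiplier applied at runtime...
W_folded = s[:, None] * W          # ...or folded into the weight once
assert torch.allclose(y, x @ W_folded.T, rtol=1e-4, atol=1e-5)
```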
Added SPINQUANT_MODE=baseline as a fourth variant that applies no rotation — it just loads final_model.pt and runs serialize/deserialize/eval/TTT on it. Two purposes:
1. Closes the loop on spec 008's missed post-TTT number (the watcher stopped the pod before the TTT eval ran). No separate $3 eval-only rerun needed.
2. Provides the apples-to-apples local reference for measuring the three SpinQuant variants' deltas — removes any cross-pod bf16 drift from the comparison.

Order: baseline -> internal_only -> full -> port_1695, sequential on one pod. Gate: if baseline lands outside openai#1736's 1.06610 ± 0.003, halt before running rotations (it would mean the spec 008 reproduction is off). Total cost ~$27 (was $22); it absorbs ~$3 of an otherwise-separate eval rerun, so the net increment is ~$2 for four measured numbers.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Two new files in the openai#1736 submission dir:

spinquant_hotstart.py (~360 LOC):
- Imports Hyperparameters/GPT/serialize/deserialize/eval_val/eval_val_ttt_phased/BatchedTTTLoRA/etc. from train_gpt.py.
- Modes: baseline, internal_only (R_a only, per-layer per-KV-group, d_head rotation on V-output and O-input). full and port_1695 are stubs that raise NotImplementedError with an explanation.
- Pipeline: load FP state_dict from HOTSTART_FP_CKPT -> apply rotations in-place on banked qo_bank/kv_bank -> optional pre-quant diagnostic eval -> call serialize() (GPTQ+compress) -> deserialize() -> quantized eval -> phased TTT eval -> write final.json.
- Reproduces the TTT eval block from train_and_eval (lines 2997-3075) in _run_ttt_eval() rather than refactoring the source file.

test_rotation_invariance.py (~250 LOC):
- CPU-only, standalone (no train_gpt.py import due to flash_attn_3/triton module-level deps).
- Self-contained minimal attention forward: Q/K/V projection from the banked tensors, RMSNorm on Q and K (matches the real model's bound on attention logits; without this, trained weights saturate softmax and float noise in V amplifies catastrophically).
- Tests baseline (bit-exact identity) and internal_only (rel tolerance 1e-4) against either synthetic random weights or spec 008's final_model.pt. Both pass cleanly (rel_max ~1e-6 on the real checkpoint).
- Can load either banked (qo_bank/kv_bank) or unbanked (blocks.N.attn.*.weight) state_dict format.

Spec 009 updated: scope reduced to 2 modes (baseline, internal_only) for this session; full and port_1695 deferred. Rationale in the spec: MLP LeakyReLU-squared breaks R_m float-invariance, and resid_mix can't be cleanly folded through RMSNorm; both need design before implementation.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
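The property the internal_only test asserts, as a standalone sketch (synthetic weights, a single head, no RMSNorm or softmax — the real test adds both):

```python
# A per-head rotation applied to V's output channels and O's input channels
# cancels in the attention output (fp64 for tight tolerance).
import torch

torch.manual_seed(0)
d = 64
W_v = torch.randn(d, d, dtype=torch.float64)   # stand-in V projection
W_o = torch.randn(d, d, dtype=torch.float64)   # stand-in O projection
x = torch.randn(8, d, dtype=torch.float64)

R = torch.linalg.qr(torch.randn(d, d, dtype=torch.float64)).Q  # orthogonal

W_v_rot = R.T @ W_v     # v_rot = x @ W_v_rot.T = (x @ W_v.T) @ R
W_o_rot = W_o @ R       # O consumes the rotated basis

y = (x @ W_v.T) @ W_o.T
y_rot = (x @ W_v_rot.T) @ W_o_rot.T
assert torch.allclose(y, y_rot)   # R @ R.T == I cancels
```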
Cleanup pass to resolve inconsistencies between the spec and what's actually in spinquant_hotstart.py + test_rotation_invariance.py:
- Title + scope: 2-mode sweep (baseline, internal_only); full and port_1695 explicitly deferred to a follow-up spec.
- Checkpoint path: pre_gptq.pt (what execution's spec-008 patch produced, after _unbank_state_dict), not final_model.pt.
- Accept criteria: preflight via test_rotation_invariance.py (ALL TESTS PASS), then per-mode on pod.
- Rotation structure: trimmed to just the implemented R_a class with exact banked-tensor indexing. R_0 / R_m / skip-stream / RMSNorm-fold sections moved to 'not implemented (deferred)'.
- RMSNorm-fold section removed entirely: openai#1736's RMSNorm is gamma-free (F.rms_norm with no weight arg), so no fold is needed.
- Code-changes section: points at the files on disk instead of TODO pseudocode.
- Execution protocol: 2 modes back-to-back on 8xH100, explicit preflight step.
- Hardware ladder: 8xH100 required (phased TTT is 8-rank DDP).
- Cost estimate: ~$15 total for 2 modes.
- Open questions: reframed around unbanked-checkpoint load, bf16 drift, GPTQ interaction, and phased-TTT compatibility.
- What this spec does NOT do: clarified that residual rotation, R_m, resid_mix, and port_1695 are all deferred.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…approach

Read openai#1695's diff. Their approach is fundamentally different from the static-weight-rotation + folds design I had in mind for 'full' mode. They do ONLINE activation rotation: 4 global Hadamard rotations inserted as x @ R matmuls at 4 forward-pass sites (residual->qkv, attn-out->proj, residual->fc, hidden->proj). GPTQ then quantizes in the rotated basis; rotated Hessians keep the quant-side accounting honest. Rotations are OFF during training, ON after deserialize for eval+TTT.

Why this matters: their scheme sidesteps BOTH blockers that made the full mode complicated:
- LeakyReLU non-equivariance: R_mlp_proj_in is applied AFTER the LeakyReLU-square, not across it.
- resid_mix: rotations are per-linear-input and never touch the residual stream. All per-channel multipliers (attn_scale, mlp_scale, resid_mix, skip_weights) operate in an unchanged basis.

No float invariance — the model IS different post-rotation. The bet is that rotated-basis GPTQ delivers lower quant error and that the perturbation is smaller than the savings.

Implication: deprecate the 'full' static-rotation-with-folds plan in favor of a future 'port_1695' spec that ports their online scheme. internal_only mode from spec 009 remains useful as an independent data point (R_a only, fp-invariant).

Spec 010 (tapered WD) drafted as an independent parallel track:
- Ports PR openai#1729's WD_TAPER_START_FRAC=0.70, WD_TAPER_FINAL_MULT=0.50
- Muon-WD-only taper on top of openai#1736's existing schedule
- Full retrain on 8xH100, single seed, ~$20
- Independent of spec 009 (different pod, no shared state)
- Can run in parallel with 009's eval-only sweep

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…n sprint

Session-narrative entry covering today's work:
- Frontier filter + baseline migration from merged SOTA openai#1493 (1.0810) to unmerged openai#1736 (1.0655): rationale, CLAUDE.md update.
- Spec 008 run partial result (training reproduced openai#1736 within +0.00016 at pre-quant; the post-TTT gate number was not captured due to a watcher bug; projected pass ~1.06626).
- Spec 009 design evolution through three scope cuts: 4 modes -> unified sweep -> +baseline mode -> cut to 2 modes after discovering real architectural blockers (MLP LeakyReLU breaks R_m, resid_mix doesn't fold cleanly).
- openai#1695 diff discovery: they do online activation rotation, not static weight rotation. Sidesteps both LeakyReLU and resid_mix. Reframes 'full' mode -> port_1695 mode as the next quant-side spec.
- Specs 010 (port_1695, design only) and 011 (tapered WD, design only) drafted. Only spec 009 is truly runnable right now.

Closes with a state-of-play table, modal plan, lessons learned, and open questions for the next session.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Implements the port_1695 SpinQuant variant from PR openai#1695 onto the openai#1736 stack. All changes are env-var-gated (SPINQUANT_ENABLED=0 default) so spec 008 and spec 009's baseline/internal_only modes are unaffected bit-for-bit.

train_gpt.py changes (+247 lines):
- import hashlib
- Hyperparameters.spinquant_enabled, spinquant_seed
- CastedLinear._sq_active class flag (default False)
- Utility block: _stable_seed, _hadamard_rotation, install_spinquant_rotations, _SQ_KEY_TO_TAG, _spinquant_rotate_sd_and_H
- 4 forward-path hook sites (2 each in CausalSelfAttention, MLP, _block_with_lora, _parallel_block_with_lora):
  - pre-QKV: x_qkv = x @ R_attn_in
  - pre-attn-proj: y @ R_attn_proj_in
  - pre-fc: x @ R_mlp_in
  - post-activation pre-proj: hidden @ R_mlp_proj_in
- serialize(): call _spinquant_rotate_sd_and_H after Hessian collection and before GPTQ. Rotates weights (W @ R) and Hessians (R.T @ H @ R).
- deserialize(): install_spinquant_rotations + set _sq_active=True after loading rotated weights.
- MLP.forward: disable fused kernel when SpinQuant is active.
- LoRA (TTT path) uses unrotated n; the base path uses rotated n_qkv.

spinquant_hotstart.py changes:
- port_1695 mode no longer raises NotImplementedError. Sets h.spinquant_enabled=True and h.spinquant_seed; train_gpt.py's machinery does the rest.

Math: orthogonal R means R @ R.T == I, so x_rot @ W_rot.T = (x @ R) @ (W @ R).T = x @ R @ R.T @ W.T = x @ W.T. Pre-quant forward is bit-identical to unrotated; GPTQ sees a rotated basis where outliers are spread more evenly and quantization error drops.

Spec 010 doc updated to reflect the implementation state. Execution runs via SPINQUANT_MODE=port_1695 on spinquant_hotstart.py. Not tested on GPU — flash_attn_3 is not available on the dev box. Syntax clean; the first pod run will verify end-to-end behavior.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
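The identity in the Math paragraph, checked numerically (fp64; illustration only, not the repo's code):

```python
# With orthogonal R, rotating a linear's input activation and its weight
# leaves the output unchanged (exact in exact arithmetic).
import torch

torch.manual_seed(0)
d_in, d_out = 128, 64
x = torch.randn(8, d_in, dtype=torch.float64)
W = torch.randn(d_out, d_in, dtype=torch.float64)  # nn.Linear convention: y = x @ W.T
R = torch.linalg.qr(torch.randn(d_in, d_in, dtype=torch.float64)).Q

y_plain = x @ W.T
y_rot = (x @ R) @ (W @ R).T    # rotated activation, rotated weight
assert torch.allclose(y_plain, y_rot)   # R @ R.T == I cancels
```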
Continuation of the morning diary. Covers:
- Spec 009 baseline closed spec 008's gate at 1.06728 (matches openai#1736's 1.06610 within bf16 noise). internal_only null (+0.00003).
- Spec 010 port_1695 also null in aggregate (-0.00005), BUT per-batch analysis revealed a striking regime-dependent effect: rotation helps long-context docs (-0.0064 bpb on dl>1000) and hurts short-context docs (+0.0146 on dl<300). The null is a cancellation, not an absence of effect.
- 'TTT substitutes for rotation' hypothesis revised — the rotation delta is ~0 at both pre-TTT and post-TTT stages. What rotation actually does is shift where in the doc-length distribution the model is strong, without changing the aggregate.
- Designed + implemented spec 010b (SPINQUANT_SITES env var) to isolate which sites (attn vs MLP) carry the help vs hurt. Ready for execution, ~$25.
- Lessons: look at per-batch trajectory data before concluding a null is null. Length-sorted running averages are systematically biased. Don't pivot prematurely from a signal you haven't fully interrogated.

Still $163 under project budget.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Closes the SpinQuant investigation arc with spec 010b's results and an honest retrospective on the false-signal episode. Key findings:
- All 5 SpinQuant variants (baseline, internal_only, port_1695, attn_only, mlp_only) land within 0.00009 bpb at final val_bpb. Pure null. openai#1736 has seed std ~0.00070; we are 10x below that.
- Pt. 2's "regime-dependence is exploitable" hypothesis refuted. attn_only ≈ baseline on rank 0 (attention rotation does nothing); mlp_only has the inverse regime from port_1695 (hurts long, helps short); neither subset comes close to port_1695's emergent rank-0 trajectory lead.
- Rank-0 rb spread across variants: 0.0075 bpb. Final val_bpb spread across variants: 0.000085 bpb. 80x compression from 8-rank aggregation + uniform TTT LoRA absorption.

Mistake I owned up to: read rank-0 rb 1.0657 for mlp_only at batch 780 and suggested "mlp_only might actually net positive." final.json came out +0.000005 above baseline. Rank-0 rb is rank 0's 1/8 slice, not a preview of the submission number.

Methodology corrections for future runs:
- Always check final.json before any trend interpretation.
- Rank-0 rb is a progress indicator, not a metric preview.
- When the pre-TTT diagnostic_quantized spread is < 0.001, post-TTT will be near-identical (TTT LoRA dominates).

Budget: spent ~$52 of $200 total. 10 days left. Next: spec 011 (tapered Muon WD retrain) — upstream of TTT, might unlock something TTT can't absorb. Patch still unwritten.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Scanned 200 PRs from 2026-04-11..20. After exclusion filters, 3 candidates beat spec 011's expected Δ: openai#1682 (GradPower Muon p=0.9), openai#1648 (xIELU + per-layer QK gain), openai#1555 (Tap-In eval cache). Full artifacts at ~/competition-pr/pr-scan-2026-04-20/.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…1716)

Two orthogonal training-time levers queued behind spec 011:
- bpb-weighted-loss.md (port openai#1519): weight CE by UTF-8 bytes per token. Aligns the training objective with the eval metric. Risk: SP8192 vocab destabilization (the author warns about large vocabs) + CaseOps byte-LUT accounting (~1 hr of careful code).
- bigram-hash-embed.md (port openai#1716): 16384×32 hash-table bigram embed added to the token embedding pre-block-0. ~540K params / ~400KB artifact. openai#1736 genuinely lacks this despite its prevalence in competitive lineages.

Recommended sequencing: 011 → 012 (QK) → 013 (BigramHash, lower risk) → 014 (BPB-weighted, higher risk).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
110 LOC pure addition to train_gpt.py, fully env-gated by BIGRAM_HASH_ENABLED=0/1. Default-off invariant: with the env unset, the forward pass, state_dict, and optimizer param list are byte-identical to baseline.

Components:
- BigramHashEmbedding(nn.Module): embed(buckets, dim) + CastedLinear proj(dim, model_dim). proj._zero_init=True -> identity at step 0. Hash: ((prime_a * curr) ^ (prime_b * prev)) % buckets. Position-0 fallback: prev = curr (self-bigram). Cross-doc leakage is not special-cased, matching openai#1736's SmearGate convention.
- GPT.__init__: creates self.bigram_embed when enabled, else None.
- forward_logits + forward_ttt: additive merge of bigram(input_ids) into tok_emb(input_ids) before SmearGate; attr-guarded.
- Optimizers: embed.weight -> AdamW optimizer_tok (embed_wd); proj.weight -> Muon matrix_params.
- GPTQ hessian hooks: bigram_embed.embed output -> (dim,dim) hessian; bigram_embed.proj input -> (dim,dim) hessian (proj is <=65536 numel so fp16 passthrough; harmless hook).
- Startup log line echoing the config.

Sizing: 16384*32 int6 embed ~= 393KB; 512*32 fp16 proj = 32KB. Total ~425KB added to the artifact; a budget dry-run is needed before launch.

Env vars (defaults): BIGRAM_HASH_ENABLED=0, BIGRAM_HASH_BUCKETS=16384, BIGRAM_HASH_DIM=32, BIGRAM_HASH_PRIME_A=36313, BIGRAM_HASH_PRIME_B=27191.

Bug lesson learned from exp/training-bundle commit 8d54854: when Edit's old_string only captures part of a for-loop body, trailing loop statements get pushed outside the loop and may be absorbed by nearby conditional blocks. This patch is pure prepend/append style (no splits of existing blocks), so that failure mode is avoided.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
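A condensed sketch of the module as described above (CastedLinear is stubbed with nn.Linear, and explicit zero-init stands in for proj._zero_init):

```python
# Bigram hash embedding: hash (prev, curr) token pairs into a small table,
# project up to model_dim, and add to the token embedding upstream.
import torch
import torch.nn as nn

class BigramHashEmbedding(nn.Module):
    def __init__(self, buckets=16384, dim=32, model_dim=512,
                 prime_a=36313, prime_b=27191):
        super().__init__()
        self.buckets, self.prime_a, self.prime_b = buckets, prime_a, prime_b
        self.embed = nn.Embedding(buckets, dim)
        self.proj = nn.Linear(dim, model_dim, bias=False)
        nn.init.zeros_(self.proj.weight)   # identity contribution at step 0

    def forward(self, input_ids):          # [B, T] int64 token ids
        prev = torch.roll(input_ids, 1, dims=-1)
        prev[..., 0] = input_ids[..., 0]   # position-0 fallback: self-bigram
        h = ((self.prime_a * input_ids) ^ (self.prime_b * prev)) % self.buckets
        return self.proj(self.embed(h))    # added to tok_emb upstream
```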
Compiled a reference list for the architecture-side research thread, including:
- XSA identified as Exclusive Self-Attention (Apple, arXiv 2603.09078). Matches openai#1736's _xsa_efficient exactly.
- Universal Transformer (Dehghani 2018) and ACT (Graves 2016) as foundational recurrence references.
- Key 2025 finding from the ILR paper (arXiv 2505.01855): allocating more iterations to EARLIER layers yields optimal results. openai#1736's Loop45 (middle layers) may be sub-optimally positioned.
- Parallel-residuals literature: GPT-J / PaLM well-studied; multi-lane variants (Branchformer etc.) mostly in vision, thin in NLP.
- Synthesis of candidate variants prioritized by novelty × EV × cost.
- Proposed next step: instrument openai#1736 to log cross-pass cosine similarity during training. If high → cross-pass XSA is worth trying. If already low → a different variant is needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Added a section on 'when to activate recurrence' research. Key findings:
- ProRes, SGT, and Staged Training all recommend progressive/curriculum activation over hard switches.
- The literature has conflicting claims about WHERE convergence happens first (shallow vs deep layers).
- Consistent claim: progressive beats hard switch for stability.
- openai#1736's enable_looping_at=0.35 is a hard switch — suboptimal per the literature.

Candidate variants identified, ranked by implementation cost: env-var sweeps (1, 2) vs code-change ramps (3, 4).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
… candidates

User shared a deep timeline of all recurrence experiments in the PG competition (openai#8 through openai#1739). Several of my previously proposed experiments have ALREADY BEEN TESTED ON THIS STACK and shown to fail.

KILLED:
- Earlier timing sweep: openai#1726 showed 0.15 is +0.050 worse; openai#1739 showed step-0 is catastrophic (1.3936 bpb).
- Progressive ramp: openai#1663 showed hard-onset = smooth, no difference.
- Position shift: openai#1726 showed layers 2-7 +0.163 worse and a layer 5-6 shift +0.006 worse — layers 3-5 ARE the empirical sweet spot.

Also corrected the baseline config: openai#1736 uses LOOP_START=3 LOOP_END=5 (three layers: 3, 4, 5 — "Loop345"), not Loop45 as the directory name suggests. 3 layers × 3 passes = 17 virtual layers.

VIABLE candidates:
- Recur-Alpha (openai#1714, Anakintano): learnable scalar per looped block, init 0 → identity. 6 params. The author's grant ran out before the TTT eval, so composition with openai#1736's phased TTT is genuinely open. NEW TOP PICK.
- Cross-pass XSA: still novel, untested in any PR.
- Loop3-6 variant (openai#1678): tashapais is running it; might wait for the result.

Recommendation updated: port Recur-Alpha onto openai#1736 as spec 015. ~$25, identity-at-init (safe), 30 LOC, a direct recurrence question.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
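A sketch of the Recur-Alpha wrapper as summarized above (the x + α·block(x) blend form is an assumption; openai#1714's exact wiring may differ):

```python
# One learnable scalar per looped block, zero-init so each extra
# recurrence pass starts as the identity and is learned from there.
import torch
import torch.nn as nn

class RecurAlpha(nn.Module):
    def __init__(self, block: nn.Module):
        super().__init__()
        self.block = block
        self.alpha = nn.Parameter(torch.zeros(()))   # init 0 -> identity

    def forward(self, x):
        return x + self.alpha * self.block(x)
```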
Shelving actions:
- Wrote research/evaluations/014-bpb-weighted-loss.md with full rationale and revisit criteria (post-deadline only).
- Added a SHELVED status banner to the top of the spec file.
- Added an experiments.md row marking 014 as 🗄️ SHELVED (permanently).

Decision: do NOT retune. The magnitude is too large (+0.0619 = 62× the shelve threshold) to be recoverable via an LR sweep. The three-null pattern (011, 013, 014) confirms that incremental ports from different-stack authors do not transfer to openai#1736. Moving budget to spec 015 (Recur-Alpha).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Replica of the spec-000-era lr_schedule.py for openai#1736/spec-015's stack. Shows all four training-time schedules on one figure:
1. lr_mul (warmdown) — wallclock-based, starts at step 1207
2. effective LR — MATRIX_LR × lr_mul, concrete numbers
3. Muon momentum — step-based warmup, plateau at step 1500
4. looping_active — hard switch at step 1690 (wallclock 35%)

Key non-obvious finding: warmdown (step 1207) begins BEFORE looping activates (step 1690). When recurrence kicks in, the LR is already ~17% decayed. This sequencing is baked into openai#1736's defaults.

Five distinct training regimes:
- [0, 1207]: muon momentum warming, nothing else changing
- [1207, 1500]: warmdown begins, muon still warming
- [1500, 1690]: warmdown continues, muon plateau, looping still off
- [1690]: looping activates (architectural change)
- [1690, 4828]: all settled, just linear LR decay

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
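A rough reconstruction of the overlay script (step thresholds are from the text; the functional shapes — especially the wallclock-based warmdown rendered as linear-in-steps — are assumptions; effective LR is omitted since it shares lr_mul's shape):

```python
# Overlay of the schedules so the warmdown-before-looping sequencing is visible.
import numpy as np
import matplotlib.pyplot as plt

TOTAL = 4828
steps = np.arange(TOTAL)

lr_mul = np.where(steps < 1207, 1.0, 1.0 - (steps - 1207) / (TOTAL - 1207))  # shape assumed
momentum = np.clip(steps / 1500, 0.0, 1.0)        # warmup shape assumed, plateau at 1500
looping = (steps >= 1690).astype(float)           # hard switch at 35% wallclock

fig, ax = plt.subplots()
ax.plot(steps, lr_mul, label="lr_mul (warmdown)")
ax.plot(steps, momentum, label="muon momentum")
ax.plot(steps, looping, label="looping_active")
for s in (1207, 1500, 1690):
    ax.axvline(s, linestyle=":", linewidth=0.8)   # regime boundaries
ax.set_xlabel("step")
ax.legend()
fig.savefig("lr_schedule.png")
```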
- diary/2026-04-21-recur-alpha-findings.md — the full story of the spec 015/016 single-seed screens: α trajectories side-by-side, 5 findings (α>1 on pass-2, <1 on pass-3 at depth, depth-monotonicity inverts between passes, the plateau is path-dependent, late-training rate unchanged), a full caveats section, ranked next steps.
- research/ideas/beating-1736-note.md — four-run throughput + pipeline comparison (008/015/016/openai#1736). Works backward from the 1.06610 target to a 0.00183 gap on pre-quant post-EMA; matched throughput alone gives 3.3× margin over the gap. The risk ranking has TTT composition as the one unknown (GPTQ cost is validated at +0.00947 parity). Concludes: a single matched-clock NA run with the bug-fixed TTT pipeline (~$10-15) settles the whole story.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Primary submission-candidate run for the recur-alpha family. Same commit as 016 (4dd2d63); NA 8xH100 to eliminate JP throughput variance; full training + GPTQ + phased-TTT pipeline end-to-end (no EVAL_ONLY_CHECKPOINT bypass, which OOM'd in 016 post-hoc). Goal: post-TTT val_bpb <= 1.06550 (beat openai#1736's 1.06610 by >= 0.0005).

Runs regardless of 016b's throughput-tax outcome:
- If no tax: high-confidence attempt at an openai#1736 beat
- If tax: diagnostic for TTT × recur-alpha composition
- Either way we capture the post-TTT number that 016 post-hoc missed

Single seed 42 first, 3-seed conditional on a clear-promote bucket. Costs ~$10 single-seed, ~$30-34 with 3-seed confirmation. Includes a conditional decision tree on 016b branches and tok/s-logging requirements for direct throughput comparison with 016b's 2xH100 data.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…016 full pipeline"

NA-1 has no 8xH100 capacity today. Reframe spec 017 as: run spec 016's commit (4dd2d63) with the full training + GPTQ + phased-TTT pipeline end-to-end on whichever region has capacity (JP is fine). The primary purpose is capturing the post-TTT val_bpb that 016's screen (killed early) and 016's post-hoc TTT eval (OOM'd) both missed. On JP the expected post-TTT is ~1.0679-1.0682 — close to, but probably not beating, openai#1736's 1.06610. Still worth it: a real composition measurement replaces the projection chain.

Path fixes: JP volume jlxvxeiol4 mounts at /runpod (not /workspace); the example launch command is rewritten accordingly. Memory entry added to the cross-session reference.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Submission-quality test of constant-α (017 endpoint values) with the full training + GPTQ + phased-TTT pipeline. Pins commit 2895db3 on exp/recur-alpha-constant-full, which extends 018c's constant-α wiring to the TTT forward path. Target: beat openai#1736's 1.06610 post-TTT. Expected range 1.0650-1.0675 based on 018c's 92% throughput recovery + the TTT bug fix. Single seed 42 first, 3-seed conditional on a clear promote.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
- 4,697 steps (vs 4,828 for 008) due to a slow JP node, not constant-α overhead
- Per-step quality strictly better than 008/017 at matched steps
- Linear extrapolation to step 4828 → post-TTT ~1.0606 (beats openai#1736)
- Recommendation: rerun on an NA-1 pod

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
…ai#1736 by 0.00018 Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Reverts the frozen-α container (buffer or Parameter(requires_grad=False)) back to the learnable Parameter form of 017. Combines 017's recipe with the 021e stack's TTT α fix (931bd7c) and algebraic blend form (d761a22) — both of which 017 was missing.

Motivation:
- 017's pre-quant post-EMA (1.06861) was the BEST of any 8H run this session. All frozen-α variants (019b at 1.06951, 021e at 1.06944, etc.) land ~0.0008-0.001 worse.
- 017's post-TTT (1.06733) was held back by the TTT α bug (α not applied in forward_ttt). Fixing this should recover ~0.002 of TTT delta.
- The algebraic blend form (matches 019b-original's kernel pattern) adds another potential 0.001-0.003 improvement.
- Combined projected post-TTT: 1.07781 - 0.01249 = 1.06532 → a decisive beat of openai#1736 by 0.00078 and of 019b by 0.00096.

Implementation: 3-line change on top of 021f (0ad5269). Remove the register_buffer + endpoint-tensor construction, replace with nn.Parameter(torch.ones(...), requires_grad=True). The optimizer guard already handles requires_grad=True correctly (α re-enters scalar_params). dtype=bfloat16 retained from the 021e stack (vs 017's original fp32) for blend-kernel consistency; no cast needed at blend time.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@codemath3000 thanks for the reproducer — confirmed and patched. This is a prep-script bug only; training and the submitted metric are unaffected.

**Root cause.** Line 157 of `prepare_caseops_data.py` writes each doc's tokens without prepending the `<s>` BOS token (ID 1). The SP tokenizer reserves IDs 0-7, so `sp.encode` can never emit ID 1 naturally, and `train_gpt.py`'s `_find_docs` requires BOS markers with no fallback — hence the ZeroDivisionError in the phased TTT eval.

**Scope.** The submitted 1.06549 is on valid data — our seed runs used shards produced by a different internal prep path that already prepends BOS.

**Fix.** Prepend a BOS to every doc:

```python
# near module top, with other constants
BOS_ID = 1

# inside the per-doc loop
for text in _iter_docs(args.docs):
    transformed = encode_lossless_caps_v2(text)
    token_ids = [BOS_ID] + sp.encode(transformed, out_type=int)
    if n_docs < args.val_docs:
        byte_counts = _token_original_byte_counts(sp, text, transformed)
        val_buf_tokens.extend(token_ids)
        val_buf_bytes.append(0)  # BOS = 0 original bytes
        val_buf_bytes.extend(int(b) for b in byte_counts)
    else:
        train_buf.extend(token_ids)
```

Matches the canonical pattern in `data/download_hf_docs_and_tokenize.py`. Pushed in commit d7263a3 on this branch (and fe7c309 on PR #1769, which ships the same prep script). README now includes a Reproduction sanity check section that asserts bos_count > 0 on the first val shard.
External reproductions of this submission failed with ZeroDivisionError in the phased TTT eval because the shipped prep script did not prepend the <s> control token (ID 1) to each doc. The SP tokenizer reserves IDs 0-7 (pad/<s>/</s>/unk + 4 CaseOps operators), so sp.encode cannot emit ID 1 naturally, and train_gpt.py:_find_docs (line 2209) requires BOS markers with no fallback. Training ran because _init_shard:408-409 falls back to bos_idx=[0] when no BOS is found; the phased TTT eval has no equivalent fallback.

Fix: add a BOS_ID=1 constant, prepend it to each doc's tokens, and append 0 to the byte sidecar (BOS = 0 original bytes). Matches the canonical pattern in data/download_hf_docs_and_tokenize.py:364-366.

The submitted 1.06549 metric is unaffected — val_bpb reduces to loss_sum/ln(2)/byte_sum (token counts cancel) and byte_sum is unchanged with BOS prepended. Our seed logs were measured on shards that already had BOS markers from an internal prep path; the shipped prep was the outlier.

Also adds a Reproduction sanity check section to README.md that asserts bos_count > 0 on the first val shard. Reported by @codemath3000 in PR openai#1736 comment 4285805497.
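The cancellation argument as a two-line sketch (the reduction is quoted from the message above; the function name is illustrative):

```python
# val_bpb = loss_sum / (ln 2 * byte_sum): per-token counts never enter,
# and BOS rows contribute 0 to byte_sum, so prepending BOS leaves it fixed.
import math

def val_bpb(token_losses_nats, token_byte_counts):
    # mean-loss-per-token * n_tokens / (ln2 * total_bytes) reduces to:
    return sum(token_losses_nats) / (math.log(2) * sum(token_byte_counts))
```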
Seed logs (train_seed{0,42,1234}.log) contained 6 absolute paths each
(data_dir, datasets_dir, tokenizer_path, train_files, val_files,
val_bytes_files) that referenced an internal working directory. Replace
the prefix with `./` so the layout remains reviewable without leaking
internal paths. Code size unchanged across all 3 logs (131,887 bytes).
Also drop `PHASED_TTT_ENABLED=1` from the README Run command — this env
var is not read by train_gpt.py. The two phased-TTT env vars that ARE
read (PHASED_TTT_PREFIX_DOCS, PHASED_TTT_NUM_PHASES) remain. Phased TTT
is gated by the top-level TTT_ENABLED=1 which defaults to on.
Summary
3-seed results (8×H100 80GB SXM, 10-min train / 10-min eval budgets):

| seed | val_bpb |
|------|---------|
| 42 | 1.06610 |
| 0 | 1.06473 |
| 1234 | 1.06563 |
| mean | 1.06549 (std 0.00070) |

Gate means: artifact 15,975,120 bytes (≤ 16,000,000 decimal), train_time 596.14 s (≤ 600 s), total_eval_time 397.23 s (≤ 600 s).
All three seeds clear size, train-time, and eval-time budgets with substantial headroom. 3-seed std is 0.00070 BPB — well inside the 0.005 significance floor.
Key innovation — CaseOps tokenizer + byte sidecar
CaseOps is a bijective, character-level text transform that removes English capitalization from the body of the text and records it as four operator tokens (`TITLE`, `ALLCAPS`, `CAPNEXT`, `ESC`) that become SentencePiece `user_defined_symbols`. Because the transform is fully invertible (decode(encode(s)) == s), no information is lost and BPE merges allocate vocabulary around content instead of around case variants. Ships with a per-token byte sidecar (fineweb_val_bytes_*.bin, uint16, parallel to the val shards) so BPB is computed on the ORIGINAL pre-transform UTF-8 bytes, not on the transformed representation — the score is on the same FineWeb text, just with a different tokenization front end.
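A toy, word-level sketch of the encode direction (operator spellings, word-level granularity, and escaping are simplifications; the real character-level transform with full `ESC` handling is in lossless_caps.py):

```python
# Toy CaseOps encode: factor capitalization out into operator markers so
# the body text is uniformly lowercase.
def encode_caseops_toy(text: str) -> str:
    out = []
    for word in text.split(" "):
        if len(word) > 1 and word.isupper():
            out.append("<ALLCAPS>" + word.lower())
        elif word[:1].isupper() and word[1:].islower():
            out.append("<CAPNEXT>" + word.lower())
        else:
            out.append(word)  # bijectivity needs <ESC> for literal operators
    return " ".join(out)

print(encode_caseops_toy("The NASA crew left"))
# -> "<CAPNEXT>the <ALLCAPS>nasa crew left"
```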
Rule compliance

- … `fineweb_train_*.bin` shards.
- `prepare_caseops_data.py` is deterministic given the input FineWeb doc stream.
Test plan

- … (`train_gpt.py`, `prepare_caseops_data.py`, tokenizer `.model`, 3 seed logs, `submission.json`, `README.md`, `lossless_caps.py`).
- Run `prepare_caseops_data.py` to generate the CaseOps shards + val byte sidecar.
- `SEED=42 CASEOPS_ENABLED=1 GATED_ATTN_QUANT_GATE=1 ... torchrun --standalone --nproc_per_node=8 train_gpt.py` (full env in README).
- Confirm `quantized_ttt_phased` `val_bpb` matches the logged 1.06610 (±0.0007) within seed noise.

Lineage