
Record: SP8192 + CaseOps + GatedAttn + QuantGate + Loop45 + PhasedTTT — val_bpb 1.06549#1736

Open
dexhunter wants to merge 3 commits into openai:main from dexhunter:dexhunter/caseops-gatedattn-quantgate-1.06549

Conversation

@dexhunter
Contributor

Summary

3-seed results (8×H100 80GB SXM, 10-min train / 10-min eval budgets)

| Seed | Steps | Pre-TTT BPB | Post-TTT BPB | Artifact (bytes) | train_time | eval_time |
| ---- | ----- | ----------- | ------------ | ---------------- | ---------- | --------- |
| 42   | 4854  | 1.07847     | 1.06610      | 15,978,834       | 596.18s    | 396.9s    |
| 0    | 4843  | 1.07719     | 1.06473      | 15,971,476       | 596.17s    | 399.3s    |
| 1234 | 4847  | 1.07811     | 1.06563      | 15,975,050       | 596.08s    | 395.5s    |
| Mean | 4848  | 1.07792     | 1.06549      | 15,975,120       | 596.14s    | 397.23s   |
| Std  |       | 0.00066     | 0.00070      | 3,698            | 0.06s      | 1.9s      |

All three seeds clear size, train-time, and eval-time budgets with substantial headroom. 3-seed std is 0.00070 BPB — well inside the 0.005 significance floor.

Key innovation — CaseOps tokenizer + byte sidecar

CaseOps is a bijective, character-level text transform that removes English capitalization from the body of the text and records it as four operator tokens (TITLE, ALLCAPS, CAPNEXT, ESC) that become SentencePiece user_defined_symbols. Because the transform is fully invertible (decode(encode(s)) == s), no information is lost, and BPE merges allocate vocabulary to content rather than to case variants. The submission ships with a per-token byte sidecar (fineweb_val_bytes_*.bin, uint16, parallel to the val shards) so that BPB is computed on the ORIGINAL pre-transform UTF-8 bytes, not on the transformed representation: the score is on the same FineWeb text, just with a different tokenization front end.
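To make the mechanism concrete, here is a minimal sketch of a bijective case-factoring transform of this kind. The marker spellings are placeholders, and the real lossless_caps.py also uses TITLE and ALLCAPS as run-level compressions of the same idea; only CAPNEXT and ESC are shown here.

```python
CAPNEXT = "\x01"   # "the next character was uppercase in the original"
ESC     = "\x02"   # escapes a literal marker character occurring in the source

def _safe_upper(c: str) -> bool:
    # Only factor case when lower -> upper round-trips to the same single
    # character, so decode(encode(s)) == s holds for every input.
    low = c.lower()
    return c.isupper() and len(low) == 1 and low.upper() == c

def encode(text: str) -> str:
    out = []
    for c in text:
        if c in (CAPNEXT, ESC):
            out.append(ESC + c)              # protect literal marker chars
        elif _safe_upper(c):
            out.append(CAPNEXT + c.lower())  # record case as an operator
        else:
            out.append(c)
    return "".join(out)

def decode(text: str) -> str:
    out, i = [], 0
    while i < len(text):
        c = text[i]
        if c == ESC:
            out.append(text[i + 1]); i += 2      # literal next char
        elif c == CAPNEXT:
            out.append(text[i + 1].upper()); i += 2  # restore uppercase
        else:
            out.append(c); i += 1
    return "".join(out)

for s in ["Hello WORLD", "McDonald's", "ABC\x01def"]:
    assert decode(encode(s)) == s   # bijectivity: round-trip is exact
```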

Rule compliance

  • Artifact ≤ 16,000,000 bytes DECIMAL (README FAQ + Issue #1017 "A Field Guide to Valid Submissions" §II.1): ✅ all seeds ≤ 15,978,834.
  • train_time ≤ 600s (README line 6): ✅ all seeds ≈ 596.1s.
  • total_eval_time ≤ 600s (README FAQ, separate budget): ✅ all seeds 395.5–399.3s.
  • Score-first TTT (Issue #1017 Condition 3): ✅ phased TTT snapshots the pre-update score on each chunk before the LoRA adapter step, with a per-doc LoRA reset between documents (sketched after this list).
  • BPB on original bytes (Issue #1017 §V): ✅ the per-token byte sidecar records the canonical UTF-8 byte count of each val position.
  • No val data in training: ✅ training uses only fineweb_train_*.bin shards.
  • Reproducible: prepare_caseops_data.py is deterministic given the input FineWeb doc stream.
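A minimal sketch of the score-first ordering referenced in the TTT bullet above. Names like `model.loss`, `lora.reset`, and the chunking are placeholders; the real logic lives in train_gpt.py's phased-TTT eval path.

```python
import torch

def eval_doc_score_first(model, lora, opt, doc_tokens, chunk_len):
    """Score-first TTT on one document: each chunk is scored with the
    adapter state from *before* that chunk's update, so no token is ever
    evaluated by weights that have already trained on it."""
    lora.reset()                              # per-doc LoRA reset
    nll_sum = 0.0
    for chunk in doc_tokens.split(chunk_len):
        with torch.no_grad():                 # 1) snapshot the pre-update score
            nll_sum += model.loss(chunk, lora=lora).item() * chunk.numel()
        loss = model.loss(chunk, lora=lora)   # 2) only then adapt on the chunk
        opt.zero_grad(); loss.backward(); opt.step()
    return nll_sum                            # feeds loss_sum for val_bpb
```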

Test plan

  • Organizer reviews submission folder contents (train_gpt.py, prepare_caseops_data.py, tokenizer .model, 3 seed logs, submission.json, README.md, lossless_caps.py).
  • Organizer runs prepare_caseops_data.py to generate CaseOps shards + val byte sidecar.
  • Organizer reproduces at least one seed: SEED=42 CASEOPS_ENABLED=1 GATED_ATTN_QUANT_GATE=1 ... torchrun --standalone --nproc_per_node=8 train_gpt.py (full env in README).
  • Reproduced quantized_ttt_phased val_bpb matches the logged 1.06610 (±0.0007) within seed noise.
  • Artifact size, train_time, total_eval_time all within budgets on re-run.

Lineage

… — val_bpb 1.06549

3-seed mean 1.06549 (std 0.00070) on 8×H100 SXM, all gates green:
- artifact 15,975,120 bytes mean (≤16,000,000 DECIMAL)
- train_time 596.14s mean (≤600s)
- total_eval_time 397.23s mean (≤600s)

Builds on PR openai#1530 SP8192 stack. Adopts CaseOps (lossless_caps_caseops_v1)
bijective case preprocessing from PR openai#1729 with a per-token byte sidecar
so BPB is scored on original pre-transform UTF-8 bytes. Adds a learned
attention out-gate (init_std=0.005) + quant-gate scaling that recovers
the ~40 KB of overhead introduced by the new control tokens, keeping
every seed under the 16 MB decimal cap.

Seeds: 42 (1.06610), 0 (1.06473), 1234 (1.06563).
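For readers following the lineage, a learned attention out-gate of the kind the commit message describes might look like the sketch below. The near-identity multiplicative parameterization is an assumption; only init_std=0.005 comes from the PR text.

```python
import torch
import torch.nn as nn

class AttnOutGate(nn.Module):
    """Learned per-channel gate on the attention output. With
    init_std=0.005 the gate starts very close to identity, so early
    training matches the ungated model."""
    def __init__(self, dim: int, init_std: float = 0.005):
        super().__init__()
        self.g = nn.Parameter(torch.randn(dim) * init_std)

    def forward(self, attn_out: torch.Tensor) -> torch.Tensor:
        return attn_out * (1.0 + self.g)  # near-identity at init
```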
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 19, 2026
…ad, MP-SGD TTT 4-phase

- PR openai#1698 (GDN FLA, claimed 1.00995): BPB bug confirmed by dexhunter
  (~1.189 actual) + artifact size violation; effectively dead
- New technique: CaseOps bijective tokenizer (PR openai#1729/openai#1736/openai#1738) —
  reversible case-factoring with byte sidecar; stronger legality than
  casefold; await Issue openai#1604 ruling
- PR openai#1735 (pre-quant TTT 21ep) flagged illegal by dexhunter; PR openai#1738
  builds on it, both likely void
- PR openai#1727 (MP-SGD TTT 4 phases, 1.07217): appears legal, stackable
- Merged SOTA 1.0810 Day 10 plateau; 11 days to deadline

https://claude.ai/code/session_012mo6412sGQRVjF7TDmfx31
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
Bulk import of dexhunter's openai#1736 unmerged submission
(openai#1736, commit e100586) for reproduction as our
new research baseline. Source: records/track_10min_16mb/
2026-04-19_SP8192_CaseOps_GatedAttn_QuantGate_Loop45_PhasedTTT/.

9 files, ~6856 lines:
- train_gpt.py (training script)
- lossless_caps.py (bijective CaseOps transform)
- prepare_caseops_data.py (data retokenization script)
- fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model (SP tokenizer)
- README.md, submission.json, 3 per-seed training logs

No modifications to repo-root files. Spec: research/specs/008-1736-reproduction.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
After 2026-04-19 frontier scan, rebasing the research baseline from
merged SOTA openai#1493 (1.0810) to unmerged PR openai#1736 (dexhunter, claimed
1.06549). Rationale: credible frontier moved ~0.015 bpb past merged
SOTA in 10 days via witnessed, legal levers (CaseOps tokenizer,
attn-out gate, phased TTT). Continuing off spec-000 leaves us behind
before we try anything.

- CLAUDE.md: baseline declared; baseline-migration specs land on
  research directly (exception to exp/<slug> convention).
- research/frontier-map.md: credibility filter + dependency map.
- diary/2026-04-19-frontier-{scan,map}.md: per-PR evidence base.
- research/ideas/1736-improvement.md: three-spec migration plan.
- research/specs/008-1736-reproduction.md: spec for the reproduction
  run, pinned to commit 154c9b8 (openai#1736 import at e100586).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
Collapse spec 008 to seed 42 only and add a one-line pre-GPTQ FP
checkpoint save at runs/008-1736-reproduction/seed_42/pre_gptq.pt
(env-var gated via SAVE_PRE_GPTQ=1 so the reproduction itself is
unaffected when the flag is off).

Rationale: SpinQuant and subsequent quant-family experiments are
purely post-training transforms, so hotstarting off a single
pre-GPTQ FP checkpoint is far cheaper than retraining per spec.
Single-seed comparison against openai#1736's seed-42 (1.06610, ±0.003)
is apples-to-apples for screening. Cost drops ~$40 -> ~$17 for
this spec and ~$10 -> ~$1–2 per downstream quant experiment.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
First lever layered on the new openai#1736 baseline. Hadamard rotation of
weight matrices before GPTQ quantization, hotstarted off spec 008's
pre_gptq.pt FP checkpoint. No retraining.

Witnessed at a claimed -0.005 bpb on PR openai#1695 (X-Abhishek-X) on an
openai#1529-adjacent base; expected to compose cleanly with openai#1736 since
the quant stage is orthogonal to CaseOps / attention gates / phased
TTT. Rotation is a post-training transform with three classes
(residual-stream, per-layer attn, per-layer MLP); FP forward pass is
invariant by construction, only quantization error drops.

Cost ~$6 (hotstart off spec 008 checkpoint), vs ~$30 for a full
retrain. Same hotstart checkpoint reused by future quant
experiments (per-group bit, AR-selfgen calib, AWQ).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
Based on reading train_gpt.py at commit 154c9b8:

Good: RMSNorm is gamma-free (line 529), so the usual gamma-fold step
doesn't apply. RMSNorm is rotation-equivariant directly.

Bad: openai#1736 has five OTHER per-channel multipliers on residual flow
(attn_scale, mlp_scale, resid_mix, skip_weights, skip_gates). These
are the real fold targets, not RMSNorm. resid_mix is pre-norm and
cannot be cleanly folded.

Split into three SpinQuant modes selectable by SPINQUANT_MODE:
- internal_only (R_a, R_m per layer; no residual rotation)
- full (internal + R0, with attn_scale/mlp_scale/skip folds and
  resid_mix freeze-to-mean compromise)
- port_1695 (conditional on openai#1695 diff being meaningfully different)

All three run back-to-back on one pod hotstarted off spec 008's
final_model.pt. ~30 min total GPU, ~$17-22 budget, one eval.

research/ideas/spinquant-integration-notes.md captures the full
design analysis (per-multiplier fold feasibility, three-option
tradeoff, shared-code plan).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
Added SPINQUANT_MODE=baseline as a fourth variant that applies no
rotation — just loads final_model.pt, runs serialize/deserialize/
eval/TTT on it. Two purposes:

1. Closes the loop on spec 008's missed post-TTT number (watcher
   stopped the pod before the TTT eval ran). No separate $3
   eval-only rerun needed.
2. Provides the apples-to-apples local reference for measuring the
   three SpinQuant variants' Deltas — removes any cross-pod bf16
   drift from the comparison.

Order: baseline -> internal_only -> full -> port_1695, sequential on
one pod. Gate: if baseline lands outside openai#1736's 1.06610 +/- 0.003,
halt before running rotations (means spec 008 reproduction is off).

Total cost ~$27 (was $22); absorbs ~$3 of otherwise-separate eval
rerun, so net increment is ~$2 for four measured numbers.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
Two new files in the openai#1736 submission dir:

spinquant_hotstart.py (~360 LOC):
- Imports from train_gpt.py for Hyperparameters/GPT/serialize/deserialize/
  eval_val/eval_val_ttt_phased/BatchedTTTLoRA/etc.
- Modes: baseline, internal_only (R_a only, per-layer per-KV-group, d_head
  rotation on V-output and O-input).
- full, port_1695 are stubs — raise NotImplementedError with explanation.
- Pipeline: load FP state_dict from HOTSTART_FP_CKPT -> apply rotations
  in-place on banked qo_bank/kv_bank -> optional pre-quant diagnostic eval
  -> call serialize() (GPTQ+compress) -> deserialize() -> quantized eval
  -> phased TTT eval -> write final.json.
- Reproduces the TTT eval block from train_and_eval (lines 2997-3075) in
  _run_ttt_eval() rather than refactoring the source file.

test_rotation_invariance.py (~250 LOC):
- CPU-only, standalone (no train_gpt.py import due to flash_attn_3/triton
  module-level deps).
- Self-contained minimal attention forward: Q/K/V projection from the
  banked tensors, RMSNorm on Q and K (matches real model's bound on
  attention logits; without this, trained weights saturate softmax and
  float noise in V amplifies catastrophically).
- Tests baseline (bit-exact identity) and internal_only (rel tolerance
  1e-4) against either synthetic random weights or spec 008's
  final_model.pt. Both pass cleanly (rel_max ~1e-6 on real checkpoint).
- Can load either banked (qo_bank/kv_bank) or unbanked
  (blocks.N.attn.*.weight) state_dict format.

Spec 009 updated: reduced scope to 2 modes (baseline, internal_only) for
this session; full and port_1695 deferred. Rationale in the spec: MLP
LeakyReLU-squared breaks R_m float-invariance, resid_mix can't be cleanly
folded through RMSNorm, both needing design before implementation.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
Cleanup pass to resolve inconsistencies between the spec and what's
actually in spinquant_hotstart.py + test_rotation_invariance.py:

- Title + scope: 2-mode sweep (baseline, internal_only); full and
  port_1695 explicitly deferred to a follow-up spec.
- Checkpoint path: pre_gptq.pt (what execution's spec-008 patch
  produced, after _unbank_state_dict), not final_model.pt.
- Accept criteria: preflight via test_rotation_invariance.py
  (ALL TESTS PASS), then per-mode on pod.
- Rotation structure: trimmed to just the implemented R_a class
  with exact banked-tensor indexing. R_0 / R_m / skip-stream /
  RMSNorm-fold sections moved to 'not implemented (deferred)'.
- RMSNorm-fold section removed entirely: openai#1736's RMSNorm is
  gamma-free (F.rms_norm with no weight arg), so no fold needed.
- Code-changes section: points at the files on disk instead of
  TODO pseudocode.
- Execution protocol: 2 modes back-to-back on 8xH100, explicit
  preflight step.
- Hardware ladder: 8xH100 required (phased TTT is 8-rank DDP).
- Cost estimate: ~$15 total for 2 modes.
- Open questions: reframed around unbanked-checkpoint load,
  bf16 drift, GPTQ interaction, phased-TTT compatibility.
- What this spec does NOT do: clarified that residual rotation,
  R_m, resid_mix, and port_1695 are all deferred.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
…approach

Read openai#1695's diff. Their approach is fundamentally different from
the static-weight-rotation + folds design I had in mind for 'full'
mode. They do ONLINE activation rotation: 4 global Hadamard rotations
inserted as x @ R matmuls at 4 forward-pass sites (residual->qkv,
attn-out->proj, residual->fc, hidden->proj). GPTQ then quantizes in
the rotated basis; rotated Hessians keep the quant-side accounting
honest. Rotations OFF during training, ON after deserialize for
eval+TTT.

Why this matters: their scheme sidesteps BOTH blockers that made the
full mode complicated:
- LeakyReLU non-equivariance: R_mlp_proj_in is applied AFTER the
  LeakyReLU-square, not across it.
- resid_mix: rotations are per-linear-input, never touch the
  residual stream. All per-channel multipliers (attn_scale,
  mlp_scale, resid_mix, skip_weights) operate in unchanged basis.

No float invariance — the model IS different post-rotation. The bet
is that the rotated-basis GPTQ delivers lower quant error and that
the perturbation is smaller than the savings.

Implication: deprecate the 'full' static-rotation-with-folds plan in
favor of a future 'port_1695' spec that ports their online scheme.
Internal_only mode from spec 009 remains useful as an independent
data point (R_a only, fp-invariant).

Spec 010 (tapered WD) drafted as an independent parallel track:
- Ports PR openai#1729's WD_TAPER_START_FRAC=0.70, WD_TAPER_FINAL_MULT=0.50
- Muon-WD-only taper on top of openai#1736's existing schedule
- Full retrain on 8xH100, single seed, ~$20
- Independent of spec 009 (different pod, no shared state)
- Can run in parallel with 009's eval-only sweep

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
…n sprint

Session-narrative entry covering today's work:

- Frontier filter + baseline migration from merged SOTA openai#1493 (1.0810)
  to unmerged openai#1736 (1.0655), rationale, CLAUDE.md update.
- Spec 008 run partial result (training reproduced openai#1736 within
  +0.00016 at pre-quant; post-TTT gate number not captured due to
  watcher bug; projected pass ~1.06626).
- Spec 009 design evolution through three scope cuts: 4 modes ->
  unified sweep -> +baseline mode -> cut to 2 modes after discovering
  real architectural blockers (MLP LeakyReLU breaks R_m, resid_mix
  doesn't fold cleanly).
- openai#1695 diff discovery: they do online activation rotation, not
  static weight rotation. Sidesteps both LeakyReLU and resid_mix.
  Reframes 'full' mode -> port_1695 mode as the next quant-side
  spec.
- Specs 010 (port_1695, design only) and 011 (tapered WD, design
  only) drafted. Only spec 009 is truly runnable right now.

Closes with state-of-play table, modal plan, lessons-learned, and
open questions for next session.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
Implements the port_1695 SpinQuant variant from PR openai#1695 onto the
openai#1736 stack. All changes env-var-gated (SPINQUANT_ENABLED=0 default)
so spec 008 and spec 009's baseline/internal_only modes are
unaffected bit-for-bit.

train_gpt.py changes (+247 lines):
- import hashlib
- Hyperparameters.spinquant_enabled, spinquant_seed
- CastedLinear._sq_active class flag (default False)
- Utility block: _stable_seed, _hadamard_rotation, install_spinquant_
  rotations, _SQ_KEY_TO_TAG, _spinquant_rotate_sd_and_H
- 4 forward-path hook sites (2 each in CausalSelfAttention,
  MLP, _block_with_lora, _parallel_block_with_lora):
  - pre-QKV: x_qkv = x @ R_attn_in
  - pre-attn-proj: y @ R_attn_proj_in
  - pre-fc: x @ R_mlp_in
  - post-activation pre-proj: hidden @ R_mlp_proj_in
- serialize(): call _spinquant_rotate_sd_and_H after Hessian collection
  and before GPTQ. Rotates weights (W @ R) and Hessians (R.T @ H @ R).
- deserialize(): install_spinquant_rotations + set _sq_active=True
  after loading rotated weights.
- MLP.forward: disable fused kernel when SpinQuant active.
- LoRA (TTT path) uses unrotated n, base path uses rotated n_qkv.

spinquant_hotstart.py changes:
- port_1695 mode no longer raises NotImplementedError. Sets
  h.spinquant_enabled=True and h.spinquant_seed; train_gpt.py's
  machinery does the rest.

Math: orthogonal R means R @ R.T == I, so x_rot @ W_rot.T =
(x @ R) @ (W @ R).T = x @ R @ R.T @ W.T = x @ W.T. The pre-quant
forward pass is identical to the unrotated one in exact arithmetic;
GPTQ sees the rotated basis, where outliers are spread more evenly
and quantization error drops.
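A quick numerical check of this identity (standalone sketch; the normalized Hadamard construction below is the standard Sylvester recursion and is an assumption about how the rotations are built):

```python
import torch

def hadamard(n: int) -> torch.Tensor:
    """Normalized Hadamard matrix via Sylvester's construction (n a power
    of 2). The result is orthogonal: H @ H.T == I."""
    H = torch.ones(1, 1)
    while H.shape[0] < n:
        H = torch.cat([torch.cat([H, H], 1), torch.cat([H, -H], 1)], 0)
    return H / (n ** 0.5)

d = 64
R = hadamard(d).double()
x = torch.randn(8, d).double()
W = torch.randn(d, d).double()

y_plain = x @ W.T
y_rot = (x @ R) @ (W @ R).T      # = x @ R @ R.T @ W.T = x @ W.T
assert torch.allclose(y_plain, y_rot, atol=1e-9)  # equal up to float error
```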

Spec 010 doc updated to reflect the implementation state. Execution
runs via SPINQUANT_MODE=port_1695 on spinquant_hotstart.py.

Not tested on GPU — flash_attn_3 not available on the dev box.
Syntax clean. First pod run will verify end-to-end behavior.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
Continuation of the morning diary. Covers:

- Spec 009 baseline closed spec 008's gate at 1.06728 (matches
  openai#1736's 1.06610 within bf16 noise). internal_only null (+0.00003).
- Spec 010 port_1695 also null aggregate (-0.00005), BUT per-batch
  analysis revealed a striking regime-dependent effect: rotation
  helps long-context docs (-0.0064 bpb on dl>1000) and hurts
  short-context docs (+0.0146 on dl<300). The null is a
  cancellation, not an absence of effect.
- 'TTT substitutes for rotation' hypothesis revised — the rotation
  Delta is ~0 at both pre-TTT and post-TTT stages. What rotation
  actually does is shift where in the doc-length distribution the
  model is strong, without changing the aggregate.
- Designed + implemented spec 010b (SPINQUANT_SITES env var) to
  isolate which sites (attn vs MLP) carry the help vs hurt. Ready
  for execution, ~$25.
- Lessons: look at per-batch trajectory data before concluding a
  null is null. Length-sorted running averages are systematically
  biased. Don't pivot prematurely from a signal you haven't fully
  interrogated.

Still $163 under project budget.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
Closes the SpinQuant investigation arc with spec 010b's results and
an honest retrospective on the false-signal episode.

Key findings:
- All 5 SpinQuant variants (baseline, internal_only, port_1695,
  attn_only, mlp_only) land within 0.00009 bpb at final val_bpb.
  Pure null. openai#1736 has seed std ~0.00070; we are 10x below that.
- Pt.2's "regime-dependence is exploitable" hypothesis refuted.
  attn_only ≈ baseline on rank 0 (attention rotation does nothing);
  mlp_only has inverse regime from port_1695 (hurts long, helps
  short); neither subset comes close to port_1695's emergent
  rank-0 trajectory lead.
- Rank-0 rb spread across variants: 0.0075 bpb.
  Final val_bpb spread across variants: 0.000085 bpb.
  80x compression from 8-rank aggregation + TTT LoRA uniform
  absorption.

Mistake I owned up to: read rank-0 rb:1.0657 for mlp_only at batch
780 and suggested "mlp_only might actually net positive." Final.json
came out +0.000005 above baseline. Rank-0 rb is rank 0's 1/8 slice,
not a preview of the submission number.

Methodology corrections for future runs:
- Always check final.json before any trend interpretation
- Rank-0 rb is a progress indicator, not a metric preview
- When pre-TTT diagnostic_quantized spread < 0.001, post-TTT will
  be near-identical (TTT LoRA dominates)

Budget: spent ~$52 of $200 total. 10 days left.

Next: spec 011 (tapered Muon WD retrain) — upstream of TTT, might
unlock something TTT can't absorb. Patch still unwritten.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
Scanned 200 PRs from 2026-04-11..20. After exclusion filters, 3 candidates
beat spec 011's expected Δ: openai#1682 (GradPower Muon p=0.9), openai#1648 (xIELU +
per-layer QK gain), openai#1555 (Tap-In eval cache). Full artifacts at
~/competition-pr/pr-scan-2026-04-20/.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
…1716)

Two orthogonal training-time levers queued behind spec 011:

- bpb-weighted-loss.md (port openai#1519): weight CE by UTF-8 bytes per token.
  Aligns training objective with eval metric. Risk: SP8192 vocab
  destabilization (author warns on large vocabs) + CaseOps byte LUT
  accounting (~1hr of careful code).

- bigram-hash-embed.md (port openai#1716): 16384×32 hash-table bigram embed
  added to token embedding pre-block-0. ~540K params / ~400KB artifact.
  openai#1736 genuinely lacks this despite prevalence in competitive lineages.

Recommended sequencing: 011 → 012 (QK) → 013 (BigramHash, lower risk)
→ 014 (BPB-weighted, higher risk).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
110 LOC pure addition to train_gpt.py, fully env-gated by
BIGRAM_HASH_ENABLED=0/1. Default-off invariant: with env unset the
forward pass, state_dict, and optimizer param list are byte-identical
to baseline.

Components:
- BigramHashEmbedding(nn.Module): embed(buckets, dim) + CastedLinear
  proj(dim, model_dim). proj._zero_init=True -> identity at step 0.
  Hash: ((prime_a * curr) ^ (prime_b * prev)) % buckets. Position-0
  fallback: prev = curr (self-bigram). Cross-doc leakage not special
  cased, matching openai#1736's SmearGate convention.
- GPT.__init__: creates self.bigram_embed when enabled else None.
- forward_logits + forward_ttt: additive merge of bigram(input_ids)
  to tok_emb(input_ids) before SmearGate. attr-guarded.
- Optimizers: embed.weight -> AdamW optimizer_tok (embed_wd), proj.weight
  -> Muon matrix_params.
- GPTQ hessian hooks: bigram_embed.embed output -> (dim,dim) hessian;
  bigram_embed.proj input -> (dim,dim) hessian (proj is <=65536 numel
  so fp16 passthrough; harmless hook).
- Startup log line echoing config.

Sizing: 16384*32 int6 embed ~= 393KB. 512*32 fp16 proj = 32KB.
Total ~425KB added to artifact; budget dry-run needed before launch.

Env vars (defaults): BIGRAM_HASH_ENABLED=0, BIGRAM_HASH_BUCKETS=16384,
BIGRAM_HASH_DIM=32, BIGRAM_HASH_PRIME_A=36313, BIGRAM_HASH_PRIME_B=27191.
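A self-contained sketch of the module described above. The hashing scheme, default sizes, and zero-init projection come from the commit text; the class shape and integration point are simplified relative to train_gpt.py.

```python
import torch
import torch.nn as nn

class BigramHashEmbedding(nn.Module):
    """Hash-bucketed bigram embedding added to the token embedding.
    proj is zero-initialized, so the module is an exact identity at step 0."""
    def __init__(self, buckets=16384, dim=32, model_dim=512,
                 prime_a=36313, prime_b=27191):
        super().__init__()
        self.buckets, self.pa, self.pb = buckets, prime_a, prime_b
        self.embed = nn.Embedding(buckets, dim)
        self.proj = nn.Linear(dim, model_dim, bias=False)
        nn.init.zeros_(self.proj.weight)          # identity at init

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        prev = torch.roll(input_ids, 1, dims=-1)
        prev[..., 0] = input_ids[..., 0]          # position-0 fallback: self-bigram
        h = ((self.pa * input_ids) ^ (self.pb * prev)) % self.buckets
        return self.proj(self.embed(h))

# usage: x = tok_emb(ids) + bigram(ids)   # additive merge before block 0
```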

Bug lesson learned from exp/training-bundle commit 8d54854: when Edit's
old_string only captures part of a for-loop body, trailing loop
statements get pushed outside the loop and may be absorbed by nearby
conditional blocks. This patch is a pure prepend/append style (no
splits of existing blocks) so that failure mode is avoided.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
Compiled reference list for architecture-side research thread, including:

- XSA identified as Exclusive Self-Attention (Apple, arXiv 2603.09078).
  Matches openai#1736's _xsa_efficient exactly.

- Universal Transformer (Dehghani 2018), ACT (Graves 2016) as
  foundational recurrence references.

- Key 2025 finding from ILR paper (arXiv 2505.01855): allocating more
  iterations to EARLIER layers yields optimal results. openai#1736's Loop45
  (middle layers) may be sub-optimally positioned.

- Parallel residuals literature: GPT-J / PaLM well-studied, multi-lane
  variants (Branchformer etc.) mostly in vision, thin in NLP.

- Synthesis of candidate variants prioritized by novelty × EV × cost.

- Proposed next step: instrument openai#1736 to log cross-pass cosine
  similarity during training. If high → cross-pass XSA worth trying.
  If already low → different variant needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
Added section on 'when to activate recurrence' research. Key findings:
- ProRes, SGT, Staged Training all recommend progressive/curriculum
  activation over hard switches
- Literature has conflicting claims about WHERE convergence happens
  first (shallow vs deep layers)
- Consistent claim: progressive beats hard switch for stability
- openai#1736's enable_looping_at=0.35 is a hard switch — suboptimal per lit

Candidate variants identified, ranked by implementation cost:
env-var sweeps (1,2) vs code-change ramps (3,4).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
… candidates

User shared a deep timeline of all recurrence experiments in the
PG competition (openai#8 through openai#1739). Several of my previously-proposed
experiments have ALREADY BEEN TESTED ON THIS STACK and shown to fail:

KILLED:
- Timing sweep earlier: openai#1726 showed 0.15 is +0.050 worse; openai#1739
  showed step-0 catastrophic (1.3936 bpb)
- Progressive ramp: openai#1663 showed hard-onset = smooth, no difference
- Position shift: openai#1726 showed layer 2-7 +0.163 worse, layer 5-6 shift
  +0.006 worse — layer 3-5 IS the empirical sweet spot

Also corrected the baseline config: openai#1736 uses LOOP_START=3 LOOP_END=5
(three layers: 3, 4, 5 — "Loop345"), not Loop45 as the directory name
suggests. 3 layers × 3 passes yields 17 virtual layers overall.

VIABLE candidates:
- Recur-Alpha (openai#1714, Anakintano): learnable scalar per looped block,
  init 0 → identity. 6 params. Author's grant ran out before TTT eval
  so composition with openai#1736's phased TTT is genuinely open. NEW TOP PICK.
- Cross-pass XSA: still novel, untested in any PR
- Loop3-6 variant (openai#1678): tashapais running it; might wait for result

Recommendation updated: port Recur-Alpha onto openai#1736 as spec 015.
~$25, identity-at-init (safe), 30 LOC, direct recurrence question.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
Shelving actions:
- Wrote research/evaluations/014-bpb-weighted-loss.md with full
  rationale and revisit criteria (post-deadline only)
- Added SHELVED status banner to top of the spec file
- Added experiments.md row marking 014 as 🗄️ SHELVED (permanently)

Decision: do NOT retune. Magnitude too large (+0.0619 = 62× shelve
threshold) to be recoverable via LR sweep. Three-null pattern (011,
013, 014) confirms that incremental ports from different-stack authors
do not transfer to openai#1736. Moving budget to spec 015 (Recur-Alpha).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
Replica of spec-000-era lr_schedule.py for openai#1736/spec-015's stack.
Shows all four training-time schedules on one figure:

  1. lr_mul (warmdown)       — wallclock-based, starts at step 1207
  2. effective LR            — MATRIX_LR × lr_mul, concrete numbers
  3. Muon momentum           — step-based warmup, plateau at step 1500
  4. looping_active          — hard switch at step 1690 (wallclock 35%)

Key non-obvious finding: warmdown (step 1207) begins BEFORE looping
activates (step 1690). When recurrence kicks in, LR is already ~17%
decayed. This sequencing is baked into openai#1736's defaults.

Five distinct training regimes:
- [0, 1207]:    muon momentum warming, nothing else changing
- [1207, 1500]: warmdown begins, muon still warming
- [1500, 1690]: warmdown continues, muon plateau, looping still off
- [1690]:       looping activates (architectural change)
- [1690, 4828]: all settled, just linear LR decay
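As a step-based approximation of these schedules (the boundary steps come from the text above; the linear forms and the momentum endpoints are assumptions):

```python
def lr_mul(step, warmdown_start=1207, end=4828):
    """Warmdown (wallclock-based in the real script, steps here):
    1.0 until warmdown_start, then linear decay to 0 at the final step."""
    if step < warmdown_start:
        return 1.0
    return max(0.0, 1.0 - (step - warmdown_start) / (end - warmdown_start))

def muon_momentum(step, plateau_step=1500, lo=0.85, hi=0.95):
    """Step-based warmup to a plateau at step 1500 (lo/hi assumed)."""
    return lo + (hi - lo) * min(1.0, step / plateau_step)

def looping_active(step, switch_step=1690):
    """Hard switch: recurrence off before step 1690, on after."""
    return step >= switch_step

# At the looping switch the LR is already partly decayed:
print(1.0 - lr_mul(1690))  # ~0.13 in this step-based sketch; the commit
                           # reports ~17% on the wallclock schedule
```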

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 21, 2026
- diary/2026-04-21-recur-alpha-findings.md — full story of specs 015/016
  single-seed screens: α trajectories side-by-side, 5 findings (α>1 on
  pass-2, <1 on pass-3 at depth, depth-monotonicity inverts between
  passes, plateau is path-dependent, late-training rate unchanged), full
  caveats section, ranked next steps.

- research/ideas/beating-1736-note.md — four-run throughput + pipeline
  comparison (008/015/016/openai#1736). Works backward from target 1.06610 to
  a 0.00183 gap on pre-quant post-EMA; matched-throughput alone gives 3.3×
  margin over the gap. Risk ranks TTT composition as the one unknown
  (GPTQ cost is validated at +0.00947 parity). Concludes: single matched-
  clock NA run with bug-fixed TTT pipeline (~$10-15) settles the whole
  story.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 21, 2026
Primary submission-candidate run for recur-alpha family. Same commit as
016 (4dd2d63); NA 8xH100 to eliminate JP throughput variance; full
training + GPTQ + phased-TTT pipeline end-to-end (no EVAL_ONLY_CHECKPOINT
bypass that OOM'd in 016 post-hoc).

Goal: post-TTT val_bpb <= 1.06550 (beat openai#1736's 1.06610 by >= 0.0005).

Runs regardless of 016b's throughput-tax outcome:
- If no tax: high-confidence attempt at openai#1736 beat
- If tax: diagnostic for TTT x recur-alpha composition
- Either way we capture the post-TTT number that 016 post-hoc missed

Single seed 42 first, 3-seed conditional on clear-promote bucket. Costs
~\$10 single-seed, ~\$30-34 with 3-seed confirmation. Includes conditional
decision tree on 016b branches and tok/s-logging requirements for direct
throughput comparison with 016b's 2xH100 data.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 21, 2026
…016 full pipeline"

NA-1 has no 8xH100 capacity today. Reframe spec 017 as: run spec 016's
commit (4dd2d63) with full training + GPTQ + phased-TTT pipeline end-to-
end on whichever region has capacity (JP is fine). Primary purpose is
capturing the post-TTT val_bpb that 016's screen (killed early) and 016
post-hoc TTT eval (OOM'd) both missed.

On JP expected post-TTT ~1.0679-1.0682 — close to but probably not
beating openai#1736's 1.06610. Still worth it: real composition measurement
replaces the projection chain.

Path fixes: JP volume jlxvxeiol4 mounts at /runpod (not /workspace);
example launch command rewritten accordingly. Memory entry added to
cross-session reference.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@codemath3000

Running prepare_caseops_data.py as published, then running train_gpt.py with PHASED_TTT_ENABLED=1, reproducibly raises ZeroDivisionError: float division by zero at train_gpt.py:2303 in _loss_bpb_from_sums: byte_sum.item() is 0 because _find_docs (line 2209) returns an empty list. The prep script never inserts BOS markers, and the tokenizer reserves IDs 0–7 (<pad>, <s>, </s>, <unk>, and the four CaseOps operators), so sp.encode can never naturally output ID 1. The training loop has a fallback at _init_shard lines 408–409 (if self.bos_idx.size == 0: self.bos_idx = np.array([0], ...)), so training completes, but the phased TTT eval path has no analogous fallback. Am I missing a prep step, or should prepare_caseops_data.py be prepending bos_id=1 to each doc (matching download_hf_docs_and_tokenize.py:364–366)?

leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 21, 2026
Submission-quality test of constant-α (017 endpoint values) with
full training + GPTQ + phased-TTT pipeline. Pins commit 2895db3 on
exp/recur-alpha-constant-full, which extends 018c's constant-α
wiring to the TTT forward path.

Target: beat openai#1736's 1.06610 post-TTT. Expected range 1.0650-1.0675
based on 018c's 92% throughput recovery + TTT bug fix. Single seed
42 first, 3-seed conditional on clear promote.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 21, 2026
- 4,697 steps (vs 4,828 for 008) due to slow JP node, not constant-α overhead
- Per-step quality strictly better than 008/017 at matched steps
- Linear extrapolation to step 4828 → post-TTT ~1.0606 (beats openai#1736)
- Recommendation: rerun on NA-1 pod

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 21, 2026
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 22, 2026
Reverts the frozen-α container (buffer or Parameter(requires_grad=False))
back to the learnable Parameter form of 017. Combines 017's recipe with
the 021e stack's TTT α fix (931bd7c) and algebraic blend form (d761a22) —
both of which 017 was missing.

Motivation:
- 017's pre-quant post-EMA (1.06861) was the BEST of any 8H run in this
  session. All frozen-α variants (019b at 1.06951, 021e at 1.06944, etc.)
  land ~0.0008-0.001 worse.
- 017's post-TTT (1.06733) was held back by the TTT α bug (α not applied
  in forward_ttt). Fixing this should recover ~0.002 of TTT delta.
- Algebraic blend form (matches 019b-original's kernel pattern) adds
  another potential 0.001-0.003 improvement.
- Combined projected post-TTT: 1.07781 - 0.01249 = 1.06532 → decisive
  beat of openai#1736 by 0.00078 and 019b by 0.00096.

Implementation: 3-line change on top of 021f (0ad5269). Remove the
register_buffer + endpoint-tensor construction, replace with
nn.Parameter(torch.ones(...), requires_grad=True). Optimizer guard
already handles requires_grad=True correctly (α re-enters scalar_params).

dtype=bfloat16 retained from 021e stack (vs 017's original fp32) for
blend kernel consistency; no cast needed at blend time.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@dexhunter dexhunter changed the title Record: SP8192 + CaseOps + GatedAttn + QuantGate + Loop45 + PhasedTTT — val_bpb 1.06549 Record: SP8192 + CaseOps + GatedAttn + QuantGate + Loop4-5 + PhasedTTT — val_bpb 1.06549 (3-seed mean) Apr 22, 2026
@dexhunter dexhunter changed the title Record: SP8192 + CaseOps + GatedAttn + QuantGate + Loop4-5 + PhasedTTT — val_bpb 1.06549 (3-seed mean) Record: SP8192 + CaseOps + GatedAttn + QuantGate + Loop45 + PhasedTTT — val_bpb 1.06549 Apr 22, 2026
@dexhunter
Contributor Author

dexhunter commented Apr 23, 2026

@codemath3000 thanks for the reproducer — confirmed and patched. This is a prep-script bug only; training and the submitted metric are unaffected.

Root cause. Line 157 of prepare_caseops_data.py calls sp.encode(transformed) and appends directly to the shard buffer. The SP tokenizer reserves IDs 0–7 (<pad>, <s>, </s>, <unk>, TITLE, ALLCAPS, CAPNEXT, ESC), so sp.encode cannot emit BOS (ID 1) naturally. _find_docs at train_gpt.py:2209 then returns [] and _loss_bpb_from_sums at line 2303 divides by zero. Training survives via the _init_shard:408–409 fallback; phased TTT eval has no analogous fallback.

Scope. The submitted 1.06549 is on valid data — our seed runs used shards produced by a different internal prep path that already prepends BOS. val_bpb reduces to loss_sum / ln(2) / byte_sum (token counts cancel at line 2303) and byte_sum is unchanged with BOS prepended (BOS contributes 0 original bytes). The bug broke reproduction, not the number.
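Restating that reduction (the quantities are those named above; nothing new is being claimed):

```latex
\text{val\_bpb}
  = \frac{\text{loss\_sum}}{\ln 2 \cdot \text{byte\_sum}}
  = \frac{\sum_{t \in \text{val}} \mathrm{NLL}(t)}
         {\ln 2 \cdot \sum_{t \in \text{val}} \mathrm{bytes}(t)},
\qquad \mathrm{bytes}(\langle s \rangle) = 0,
```

so BOS positions contribute nothing to the denominator and the score on original bytes is unchanged.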

Fix. Prepend BOS_ID = 1 to each doc's tokens and append 0 to the byte-count sidecar for the BOS position:

# near module top, with other constants
BOS_ID = 1

# inside the per-doc loop (n_docs is the doc counter the script already maintains)
for text in _iter_docs(args.docs):
    transformed = encode_lossless_caps_v2(text)
    token_ids = [BOS_ID] + sp.encode(transformed, out_type=int)
    if n_docs < args.val_docs:
        byte_counts = _token_original_byte_counts(sp, text, transformed)
        val_buf_tokens.extend(token_ids)
        val_buf_bytes.append(0)  # BOS = 0 original bytes
        val_buf_bytes.extend(int(b) for b in byte_counts)
    else:
        train_buf.extend(token_ids)

Matches the canonical pattern in data/download_hf_docs_and_tokenize.py:364–366.

Pushed in commit d7263a3 on this branch (and fe7c309 on PR #1769, which ships the same prep script). README now includes a bos_count > 0 sanity check for the first val shard.

dexhunter added a commit to dexhunter/parameter-golf that referenced this pull request Apr 23, 2026
External reproductions of PR openai#1769 (and PR openai#1736) failed with
ZeroDivisionError in phased TTT eval because the shipped prep script
did not prepend the <s> control token (ID 1) to each doc. The SP
tokenizer reserves IDs 0-7 (pad/s/</s>/unk + 4 CaseOps operators),
so sp.encode cannot emit ID 1 naturally, and train_gpt.py:_find_docs
(line 2209) requires BOS markers with no fallback. Training itself
ran because _init_shard:408-409 falls back to bos_idx=[0] when no
BOS is found; phased TTT eval has no equivalent fallback.

Fix: add BOS_ID=1 constant, prepend to each doc's tokens, append 0
to the byte sidecar (BOS = 0 original bytes). Matches the canonical
pattern in data/download_hf_docs_and_tokenize.py:364-366.

The submitted 1.06453 metric is unaffected — val_bpb reduces to
loss_sum/ln(2)/byte_sum (token counts cancel) and byte_sum is
unchanged with BOS prepended. Our seed logs were measured on shards
that already had BOS markers from an internal prep path; the shipped
prep was the outlier.

Also adds a Reproduction sanity check section to README.md that
asserts bos_count > 0 on the first val shard.

Reported by @codemath3000 in PR openai#1736 comment 4285805497.
Seed logs (train_seed{0,42,1234}.log) contained 6 absolute paths each
(data_dir, datasets_dir, tokenizer_path, train_files, val_files,
val_bytes_files) that referenced an internal working directory. Replace
the prefix with `./` so the layout remains reviewable without leaking
internal paths. Code size unchanged across all 3 logs (131,887 bytes).

Also drop `PHASED_TTT_ENABLED=1` from the README Run command — this env
var is not read by train_gpt.py. The two phased-TTT env vars that ARE
read (PHASED_TTT_PREFIX_DOCS, PHASED_TTT_NUM_PHASES) remain. Phased TTT
is gated by the top-level TTT_ENABLED=1 which defaults to on.
