
Record candidate: 1.06032 CaseOps + Matrix-LR 0.028 + TTT n=1#1925

Open
simon-marcus wants to merge 3 commits into openai:main from simon-marcus:codex/caseops-matrixlr-ttt3500

Conversation


@simon-marcus simon-marcus commented Apr 29, 2026

Record candidate: CaseOps + Matrix-LR 0.028 + TTT n=1 / LoRA LR 8e-5

val_bpb: 1.06032 (seed-matched 3-seed composite mean vs #1855, std 0.00114) | 15.90 MB max | 8xH100 SXM | 600s train | score-first TTT eval

This updates the original PR #1925 result by reusing the same trained/quantized artifacts and changing only the score-first TTT eval procedure, which remains legal:

  • PHASED_TTT_PREFIX_DOCS=3500
  • PHASED_TTT_NUM_PHASES=1
  • TTT_LORA_LR=0.00008

No retraining or re-quantization is included in the updated headline result. The composite logs make that explicit: each contains the original train/quant section, a COMPOSITE EVAL-ONLY UPDATE FOR PR #1925 marker, and the updated TTT eval continuation.

Seed-Matched 3-Seed Results

The primary score report uses the exact seed set from #1855 (42, 0, 1234) for a direct paired comparison.

| Seed | Updated log | Steps | Pre-quant BPB | Quantized BPB | Updated TTT BPB | Artifact bytes | Eval time | Delta vs #1855 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 42 | train_seed42_ttt_n1_lora8e5.log | 4,994 | 1.06350701 | 1.07204922 | 1.05906444 | 15,896,241 | 367.2s | -0.00083010 |
| 0 | train_seed0_ttt_n1_lora8e5.log | 4,975 | 1.06523234 | 1.07359331 | 1.06059202 | 15,898,523 | 370.1s | -0.00065411 |
| 1234 | train_seed1234_ttt_n1_lora8e5.log | 4,965 | 1.06571315 | 1.07432062 | 1.06129561 | 15,902,776 | 367.8s | -0.00079134 |
| Mean | | 4,978 | 1.06481750 | 1.07332105 | 1.06031736 | 15,899,180 | 368.4s | -0.00075852 |

Seed-matched std over the updated TTT BPBs is 0.00114. The matched #1855 mean is 1.06107587, so this improves the paired comparison by 0.00075852 BPB. It also improves the original PR #1925 mean 1.06049099 by 0.00017363 BPB, with all three matched seeds individually better.
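The composite mean, matched std, and paired delta can be re-derived from the per-seed numbers in the table. A quick standard-library check (values copied from the table; the reported 0.00114 implies a sample std with n-1):

```python
import statistics

# Updated TTT BPBs for seeds 42, 0, 1234 (from the table above)
updated = [1.05906444, 1.06059202, 1.06129561]
# Per-seed deltas vs the matched #1855 runs (last column of the table)
deltas = [-0.00083010, -0.00065411, -0.00079134]

mean_bpb = statistics.mean(updated)    # composite 3-seed mean
std_bpb = statistics.stdev(updated)    # sample std (n-1), as reported
mean_delta = statistics.mean(deltas)   # paired improvement vs #1855

print(round(mean_bpb, 8))    # 1.06031736
print(round(std_bpb, 5))     # 0.00114
print(round(mean_delta, 8))  # -0.00075852
```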

What Changed Since Initial PR #1925

  • Updated train_gpt.py default TTT_LORA_LR from 0.0001 to 0.00008.
  • Kept the existing PHASED_TTT_PREFIX_DOCS=3500 and PHASED_TTT_NUM_PHASES=1 default.
  • Added TTT_EVAL_UPDATE.md with the compact matched eval-only table.
  • Added composite logs for seeds 0, 42, and 1234.
  • Did not add model artifacts; only logs/docs/code metadata are included.

Compliance / Legality

  • Training is capped at 600s on 8xH100; seed-matched logs show 599.45s to 599.64s.
  • Updated TTT eval is under 600s; observed range 367.2s to 370.1s.
  • All artifacts are under 16,000,000 bytes; seed-matched max observed 15,902,776.
  • TTT is score-first and single-pass: each chunk is evaluated before adaptation and not rescored.
  • No validation tokens are used for training or pre-quant adaptation.
  • No SLOT.
  • No n-gram cache and no logit bias.
  • No ETLB.
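The budget bullets above can be machine-checked against the logged numbers. A minimal sketch using the values quoted in this PR (not the repo's actual compliance checker):

```python
# Seed-matched observations quoted above (seeds 42, 0, 1234)
train_seconds = [599.45, 599.64]           # reported min/max of the train range
eval_seconds = [367.2, 370.1, 367.8]       # updated TTT eval times per seed
artifact_bytes = [15_896_241, 15_898_523, 15_902_776]

WALLCLOCK_CAP_S = 600          # train cap on 8xH100
EVAL_CAP_S = 600               # cap for the updated TTT eval pass
ARTIFACT_CAP_BYTES = 16_000_000

assert all(t < WALLCLOCK_CAP_S for t in train_seconds)
assert all(t < EVAL_CAP_S for t in eval_seconds)
assert all(b < ARTIFACT_CAP_BYTES for b in artifact_bytes)
print("all budgets satisfied")
```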

Key Techniques

  1. CaseOps SP8192 tokenizer and byte-sidecar path, using the lossless caps reserved tokenizer.
  2. 11-layer 512d XSA stack with U-Net skips, parallel decoder, depth recurrence, SparseAttnGate, BOS-fixed SmearGate, and LeakyReLU(0.5)^2 MLP.
  3. Polar-Express Newton-Schulz Muon plus the tuned quant/compression stack: GPTQ int6 matrices, int7 embeddings, int8 row gate, LQER asymmetric rank-4 correction, and per-group lrzip + brotli compression.
  4. Final deltas: MATRIX_LR=0.028, PHASED_TTT_PREFIX_DOCS=3500, PHASED_TTT_NUM_PHASES=1, and TTT_LORA_LR=0.00008.
  5. Score-first phased TTT stays on the post-quant model and scores every chunk before any update.
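The score-first, single-pass discipline in items 4-5 can be sketched as follows. `score_fn` and `adapt_fn` are hypothetical stand-ins for the real scoring and LoRA-update steps in train_gpt.py; the point is only the ordering invariant (score each chunk before adapting on it, never rescore):

```python
def score_first_ttt(chunks, score_fn, adapt_fn, state):
    """Return per-chunk losses under score-first, single-pass TTT.

    score_fn(state, chunk) -> loss under the *current* (pre-update) weights.
    adapt_fn(state, chunk) -> new state after an update on that chunk.
    """
    losses = []
    for chunk in chunks:                        # single pass, in order
        losses.append(score_fn(state, chunk))   # score before adapting
        state = adapt_fn(state, chunk)          # then update; never rescore
    return losses

# Toy check: with a "model" that is just a scalar bias, the first chunk is
# scored with the initial state, untouched by any adaptation.
losses = score_first_ttt(
    chunks=[1.0, 2.0, 3.0],
    score_fn=lambda s, c: abs(c - s),
    adapt_fn=lambda s, c: s + 0.5 * (c - s),  # move halfway toward the chunk
    state=0.0,
)
```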

Reproduction

DATA_PATH=./data/datasets/fineweb10B_sp8192_lossless_caps_caseops_v1_reserved \
TOKENIZER_PATH=./data/tokenizers/fineweb_8192_bpe_lossless_caps_caseops_v1_reserved.model \
CASEOPS_ENABLED=1 \
VOCAB_SIZE=8192 \
ITERATIONS=20000 \
MAX_WALLCLOCK_SECONDS=600 \
TTT_ENABLED=1 \
PHASED_TTT_ENABLED=1 \
PHASED_TTT_PREFIX_DOCS=3500 \
PHASED_TTT_NUM_PHASES=1 \
TTT_LORA_LR=0.00008 \
EMBED_BITS=7 \
MATRIX_LR=0.028 \
MIN_LR=0.1 \
MLP_CLIP_SIGMAS=11.5 \
ATTN_CLIP_SIGMAS=13.0 \
EMBED_CLIP_SIGMAS=14.0 \
GRAD_CLIP_NORM=0.3 \
TTT_CHUNK_SIZE=48 \
WARMUP_STEPS=20 \
MUON_BACKEND_STEPS=5 \
GLOBAL_TTT_MOMENTUM=0.9 \
WARMDOWN_FRAC=0.85 \
BETA2=0.99 \
TTT_BETA2=0.99 \
TTT_WEIGHT_DECAY=0.5 \
TTT_LORA_RANK=80 \
SPARSE_ATTN_GATE_SCALE=0.5 \
GPTQ_RESERVE_SECONDS=0.5 \
GPTQ_CALIBRATION_BATCHES=16 \
VAL_LOSS_EVERY=0 \
TRAIN_LOG_EVERY=50 \
GATED_ATTN_QUANT_GATE=1 \
SPARSE_ATTN_GATE_ENABLED=1 \
GATE_WINDOW=12 \
SMEAR_GATE_ENABLED=1 \
LQER_ENABLED=1 \
LQER_ASYM_ENABLED=1 \
LQER_RANK=4 \
LQER_FACTOR_BITS=4 \
LQER_ASYM_GROUP=64 \
LQER_TOP_K=3 \
FUSED_CE_ENABLED=1 \
COMPRESSOR=pergroup \
NCCL_NET=Socket \
SEED=42 \
torchrun --standalone --nproc_per_node=8 train_gpt.py

@simon-marcus simon-marcus changed the title Record candidate: CaseOps + Matrix-LR 0.028 + Phased TTT 3500 Record candidate: 1.06049 CaseOps + Matrix-LR 0.028 + Phased TTT 3500 Apr 29, 2026
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 29, 2026
- spec 060N: compound AWQ-lite (PR openai#1908) + 4 TTT phases + 3000 prefix
  + 2 global-SGD epochs, eval-only on 060A's final_model.pt. Single-shot
  compound to use openai#1918's ~205s eval-time slack; safe fallback drops
  GLOBAL_TTT_EPOCHS if wallclock blows.
- new idea 1925-matrix-lr-ttt-prefix-tune (PR openai#1925, hyperparam-only
  on openai#1855: MATRIX_LR=0.028 + PHASED_TTT_PREFIX_DOCS=3500 → 1.06109).
- new idea 1915-per-doc-lora-ttt (PR openai#1915, per-doc-only LoRA TTT
  discipline; parked as fallback if global-SGD class is ruled out).
- frontier scan: 21 new PRs (openai#1906-openai#1931). Headline: PRs openai#1908+openai#1918
  independently confirm AWQ-lite mixed-bit GPTQ pattern at ~1.0608 on
  openai#1855 base; openai#1925 hyperparam-only at 1.06109; openai#1923 Asymmetric Logit
  Rescale = empirical negative; openai#1929 banned SLOT+prequant-TTT.
- frontier-state.json: 21 PRs added; total 200.
- diary/2026-04-29-frontier-scan.md: full scan report.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
alertcat added a commit to alertcat/parameter-golf that referenced this pull request Apr 29, 2026
…ams)

After 4 parallel research agents reviewed 30+ open PRs and
compliance issues, two new findings:

1. PR openai#1923 (AsymLogit) flagged "empirical negative" by
   sunnypatneedi 4-29 frontier-scan, BUT only on PR openai#1855 base
   with default WD=1.0. Never tested on PR openai#1908 + WD=2.0 combo.
   V19's specific stack is NOT directly invalidated.

2. PR openai#1925 simon-marcus 1.06049 (3-seed verified, vs PR openai#1855
   base 1.06108 = -0.00059 BPB). Just 2 hparam env vars:
     MATRIX_LR 0.026 -> 0.028
     PHASED_TTT_PREFIX_DOCS 2500 -> 3500
   Orthogonal axis to AsymLogit (LR/TTT prefix vs logit head).

Adds two new scout scripts:
- run_v19c_stacked_scout.sh: PR openai#1908 + AsymLogit + simon-marcus
  + WD=2.0 (full stack, recommended first scout)
- run_v19b_simonmarcus_scout.sh: PR openai#1908 + simon-marcus + WD=2.0
  (ablation if V19c wins partially)

Decision rule (CaseOps val baseline 0.97651, community floor 0.0006):
  V19c < 0.97591 -> CLEAR WIN, run 3-seed
  V19c 0.97591-0.9755 -> borderline, ablate via V19a/V19b
  V19c > 0.9755 -> abandon stack, try Lead B (PR openai#1884)

Other research findings:
- PR openai#1898 SpinQuant flagged regression vs parent openai#1851 (skip)
- PR openai#1929 SLOT banned per openai#1722 precedent
- PR openai#1911 pre-quant TTT chain banned per openai#1735 precedent
- cocohearts 4-28 PR openai#1902 confirmed PR openai#1855 as official openai#1
- regina-openai + Alex Zhao 48h zero activity
- CaseOps de-facto legal (PR openai#1855 merged into chain)
@simon-marcus simon-marcus changed the title Record candidate: 1.06049 CaseOps + Matrix-LR 0.028 + Phased TTT 3500 Record candidate: 1.06032 CaseOps + Matrix-LR 0.028 + TTT n=1 Apr 29, 2026
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 30, 2026
…LoRA LR=8e-5) eval-only on spec 250 outputs
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request May 1, 2026
…LoRA LR=8e-5) eval-only on spec 250 outputs
@simon-marcus (Author)

@cocohearts Gently flagging a possible missed row: #1925.

I think it may have fallen into a scan-range gap. #1902 says it scanned PRs #1494-#1908, and the subsequent audit PR #2146 says it scanned #1944-#2140; so my #1925 may have "fallen between the chairs," as they say.

The chronology is a little hard to see from final PR states because both #1925 and #1945 were updated after opening, but here's how it looks:

  • #1925 opened 2026-04-29T11:36:39Z
  • #1925 commit 77c39308 landed 2026-04-29T14:24:29Z with mean 1.06049099
  • #1945 opened later at 2026-04-29T19:20:30Z
  • #1925 final eval update commit 7cd4308d landed 2026-04-29T20:26:14Z, improving the same PR to mean 1.06031736
  • #1945’s first strict-under-600s V21 v2 result appears to be commit 70067534 at 2026-04-29T21:45:35Z; the earlier 3f49b5e2 result had the seed-42 602.048s issue discussed here
  • I flagged #1925 on #1902 at 2026-04-29T22:18:56Z

I don’t think #1925 affects the final top row. But if technically acceptable, it seems like it would be a chronological frontier/support row after 77c39308. The final #1925 mean is 1.06031736 using the #1855 matched seeds 42/0/1234, improving all three matched seeds vs #1855.

No worries if I’m missing an exclusion rationale; just flagging because the scan ranges appear to skip this PR numerically. Thanks.

