Skip to content

Record: SP8192 + SLOT scored-position + cross-batch EMA warmup: val_bpb=0.94569#1929

Open
davie2009kh wants to merge 2 commits intoopenai:mainfrom
davie2009kh:submission/slot-scored-position-ema
Open

Record: SP8192 + SLOT scored-position + cross-batch EMA warmup: val_bpb=0.94569#1929
davie2009kh wants to merge 2 commits intoopenai:mainfrom
davie2009kh:submission/slot-scored-position-ema

Conversation

@davie2009kh
Copy link
Copy Markdown

SP8192 + 3-Layer Recurrence + Parallel Residuals + QK-Gain 5.25 + Pre-Quant TTT + Scored-Position SLOT

val_bpb: 0.94569 (3-seed mean: seeds 1337, 42, 2025)

Seed val_bpb
1337 0.95036466
42 0.96014543
2025 0.92656976

What this submission does

This builds on the kilojoules/alertcat PR #1738 stack and adds Scored-Position SLOT at evaluation time:

  • Per-sample delta [bsz, 1, d_model] and logit_bias [bsz, 1, vocab] in fp32
  • AdamW optimizer, 24 steps, cosine LR 0.008 → 0.0008
  • Optimization target: scored positions only (past tokens, no look-ahead)
  • Cross-batch EMA warmup (decay=0.5): converged delta/logit_bias means are carried forward as initialization for the next batch, giving each batch a head start on convergence at zero extra parameter cost
  • SLOT runs only during eval_val_sliding on the quantized model; training is unmodified

Training wall-clock: ~588s (within 600s budget).
Eval wall-clock: ~1405s (SLOT optimization over full val set).
Artifact size: ~15.87MB (within 16MB budget).

Base stack

SLOT lineage

Scored-Position SLOT is inspired by @resouer PR #1229. Key differences: per-sample (not shared) delta, cross-batch EMA prior warmup, and restriction to scored positions only via a boolean mask.

leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 29, 2026
- spec 060N: compound AWQ-lite (PR openai#1908) + 4 TTT phases + 3000 prefix
  + 2 global-SGD epochs, eval-only on 060A's final_model.pt. Single-shot
  compound to use openai#1918's ~205s eval-time slack; safe fallback drops
  GLOBAL_TTT_EPOCHS if wallclock blows.
- new idea 1925-matrix-lr-ttt-prefix-tune (PR openai#1925, hyperparam-only
  on openai#1855: MATRIX_LR=0.028 + PHASED_TTT_PREFIX_DOCS=3500 → 1.06109).
- new idea 1915-per-doc-lora-ttt (PR openai#1915, per-doc-only LoRA TTT
  discipline; parked as fallback if global-SGD class is ruled out).
- frontier scan: 21 new PRs (openai#1906-openai#1931). Headline: PRs openai#1908+openai#1918
  independently confirm AWQ-lite mixed-bit GPTQ pattern at ~1.0608 on
  openai#1855 base; openai#1925 hyperparam-only at 1.06109; openai#1923 Asymmetric Logit
  Rescale = empirical negative; openai#1929 banned SLOT+prequant-TTT.
- frontier-state.json: 21 PRs added; total 200.
- diary/2026-04-29-frontier-scan.md: full scan report.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
alertcat added a commit to alertcat/parameter-golf that referenced this pull request Apr 29, 2026
…ams)

After 4 parallel research agents reviewed 30+ open PRs and
compliance issues, two new findings:

1. PR openai#1923 (AsymLogit) flagged "empirical negative" by
   sunnypatneedi 4-29 frontier-scan, BUT only on PR openai#1855 base
   with default WD=1.0. Never tested on PR openai#1908 + WD=2.0 combo.
   V19's specific stack is NOT directly invalidated.

2. PR openai#1925 simon-marcus 1.06049 (3-seed verified, vs PR openai#1855
   base 1.06108 = -0.00059 BPB). Just 2 hparam env vars:
     MATRIX_LR 0.026 -> 0.028
     PHASED_TTT_PREFIX_DOCS 2500 -> 3500
   Orthogonal axis to AsymLogit (LR/TTT prefix vs logit head).

Adds two new scout scripts:
- run_v19c_stacked_scout.sh: PR openai#1908 + AsymLogit + simon-marcus
  + WD=2.0 (full stack, recommended first scout)
- run_v19b_simonmarcus_scout.sh: PR openai#1908 + simon-marcus + WD=2.0
  (ablation if V19c wins partially)

Decision rule (CaseOps val baseline 0.97651, community floor 0.0006):
  V19c < 0.97591 -> CLEAR WIN, run 3-seed
  V19c 0.97591-0.9755 -> borderline, ablate via V19a/V19b
  V19c > 0.9755 -> abandon stack, try Lead B (PR openai#1884)

Other research findings:
- PR openai#1898 SpinQuant flagged regression vs parent openai#1851 (skip)
- PR openai#1929 SLOT banned per openai#1722 precedent
- PR openai#1911 pre-quant TTT chain banned per openai#1735 precedent
- cocohearts 4-28 PR openai#1902 confirmed PR openai#1855 as official openai#1
- regina-openai + Alex Zhao 48h zero activity
- CaseOps de-facto legal (PR openai#1855 merged into chain)
@anmarhindi
Copy link
Copy Markdown

This appears to be a score-before-update violation. As described, pre_quant_adamw_ttt performs AdamW adaptation over the full validation stream before scoring the same stream. Therefore the model used to score token x_t has already been updated using x_t.

Under the rules, the predictive distribution for x_t must be fixed before observing or updating on x_t; only after the score is recorded may x_t be used to update state for future tokens. If the 28-epoch validation adaptation happens before scoring, this is score-after-adapt TTT, not legal score-first adaptation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants