
Record: SP8192 + CaseOps + Gated Attention + Quant Gate + Loop4-5 + Phased TTT + Frozen Recurrent Alpha — val_bpb 1.06421#1779

Open
leon2k2k2k wants to merge 1 commit into openai:main from leon2k2k2k:submission/030-frozen-recur-alpha

Conversation

@leon2k2k2k

Summary

val_bpb = 1.06421 (3-seed mean, std 0.00023) | ~15.98 MB | 8×H100 SXM, 600s train / 600s eval | Phased TTT

3-Seed Results

| Seed | Pre-TTT BPB | TTT BPB | val_loss (nats/tok) | Artifact (bytes) | Train | Eval |
|------|-------------|---------|---------------------|------------------|-------|------|
| 1    | 1.07704 | 1.06395 | 2.32833 | 15,976,882 | 596.1s | 458.7s |
| 777  | 1.07737 | 1.06429 | 2.32906 | 15,975,842 | 596.1s | 458.9s |
| 2025 | 1.07742 | 1.06438 | 2.32927 | 15,976,882 | 596.1s | 453.4s |
| Mean | 1.07728 | 1.06421 | 2.32889 | 15,976,535 | 596.1s | 457.0s |
| Std  | 0.00021 | 0.00023 | | | | |

All 3 seeds clear both 600s budgets (train + eval) and the 16,000,000-byte decimal artifact cap.

Key Techniques

  1. SP8192 CaseOps — Lossless reversible case normalization (TITLE/ALLCAPS/CAPNEXT/ESC operators). Pending issue #1604 ("Clarify which text normalizations are allowed for custom tokenizers").
  2. GatedAttn + QuantGate (PR #1736) — Full-dim attention gate with int8 passthrough.
  3. Loop4-5 depth recurrence (PR #1736) — NUM_LOOPS=2, recurrence on layers 3–5.
  4. Frozen Recurrent Alpha/Beta (this PR) — Learnable cross-layer blend scalars (RECUR_ALPHA_ENABLED=1) trained to convergence, then frozen as constants baked into the artifact. The L4 self-subtract (α = −0.348) acts as a learned gate; L5 aggregates signal from L3 + L4.
  5. LoRA-TTT improvements (PR #1767) — Warm-start A matrix, alpha=144, WD=1.0. Phased score-first TTT (3 phases, 2000 prefix docs).
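The frozen-recurrence pattern in item 4 can be sketched in a few lines. This is a minimal illustration, not the submission's actual train_gpt.py code: the function name, shapes, identity-valued placeholder constants, and the exact mixing rule are assumptions; only the alpha/beta roles are taken from the description above.

```python
import numpy as np

# Placeholder constants: in the real submission these 3 + 9 scalars were
# trained to convergence offline, then frozen into the artifact
# (RECUR_ALPHA_ENABLED=1). Identity values are used here for illustration.
RECUR_BETA = np.ones(3)          # per-layer self-scale for layers 3-5
RECUR_ALPHA = np.zeros((3, 3))   # cross-layer mix; a negative diagonal entry
                                 # (e.g. alpha[1, 1] ~ -0.348 for L4) would act
                                 # as a learned self-subtract gate

def blend(layer_outs):
    """Blend the outputs of layers 3-5 from one loop pass (NUM_LOOPS=2)
    into the inputs for the next pass."""
    return [
        RECUR_BETA[i] * layer_outs[i]
        + sum(RECUR_ALPHA[i, j] * layer_outs[j] for j in range(3))
        for i in range(3)
    ]
```

With beta = 1 and alpha = 0 the blend is the identity, so the mechanism degrades gracefully to plain Loop4-5 recurrence when disabled.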

Rule Compliance

Test Plan

  • Reviewer reproduces any single seed with the provided train_gpt.py and env vars from README
  • Verify artifact size < 16,000,000 bytes in each seed log
  • Verify score-first TTT ordering in code

🤖 Generated with Claude Code

3-seed mean val_bpb = 1.06421 (std 0.00023), val_loss = 2.32888 nats/token.
-0.00128 BPB vs base submission PR openai#1736 (1.06549).

Adds frozen cross-layer recurrent alpha/beta scalars (RECUR_ALPHA_ENABLED=1,
NUM_LOOPS=2) on the openai#1736 base. Scalars trained to convergence then frozen
as constants baked into the artifact. Also includes LoRA-TTT improvements
from PR openai#1767 (warm-start A, alpha=144, WD=1.0).

All 3 seeds clear 16 MB decimal cap (max 15,976,882 B) and both 600s budgets.
@leon2k2k2k leon2k2k2k changed the title SP8192 CaseOps + Frozen Recurrent Alpha/Beta — val_bpb 1.06421 Record: SP8192 + CaseOps + Gated Attention + Quant Gate + Loop4-5 + Phased TTT + Frozen Recurrent Alpha — val_bpb 1.06421 Apr 22, 2026
@dexhunter
Contributor

Hi @leon2k2k2k — nice work on the frozen recurrent alpha/beta mechanism, interesting pattern. Heads-up that we just patched the same bug in PR #1736 and PR #1769 and wanted to flag it here since this submission forks the same prep script.

**Issue.** `prepare_caseops_data.py` line 157 uses `sp.encode(transformed, out_type=int)` without prepending `BOS_ID = 1`. The SP tokenizer reserves IDs 0–7 (`<pad>`, `<s>`, `</s>`, `<unk>`, TITLE, ALLCAPS, CAPNEXT, ESC), so `sp.encode` can't emit ID 1 naturally. `train_gpt.py:_find_docs` (line 2209) then returns `[]` and `_loss_bpb_from_sums` (line 2303) divides by zero in the phased TTT eval path. Training survives via the `_init_shard:408–409` fallback (`if self.bos_idx.size == 0: self.bos_idx = np.array([0], ...)`); phased TTT has no equivalent fallback.

**Scope.** Prep-script only: the submitted 1.06421 is on valid data if your seed runs used shards produced by an internal pipeline that already prepends BOS. `val_bpb` reduces to `loss_sum / ln(2) / byte_sum` (token counts cancel at line 2303), and `byte_sum` is unchanged with BOS prepended (BOS contributes 0 original bytes). So the metric is fine; only external reproductions crash.
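The cancellation argument is easy to check numerically. A sketch with made-up numbers (the function and variable names are illustrative, not from train_gpt.py):

```python
import math

def val_bpb(loss_nats, orig_bytes):
    # bpb = loss_sum / ln(2) / byte_sum; per-token counts cancel,
    # so only the two sums matter.
    return sum(loss_nats) / math.log(2) / sum(orig_bytes)

losses = [2.1, 2.5, 2.4]   # per-token losses in nats
byts = [4, 3, 5]           # original bytes each token covers

# Prepending BOS adds a token charged 0 original bytes; as long as the
# eval path excludes it from the loss sum (it is a conditioning token,
# not a prediction target), byte_sum and hence bpb are unchanged.
assert val_bpb(losses, [0] + byts) == val_bpb(losses, byts)
```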

**Fix.** Four-line diff, matching the canonical pattern in `data/download_hf_docs_and_tokenize.py:364–366`:

    # near module top, with other constants
    BOS_ID = 1

    # inside the per-doc loop
    for text in _iter_docs(args.docs):
        transformed = encode_lossless_caps_v2(text)
        token_ids = [BOS_ID] + sp.encode(transformed, out_type=int)
        if n_docs < args.val_docs:
            byte_counts = _token_original_byte_counts(sp, text, transformed)
            val_buf_tokens.extend(token_ids)
            val_buf_bytes.append(0)  # BOS = 0 original bytes
            val_buf_bytes.extend(int(b) for b in byte_counts)
        else:
            train_buf.extend(token_ids)

Our fix commits are available for reference; the same diff applies cleanly.

Originally reported by @codemath3000 on PR #1736. Adding a bos_count > 0 sanity check to the README's reproduction section also helps catch it early.
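The suggested bos_count sanity check might look like the following. This is a hypothetical helper: the real shard format and loader are defined by the submission's prep script, and the uint16 dtype is an assumption.

```python
import numpy as np

BOS_ID = 1  # per the SP tokenizer layout above (<s> = 1)

def check_bos(shard_path):
    """Fail fast if a tokenized shard contains no BOS tokens -- the
    symptom of the missing-prepend bug described above."""
    tokens = np.fromfile(shard_path, dtype=np.uint16)  # dtype assumed
    bos_count = int((tokens == BOS_ID).sum())
    if bos_count == 0:
        raise ValueError(
            f"{shard_path}: no BOS tokens; rerun prep with the BOS prepend fix"
        )
    return bos_count
```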

@nprime06

Flagging a legality question on the `recur_beta` / `recur_alpha` buffers:

  self.register_buffer("recur_beta",
      torch.tensor([1.5973426, 1.8826205, 1.9906198], dtype=torch.float32))
  self.register_buffer("recur_alpha",
      torch.tensor([[ 0.251953125, -0.02099609375, -0.01239013671875],
                    [ 0.06689453125, -0.34765625,    0.0031280517578125],
                    [ 0.138671875,   0.2412109375,   0.0272216796875]],
                   dtype=torch.float32))

These 12 floats were "trained to convergence then frozen as constants" — i.e. discovered by offline gradient descent on the training loss and baked into the shipped artifact outside the 10-minute training budget.

Arguments that this is legal: all 48 bytes are charged against the 16 MB cap, they're registered openly as buffers rather than hidden, and the README explicitly permits offline tuning ("Tuning your Adam hyperparameters across a bunch of runs is fine").

Arguments that it's contested: unlike hyperparameters found by search (which encode "what config generally works for this architecture"), these values are gradient-descent outputs on FineWeb training data. They carry task-specific learned signal that required task-relevant gradient steps outside the budget to discover, and the register_buffer + bake-in-converged-values pattern generalizes to arbitrarily many pre-trained weights if precedent is set at 12 floats. This arguably violates the spirit of the competition.

Just surfacing for clarification.

