
SP8192 + CaseOps + Loop345 + Recur-Alpha + PhasedTTT #1766

Open
tashapais wants to merge 2 commits into openai:main from tashapais:submission/sp8192-caseops-recur-alpha

Conversation

@tashapais

Summary

Adds Recur-Alpha to the PR #1736 stack (CaseOps + GatedAttn + QuantGate + Loop345 + PhasedTTT).

Recur-Alpha is a learned scalar per looped block (init=0) that adds a weighted copy of each block's first-visit activation to its subsequent recurrence passes — a lightweight, GRU-like carry inside the depth recurrence. The idea originates in PR #1714, where it was implemented on the older SP8192 stack, but TTT evaluation was never completed there. This PR is the first composition of Recur-Alpha with the CaseOps + phased TTT stack.

The only code change from PR #1736:

```python
# Block.__init__
self.recur_alpha = nn.Parameter(torch.zeros(1))

# forward_logits + forward_ttt — carry dict in encoder/decoder loops
carry = {}
for i in enc_iter:
    x = block(...)
    if self.looping_active:
        if i in carry:
            x = x + self.blocks[i].recur_alpha.to(dtype=x.dtype) * carry[i]
        carry[i] = x
    skips.append(x)

# decoder non-parallel branch:
x = block(...)
if self.looping_active and i in carry:
    x = x + self.blocks[i].recur_alpha.to(dtype=x.dtype) * carry[i]
```
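A self-contained toy version of the carry logic above (the `ToyBlock` and the loop schedule here are hypothetical stand-ins for the real `Block` and encoder/decoder iteration) illustrates the mechanism and confirms the claim that zero initialization preserves the base model at step 0:

```python
import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    # Hypothetical stand-in for the real transformer Block.
    def __init__(self, dim):
        super().__init__()
        self.lin = nn.Linear(dim, dim)
        # Recur-Alpha: one learned scalar per looped block, init=0.
        self.recur_alpha = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        return x + torch.tanh(self.lin(x))

def run(blocks, schedule, x, looping_active=True):
    # schedule revisits block indices, e.g. [0, 1, 2, 0, 1, 2] for 2 loops.
    carry = {}
    for i in schedule:
        x = blocks[i](x)
        if looping_active:
            if i in carry:  # revisit: add alpha-weighted earlier activation
                x = x + blocks[i].recur_alpha.to(dtype=x.dtype) * carry[i]
            carry[i] = x
        x = x  # (skips.append(x) elided in this toy version)
    return x

blocks = nn.ModuleList(ToyBlock(8) for _ in range(3))
x = torch.randn(2, 8)
schedule = [0, 1, 2, 0, 1, 2]  # 3 blocks x 2 loops

with torch.no_grad():
    base = run(blocks, schedule, x, looping_active=False)
    out = run(blocks, schedule, x, looping_active=True)

# With alpha=0 the carry term contributes nothing, so the looped
# forward pass matches the baseline exactly at step 0.
print(torch.allclose(base, out))  # True
```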

Cost: 3 parameters (one per looped block). Each scalar is ndim=1, so it falls into the scalar AdamW group and is excluded from both GPTQ quantization and Muon. Artifact size impact: negligible.
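The ndim=1 routing can be sketched as a simple parameter partition (the group construction and optimizer settings below are illustrative assumptions, not the repo's exact optimizer setup):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 8))
# hypothetical Recur-Alpha style scalar registered on the model
model.recur_alpha = nn.Parameter(torch.zeros(1))

# Route by dimensionality: ndim >= 2 tensors (weight matrices) go to the
# matrix optimizer (Muon in this stack) and through GPTQ; ndim <= 1 tensors
# (biases, scalars like recur_alpha) go to plain AdamW and skip both.
matrix_params = [p for p in model.parameters() if p.ndim >= 2]
scalar_params = [p for p in model.parameters() if p.ndim <= 1]

adamw = torch.optim.AdamW(scalar_params, lr=3e-4)
print(len(matrix_params), len(scalar_params))  # 2 5-element? no: 2 matrices, 3 scalars/vectors
```

With two Linear layers this yields 2 matrix parameters and 3 ndim≤1 parameters (two biases plus the new scalar), matching the "scalar AdamW" split described above.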

Full Technique Stack

  1. SP8192 tokenizer
  2. CaseOps — bijective lossless case preprocessing; BPB on original UTF-8 bytes
  3. 3-Layer Depth Recurrence — layers 3, 4, 5 × 2 loops (17 virtual layers), activates at 35%
  4. Recur-Alpha — learned carry scalar per looped block (init=0) (novel)
  5. Gated Attention — per-head sigmoid output gate, init_std=0.01
  6. Quant Gate — int8-per-row quantization of attn_gate_w
  7. Parallel Residuals — GPT-J style from layer 8
  8. QK-Gain 5.0 — learned per-head query scalar
  9. Full-Hessian GPTQ — int6 matrices, int8 embeddings, SDClip
  10. MuonEq-R — row-normalized Muon + AdamW
  11. Phased TTT — score-first LoRA SGD, per-doc reset, cosine LR decay
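As a rough sketch of item 5's per-head sigmoid output gate (the gate-from-residual-stream formulation, shapes, and module name below are assumptions inferred from the one-line description and the `attn_gate_w` identifier, not the repo's implementation):

```python
import torch
import torch.nn as nn

class GatedAttnOutput(nn.Module):
    # Hypothetical per-head sigmoid gate on attention output.
    def __init__(self, dim, n_heads):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = dim // n_heads
        # std-0.01 init keeps gate logits near 0, so gates start near 0.5
        self.attn_gate_w = nn.Parameter(torch.randn(dim, n_heads) * 0.01)

    def forward(self, x, attn_out):
        # x: (B, T, dim) residual-stream input
        # attn_out: (B, T, n_heads, head_dim) attention output per head
        g = torch.sigmoid(x @ self.attn_gate_w)          # (B, T, n_heads)
        return (attn_out * g.unsqueeze(-1)).reshape(*x.shape)

gate = GatedAttnOutput(dim=64, n_heads=4)
x = torch.randn(2, 10, 64)
attn_out = torch.randn(2, 10, 4, 16)
y = gate(x, attn_out)
print(y.shape)  # torch.Size([2, 10, 64])
```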

Reproduction

```shell
pip install brotli sentencepiece
pip install flash_attn_3 --no-deps --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291/
python prepare_caseops_data.py

SEED=42 CASEOPS_ENABLED=1 GATED_ATTN_ENABLED=1 GATED_ATTN_QUANT_GATE=1 \
  torchrun --standalone --nproc_per_node=8 train_gpt.py
```

Results pending on 8xH100 hardware.

Test plan

  • Run 3 seeds (42, 0, 1234) on 8xH100s
  • Verify training completes under 600s
  • Verify artifact under 16,000,000 bytes
  • Verify sliding-window + TTT eval under 600s
  • Report val_bpb for each seed

Credits

tashapais and others added 2 commits April 21, 2026 17:14
Adds Recur-Alpha (learned carry scalar per looped block, init=0) to the
PR openai#1736 CaseOps+GatedAttn+QuantGate+Loop345+PhasedTTT stack. The only
code change is 3 new nn.Parameter(zeros(1)) scalars in Block.__init__ and
carry-dict logic in both forward_logits and forward_ttt encoder/decoder
loops. Zero initialization preserves the base model at step 0.

Results pending on 8xH100 hardware.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rchitecture test

- RECUR_ALPHA_ENABLED=0 disables carry additions for ablation runs without
  changing the depth recurrence architecture; freezes recur_alpha params
- Logs recur_alpha values at loop activation and end of training so 1xH100
  smoke runs can confirm the scalars are learning
- test_architecture.py: CPU-only test (stubs FA3/triton) covering model
  instantiation, index layout, forward passes, gradient flow, and carry effect

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
