Non-record: Parcae Loop Injection + Gemma-style Attention + Gram NS #1674

Open
mikeapedia wants to merge 1 commit into openai:main from mikeapedia:nonrecord/parcae-gemma-gramns

Conversation


@mikeapedia mikeapedia commented Apr 16, 2026

Summary

Non-record research submission. Builds on PR #1648 (xIELU activation + per-layer QK gain convergence). No compute credits to verify on competition eval infrastructure — sharing these ideas for the community to build on.

Four techniques on top of the PR #1586 base with xIELU + QK gain:

1. Parcae Constrained Loop Injection

Inspired by Parcae, this applies an SSM-style boundary condition at loop re-entry points. Instead of passing the hidden state straight through between depth-recurrence iterations, it applies a learned decay + injection:

delta = softplus(loop_delta)           # guarantee delta > 0
A_bar = exp(delta * (-exp(loop_log_A)))  # A_bar ∈ (0, 1) by construction
B_bar = delta * loop_B
x = A_bar * x + B_bar * x0            # at each loop boundary

Three learnable parameters per layer (loop_log_A, loop_delta, loop_B) control how much the model retains from the previous loop pass versus how much of the original residual stream it re-injects. The softplus keeps delta positive, which in turn guarantees A_bar < 1 (stable decay). loop_B is initialized at 0.1 to encourage x0 re-injection from the start.

Zero throughput overhead — these are per-layer scalars, not per-channel.

We also wired up eval/TTT to support more loop iterations at eval time than were used in training (eval_extra_loops), enabling additional test-time compute scaling via set_eval_loop_indices(). log_parcae_converged() prints A_bar/B_bar statistics at the end of training for convergence tracking.
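For anyone who wants to drop this into their own loop, here is a minimal self-contained sketch of the boundary rule. It assumes per-layer scalar parameters and zero init for loop_log_A and loop_delta; only the 0.1 init for loop_B comes from this PR, so treat the rest as illustrative rather than the PR's exact module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParcaeLoopBoundary(nn.Module):
    """Learned decay + re-injection applied at each loop re-entry point (sketch)."""

    def __init__(self, b_init: float = 0.1):
        super().__init__()
        # Assumed zero init for loop_log_A / loop_delta; the 0.1 loop_B init is from the PR text.
        self.loop_log_A = nn.Parameter(torch.zeros(()))
        self.loop_delta = nn.Parameter(torch.zeros(()))
        self.loop_B = nn.Parameter(torch.full((), b_init))

    def forward(self, x: torch.Tensor, x0: torch.Tensor) -> torch.Tensor:
        delta = F.softplus(self.loop_delta)                      # delta > 0
        A_bar = torch.exp(-delta * torch.exp(self.loop_log_A))   # A_bar in (0, 1) by construction
        B_bar = delta * self.loop_B
        return A_bar * x + B_bar * x0                            # decay previous pass, re-inject x0
```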

2. Gram NS for High-Aspect-Ratio Banks + NS Steps 5→4

Based on Gram Newton-Schulz from Dao AI Lab. Added gram_newton_schulz5() that iterates on the smaller n×n Gram matrix (float32) instead of the full rectangular matrix. Dispatched automatically based on aspect ratio:

| Bank | Shape per layer | α = m/n | Method |
| --- | --- | --- | --- |
| q_bank, out_bank | 512×512 | 1.0 | Standard NS |
| k_bank | 256×512 | 2.0 | Standard NS |
| mlp_up_bank | 2048×512 | 4.0 | Gram NS |
| mlp_down_bank | 512×2048 | 4.0 | Gram NS |

Cost analysis (without symmetric GEMM): Standard NS costs n³(10α + 5), Gram NS costs n³(4α + 20). Break-even at α = 2.5 — only MLP banks (α = 4) benefit, saving ~20% on those iterations.
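As a quick sanity check on those counts (plugging the quoted per-step costs into the two formulas; a worked example, not code from the PR):

```python
# Per-NS-step cost in units of n^3, using the formulas quoted above.
def ns_cost(alpha: float) -> dict:
    return {"standard": 10 * alpha + 5, "gram": 4 * alpha + 20}

print(ns_cost(2.5))  # {'standard': 30.0, 'gram': 30.0}  -> break-even
print(ns_cost(4.0))  # {'standard': 45.0, 'gram': 36.0}  -> ~20% saved on the MLP banks
```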

Gram NS includes a restart at step 2 to kill spurious negative eigenvalues.

Also reduced muon_backend_steps from 5 → 4 (sufficient with current architecture).
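For readers without the diff handy, here is a hedged sketch of a Gram-side Newton-Schulz iteration. It is not the PR's gram_newton_schulz5: the quintic coefficients are the standard Muon values, and the step-2 restart is modeled simply as re-forming G from the current iterate; both are assumptions.

```python
import torch

def gram_newton_schulz_sketch(X: torch.Tensor, steps: int = 4) -> torch.Tensor:
    """Orthogonalize X (m x n, m >= n) by iterating on the n x n Gram matrix.

    Illustrative only. Transpose first (and transpose the result back) for wide
    banks such as mlp_down_bank, so the Gram matrix stays the smaller n x n one.
    """
    a, b, c = 3.4445, -4.7750, 2.0315                   # standard Muon quintic coefficients
    m, n = X.shape
    assert m >= n
    X32 = X.float()
    X32 = X32 / (X32.norm() + 1e-7)                     # keep singular values <= 1
    G = X32.T @ X32                                     # n x n Gram matrix, float32
    T = torch.eye(n, dtype=torch.float32, device=X.device)
    I = torch.eye(n, dtype=torch.float32, device=X.device)
    for k in range(steps):
        P = a * I + b * G + c * (G @ G)                 # per-step polynomial in G
        T = T @ P                                       # accumulate: X_k = X_0 @ T
        G = P @ G @ P                                   # Gram update (P is symmetric)
        if k == 1:                                      # assumed restart: re-form G from X_0 @ T
            Xk = X32 @ T                                # to suppress spurious negative eigenvalues
            G = Xk.T @ Xk
    return (X32 @ T).to(X.dtype)                        # single rectangular GEMM at the end
```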

3. Gemma-style Global/Local Attention

Inspired by Gemma 4's interleaved attention pattern:

  • Global layers (default: 4, 9, 10): Full causal attention with partial RoPE (rope_dims out of head_dim). Partial RoPE avoids high-frequency positional noise at long range.
  • Local layers (all others): Sliding window attention (local_window_size=512) with full RoPE (all dims). Full positional precision within the window.

Per-layer window_size is passed to both flash_attn_varlen_func and flash_attn_3_func. This is a direct throughput improvement: local layers only attend within the window instead of over the full sequence.
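A rough illustration of the per-layer dispatch follows. It is a sketch only: it uses the fixed-length flash_attn_func rather than the varlen/FA3 paths the PR actually wires up, GLOBAL_LAYERS and LOCAL_WINDOW simply restate the defaults quoted above, and the sliding-window off-by-one convention depends on the flash-attn version.

```python
import torch
from flash_attn import flash_attn_func  # assumes flash-attn >= 2.3 (window_size support)

GLOBAL_LAYERS = {4, 9, 10}   # default global-attention layer indices from this PR
LOCAL_WINDOW = 512           # sliding-window size on local layers

def attend(q, k, v, layer_idx: int) -> torch.Tensor:
    # q, k, v: (batch, seqlen, nheads, head_dim)
    if layer_idx in GLOBAL_LAYERS:
        window = (-1, -1)                 # full causal attention (partial RoPE applied upstream)
    else:
        window = (LOCAL_WINDOW - 1, 0)    # causal sliding window; off-by-one is version-dependent
    return flash_attn_func(q, k, v, causal=True, window_size=window)
```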

4. KV-Tying on Global Attention Layers

On global attention layers, K and V share the same weight matrix (v_bank entry omitted, K weights serve double duty). This:

  • Saves num_global_layers × kv_dim × model_dim parameters (3 × 256 × 512 = 393K params)
  • Reduces artifact size proportionally — freed bytes can be redeployed to less aggressive quantization clipping or increasing KV parameters
  • Serialization/deserialization handles the asymmetric V bank automatically

The separate Q/K/V/O bank structure (vs PR #1648's merged QO/KV banks) enables this selective tying. QKV weights are concatenated into a single qkv_w tensor at forward time for a fused F.linear call.
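A minimal sketch of the tied projection plus fused GEMM (shapes follow the numbers above; fused_qkv and its argument names are illustrative, not the PR's bank code):

```python
import torch
import torch.nn.functional as F

def fused_qkv(x: torch.Tensor, q_w: torch.Tensor, k_w: torch.Tensor,
              kv_tied: bool, v_w: torch.Tensor | None = None):
    # q_w: (512, 512), k_w: (256, 512); v_w is omitted on tied (global) layers.
    v_weight = k_w if kv_tied else v_w            # K weights serve double duty as V
    qkv_w = torch.cat([q_w, k_w, v_weight], dim=0)
    qkv = F.linear(x, qkv_w)                      # single fused GEMM at forward time
    q, k, v = qkv.split([q_w.size(0), k_w.size(0), v_weight.size(0)], dim=-1)
    return q, k, v
```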

How we differ from Gemma 4's K=V tying:

| | Gemma 4 (global layers) | Ours (global layers) |
| --- | --- | --- |
| Query heads | varies by model | 8 |
| KV heads | varies (much fewer) | 4 |
| GQA ratio | 8:1 (8 queries per KV head) | 2:1 (2 queries per KV head) |
| Key dim | doubled to compensate | standard (head_dim=64) |
| K=V tying | yes | yes |

Gemma does two things to compensate for K=V tying that we don't:

  1. Much more aggressive GQA on global layers (8:1) — fewer KV heads, but each one is more "important" and gets more capacity
  2. Doubled key dimensions — the shared K/V projection has 2× the feature space to work with, so one matrix can better serve both roles

We're tying K=V with a relatively mild 2:1 GQA and standard key dims. That means our shared projection has less capacity to serve both purposes. With 4 KV heads at dim=64, each head has to produce a 64-dim vector that simultaneously works for similarity matching (after RoPE + RMS norm) and content carrying (raw). That's a tighter constraint than Gemma faces.

Additional Changes

  • .contiguous() fix in Muon all_gather_into_tensor for torch 2.11 compatibility

Results

⚠️ Not verified on competition eval infrastructure — no eval, no TTT, no quantization roundtrip. Out of compute credits. All techniques have been tested for training correctness on 1×H100 and 8×H100.

Builds on PR openai#1648 (xIELU activation + per-layer QK gain). Adds four
techniques for the community to explore:

1. Parcae constrained loop injection (SSM-inspired loop boundaries)
2. Gram NS for high-aspect-ratio MLP banks (α≥2.5 breakeven) + NS 5→4
3. Gemma-style global/local attention with sliding window
4. KV-tying on global attention layers

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@mikeapedia (Author)

Sharing the ideas and code with the community in case anyone wants to incorporate these into their own experiments. My only request is that if you test them and they aren't helpful, please share those findings so we can all learn from them.
cc: @dexhunter @bigbag @abaybektursun @clarkkev @msisovic @samacqua


SPThole commented Apr 17, 2026

I had also experimented with a Gemma-style attention mechanism (sliding-window local + global). It showed better convergence compared to parallel residual architectures. However, I observed that due to Flash Attention limitations, particularly with the sliding-window local attention, the implementation becomes slower. As a result, within a limited training budget in terms of steps, the model doesn't reach lower loss values. If we can optimize this and make it faster, it has the potential to outperform current SOTA architectures on leaderboards. I had to pause the experiments due to a lack of compute credits. Will push my experimentation (on 1 H100) as a pull request and tag here again.

anthony-maio added a commit to anthony-maio/parameter-golf that referenced this pull request Apr 18, 2026
…add RUNBOOK.md

Two corrections to .claude/skills/parameter-golf/SKILL.md:

1. DO NOT upgrade torch. The template ships torch 2.8.0. Upgrading to 2.9.1
   (which the skill previously recommended) kills the torch.compile speedup --
   observed 2026-04-17: a pod went from a 213ms eager scout run to 220ms in
   compiled production, meaning compile contributed zero. New section 3b under
   provisioning gotchas documents the rule and the symptoms.

2. Pretokenized shards section expanded. MATCHED_FINEWEB_REPO_ID=
   kevclark/parameter-golf is mandatory: pretokenized shards for every
   vocab variant, no local tokenization needed. This is the single biggest
   operational enabler. Also noted we don't use a Runpod network volume --
   DC pin isn't worth it, fresh download each pod is standard.

RUNBOOK.md added at the slot-LoRA record dir. Self-contained plan covers:
- Pod provisioning loop across DCs
- One-shot bootstrap that obeys the no-torch-upgrade rule
- Job 1 (slot-LoRA sweep) with sweep spec and expected output
- Job 2 (PR openai#1674 friend-check) with inspection-first workflow
- Shared-pod path
- Final cleanup
@User123331

Parcae seems interesting, but does it work? Is it alright if I experiment with this? I still have a few credits.

msisovic (Contributor) commented Apr 18, 2026

FWIW, I did a few experiments with a similar arch (pre + loop + coda), but inspired by Geiping et al. (https://arxiv.org/abs/2502.05171). It didn't work better than the then-current SOTA, because it allowed for many more params, and actually using them ate up a lot of the training time. It was close to SOTA though. It also inspired my MiniDepthRecurrence technique: #1204. Worth revisiting with Parcae IMO.

mikeapedia added a commit to mikeapedia/parameter-golf-1 that referenced this pull request Apr 19, 2026
Sliding-window eval only (TTT_ENABLED=0 by default) on standard SP8192.
Beats merged non-casefold SOTA PR openai#1493 (1.0810) by 0.00394 BPB without
any test-time adaptation. Single seed 1337; compute-constrained
non-record submission — VM went down before the run log could be pushed
so it is not attached. Metrics were observed during the session.

Architecture: PR openai#1674 (@mikeapedia) base (Parcae constrained loop
injection, Gemma-style global/local attention, Gram Newton-Schulz) +
PR openai#1530 (@samacqua) varlen attention + fused MLP triton kernel +
AttnOutGate (PR openai#1667 @MarioPaerle) + SmearGate (@KellerJordan concept,
@MarioPaerle reintroduction) + new layered local sliding windows
(512 on early/loop layers, 1024 on post-loop layers, split at index 6).

KV-tying on globals dropped vs PR openai#1674. TTT scaffolding (phased
global-SGD + per-doc LoRA, from PR openai#1693 lineage) remains in the file
for experiments but is disabled by default for this submission.