Non-record: Parcae Loop Injection + Gemma-style Attention + Gram NS #1674

Open
mikeapedia wants to merge 1 commit into openai:main from mikeapedia:nonrecord/parcae-gemma-gramns

Conversation


@mikeapedia mikeapedia commented Apr 16, 2026

Summary

Non-record research submission. Builds on PR #1648 (xIELU activation + per-layer QK gain convergence). No compute credits to verify on competition eval infrastructure — sharing these ideas for the community to build on.

Four techniques on top of the PR #1586 base with xIELU + QK gain:

1. Parcae Constrained Loop Injection

Inspired by Parcae, this applies an SSM-style boundary condition at loop re-entry points. Instead of passing the hidden state straight through between depth-recurrence iterations, it applies a learned decay + injection:

delta = softplus(loop_delta)           # guarantee delta > 0
A_bar = exp(delta * (-exp(loop_log_A)))  # A_bar ∈ (0, 1) by construction
B_bar = delta * loop_B
x = A_bar * x + B_bar * x0            # at each loop boundary

Three learnable parameters per layer (loop_log_A, loop_delta, loop_B) control how much the model retains from the previous loop pass versus how much of the original residual stream it re-injects. The softplus keeps delta positive, which in turn guarantees A_bar < 1 (stable decay). loop_B is initialized at 0.1 to encourage x0 re-injection from the start.

Zero throughput overhead — these are per-layer scalars, not per-channel.

We also wired up eval/TTT to support more loop iterations at eval time than were used in training (eval_extra_loops), enabling additional test-time compute scaling via set_eval_loop_indices(). log_parcae_converged() prints A_bar/B_bar statistics at the end of training for convergence tracking.
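For anyone who wants to drop this into their own loop, here is a minimal self-contained sketch of the boundary rule. It assumes per-layer scalar parameters and zero init for loop_log_A and loop_delta; only the 0.1 init for loop_B comes from this PR, so treat the rest as illustrative rather than the PR's exact module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParcaeLoopBoundary(nn.Module):
    """Learned decay + re-injection applied at each loop re-entry point (sketch)."""

    def __init__(self, b_init: float = 0.1):
        super().__init__()
        # Assumed zero init for loop_log_A / loop_delta; the 0.1 loop_B init is from the PR text.
        self.loop_log_A = nn.Parameter(torch.zeros(()))
        self.loop_delta = nn.Parameter(torch.zeros(()))
        self.loop_B = nn.Parameter(torch.full((), b_init))

    def forward(self, x: torch.Tensor, x0: torch.Tensor) -> torch.Tensor:
        delta = F.softplus(self.loop_delta)                      # delta > 0
        A_bar = torch.exp(-delta * torch.exp(self.loop_log_A))   # A_bar in (0, 1) by construction
        B_bar = delta * self.loop_B
        return A_bar * x + B_bar * x0                            # decay previous pass, re-inject x0
```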

2. Gram NS for High-Aspect-Ratio Banks + NS Steps 5→4

Based on Gram Newton-Schulz from Dao AI Lab. Added gram_newton_schulz5() that iterates on the smaller n×n Gram matrix (float32) instead of the full rectangular matrix. Dispatched automatically based on aspect ratio:

| Bank | Shape per layer | α = m/n | Method |
| --- | --- | --- | --- |
| q_bank, out_bank | 512×512 | 1.0 | Standard NS |
| k_bank | 256×512 | 2.0 | Standard NS |
| mlp_up_bank | 2048×512 | 4.0 | Gram NS |
| mlp_down_bank | 512×2048 | 4.0 | Gram NS |

Cost analysis (without symmetric GEMM): Standard NS costs n³(10α + 5), Gram NS costs n³(4α + 20). Break-even at α = 2.5 — only MLP banks (α = 4) benefit, saving ~20% on those iterations.
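As a quick sanity check on those counts (plugging the quoted per-step costs into the two formulas; a worked example, not code from the PR):

```python
# Per-NS-step cost in units of n^3, using the formulas quoted above.
def ns_cost(alpha: float) -> dict:
    return {"standard": 10 * alpha + 5, "gram": 4 * alpha + 20}

print(ns_cost(2.5))  # {'standard': 30.0, 'gram': 30.0}  -> break-even
print(ns_cost(4.0))  # {'standard': 45.0, 'gram': 36.0}  -> ~20% saved on the MLP banks
```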

Gram NS includes a restart at step 2 to kill spurious negative eigenvalues.

Also reduced muon_backend_steps from 5 → 4 (sufficient with current architecture).
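For readers without the diff handy, here is a hedged sketch of a Gram-side Newton-Schulz iteration. It is not the PR's gram_newton_schulz5: the quintic coefficients are the standard Muon values, and the step-2 restart is modeled simply as re-forming G from the current iterate; both are assumptions.

```python
import torch

def gram_newton_schulz_sketch(X: torch.Tensor, steps: int = 4) -> torch.Tensor:
    """Orthogonalize X (m x n, m >= n) by iterating on the n x n Gram matrix.

    Illustrative only. Transpose first (and transpose the result back) for wide
    banks such as mlp_down_bank, so the Gram matrix stays the smaller n x n one.
    """
    a, b, c = 3.4445, -4.7750, 2.0315                   # standard Muon quintic coefficients
    m, n = X.shape
    assert m >= n
    X32 = X.float()
    X32 = X32 / (X32.norm() + 1e-7)                     # keep singular values <= 1
    G = X32.T @ X32                                     # n x n Gram matrix, float32
    T = torch.eye(n, dtype=torch.float32, device=X.device)
    I = torch.eye(n, dtype=torch.float32, device=X.device)
    for k in range(steps):
        P = a * I + b * G + c * (G @ G)                 # per-step polynomial in G
        T = T @ P                                       # accumulate: X_k = X_0 @ T
        G = P @ G @ P                                   # Gram update (P is symmetric)
        if k == 1:                                      # assumed restart: re-form G from X_0 @ T
            Xk = X32 @ T                                # to suppress spurious negative eigenvalues
            G = Xk.T @ Xk
    return (X32 @ T).to(X.dtype)                        # single rectangular GEMM at the end
```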

3. Gemma-style Global/Local Attention

Inspired by Gemma 4's interleaved attention pattern:

  • Global layers (default: 4, 9, 10): Full causal attention with partial RoPE (rope_dims out of head_dim). Partial RoPE avoids high-frequency positional noise at long range.
  • Local layers (all others): Sliding window attention (local_window_size=512) with full RoPE (all dims). Full positional precision within the window.

Per-layer window_size is passed to both flash_attn_varlen_func and flash_attn_3_func. This is a direct throughput improvement: local layers only attend within the window instead of over the full sequence.
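A rough illustration of the per-layer dispatch follows. It is a sketch only: it uses the fixed-length flash_attn_func rather than the varlen/FA3 paths the PR actually wires up, GLOBAL_LAYERS and LOCAL_WINDOW simply restate the defaults quoted above, and the sliding-window off-by-one convention depends on the flash-attn version.

```python
import torch
from flash_attn import flash_attn_func  # assumes flash-attn >= 2.3 (window_size support)

GLOBAL_LAYERS = {4, 9, 10}   # default global-attention layer indices from this PR
LOCAL_WINDOW = 512           # sliding-window size on local layers

def attend(q, k, v, layer_idx: int) -> torch.Tensor:
    # q, k, v: (batch, seqlen, nheads, head_dim)
    if layer_idx in GLOBAL_LAYERS:
        window = (-1, -1)                 # full causal attention (partial RoPE applied upstream)
    else:
        window = (LOCAL_WINDOW - 1, 0)    # causal sliding window; off-by-one is version-dependent
    return flash_attn_func(q, k, v, causal=True, window_size=window)
```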

4. KV-Tying on Global Attention Layers

On global attention layers, K and V share the same weight matrix (v_bank entry omitted, K weights serve double duty). This:

  • Saves num_global_layers × kv_dim × model_dim parameters (3 × 256 × 512 = 393K params)
  • Reduces artifact size proportionally — freed bytes can be redeployed to less aggressive quantization clipping or increasing KV parameters
  • Serialization/deserialization handles the asymmetric V bank automatically

The separate Q/K/V/O bank structure (vs PR #1648's merged QO/KV banks) enables this selective tying. QKV weights are concatenated into a single qkv_w tensor at forward time for a fused F.linear call.
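A minimal sketch of the tied projection plus fused GEMM (shapes follow the numbers above; fused_qkv and its argument names are illustrative, not the PR's bank code):

```python
import torch
import torch.nn.functional as F

def fused_qkv(x: torch.Tensor, q_w: torch.Tensor, k_w: torch.Tensor,
              kv_tied: bool, v_w: torch.Tensor | None = None):
    # q_w: (512, 512), k_w: (256, 512); v_w is omitted on tied (global) layers.
    v_weight = k_w if kv_tied else v_w            # K weights serve double duty as V
    qkv_w = torch.cat([q_w, k_w, v_weight], dim=0)
    qkv = F.linear(x, qkv_w)                      # single fused GEMM at forward time
    q, k, v = qkv.split([q_w.size(0), k_w.size(0), v_weight.size(0)], dim=-1)
    return q, k, v
```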

How we differ from Gemma 4's K=V tying:

| | Gemma 4 (global layers) | Ours (global layers) |
| --- | --- | --- |
| Query heads | varies by model | 8 |
| KV heads | varies (much fewer) | 4 |
| GQA ratio | 8:1 (8 queries per KV head) | 2:1 (2 queries per KV head) |
| Key dim | doubled to compensate | standard (head_dim=64) |
| K=V tying | yes | yes |

Gemma does two things to compensate for K=V tying that we don't:

  1. Much more aggressive GQA on global layers (8:1) — fewer KV heads, but each one is more "important" and gets more capacity
  2. Doubled key dimensions — the shared K/V projection has 2× the feature space to work with, so one matrix can better serve both roles

We're tying K=V with a relatively mild 2:1 GQA and standard key dims. That means our shared projection has less capacity to serve both purposes. With 4 KV heads at dim=64, each head has to produce a 64-dim vector that simultaneously works for similarity matching (after RoPE + RMS norm) and content carrying (raw). That's a tighter constraint than Gemma faces.

Additional Changes

  • .contiguous() fix in Muon all_gather_into_tensor for torch 2.11 compatibility

Results

⚠️ Not verified on competition eval infrastructure — no eval, no TTT, no quantization roundtrip. Out of compute credits. All techniques have been tested for training correctness on 1×H100 and 8×H100.

Builds on PR openai#1648 (xIELU activation + per-layer QK gain). Adds four
techniques for the community to explore:

1. Parcae constrained loop injection (SSM-inspired loop boundaries)
2. Gram NS for high-aspect-ratio MLP banks (α≥2.5 breakeven) + NS 5→4
3. Gemma-style global/local attention with sliding window
4. KV-tying on global attention layers

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@mikeapedia (Author)

Sharing the ideas and code with the community in case anyone wants to incorporate these into their own experiments. My only request is that if you test them and they aren't helpful, please share those findings so we can all learn from them.
cc: @dexhunter @bigbag @abaybektursun @clarkkev @msisovic @samacqua


SPThole commented Apr 17, 2026

I had also experimented with a Gemma-style attention mechanism (sliding-window local + global). It showed better convergence compared to parallel residual architectures. However, I observed that due to Flash Attention limitations, particularly with the sliding-window local attention, the implementation becomes slower. As a result, within a limited training budget in terms of steps, the model doesn't reach lower loss values. If we can optimize this and make it faster, it has the potential to outperform current SOTA architectures on leaderboards. I had to pause the experiments due to a lack of compute credits. Will push my experimentation (on 1 H100) as a pull request and tag here again.

anthony-maio added a commit to anthony-maio/parameter-golf that referenced this pull request Apr 18, 2026
…add RUNBOOK.md

Two corrections to .claude/skills/parameter-golf/SKILL.md:

1. DO NOT upgrade torch. The template ships torch 2.8.0. Upgrading to 2.9.1
   (which the skill previously recommended) kills the torch.compile speedup --
   observed 2026-04-17: a pod went from a 213ms eager scout run to 220ms in
   compiled production, meaning compile contributed zero. New section 3b under
   provisioning gotchas documents the rule and the symptoms.

2. Pretokenized shards section expanded. MATCHED_FINEWEB_REPO_ID=
   kevclark/parameter-golf is mandatory: pretokenized shards for every
   vocab variant, no local tokenization needed. This is the single biggest
   operational enabler. Also noted we don't use a Runpod network volume --
   DC pin isn't worth it, fresh download each pod is standard.

RUNBOOK.md added at the slot-LoRA record dir. Self-contained plan covers:
- Pod provisioning loop across DCs
- One-shot bootstrap that obeys the no-torch-upgrade rule
- Job 1 (slot-LoRA sweep) with sweep spec and expected output
- Job 2 (PR openai#1674 friend-check) with inspection-first workflow
- Shared-pod path
- Final cleanup
@User123331

Parcae seems interesting, but does it work? Is it alright if I experiment with this? I still have a few credits.

msisovic (Contributor) commented Apr 18, 2026

FWIW, I did a few experiments with a similar arch (pre + loop + coda), but inspired by Geiping et al. (https://arxiv.org/abs/2502.05171). It didn't work better than the then-current SOTA, because it allowed for many more params, and actually using them ate up a lot of the training time. It was close to SOTA though. It also inspired my MiniDepthRecurrence technique: #1204. Worth revisiting with Parcae IMO.

mikeapedia added a commit to mikeapedia/parameter-golf-1 that referenced this pull request Apr 19, 2026
Sliding-window eval only (TTT_ENABLED=0 by default) on standard SP8192.
Beats merged non-casefold SOTA PR openai#1493 (1.0810) by 0.00394 BPB without
any test-time adaptation. Single seed 1337; compute-constrained
non-record submission — VM went down before the run log could be pushed
so it is not attached. Metrics were observed during the session.

Architecture: PR openai#1674 (@mikeapedia) base (Parcae constrained loop
injection, Gemma-style global/local attention, Gram Newton-Schulz) +
PR openai#1530 (@samacqua) varlen attention + fused MLP triton kernel +
AttnOutGate (PR openai#1667 @MarioPaerle) + SmearGate (@KellerJordan concept,
@MarioPaerle reintroduction) + new layered local sliding windows
(512 on early/loop layers, 1024 on post-loop layers, split at index 6).

KV-tying on globals dropped vs PR openai#1674. TTT scaffolding (phased
global-SGD + per-doc LoRA, from PR openai#1693 lineage) remains in the file
for experiments but is disabled by default for this submission.