Non-record: Parcae Loop Injection + Gemma-style Attention + Gram NS#1674
Conversation
Builds on PR openai#1648 (xIELU activation + per-layer QK gain). Adds four techniques for the community to explore:
1. Parcae constrained loop injection (SSM-inspired loop boundaries)
2. Gram NS for high-aspect-ratio MLP banks (α ≥ 2.5 break-even) + NS 5→4
3. Gemma-style global/local attention with sliding window
4. KV-tying on global attention layers

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Sharing the ideas and code with the community in case anyone wants to incorporate these into their own experiments. My only request is that if you test them and they aren't helpful, please share those findings so we can all learn from them.
I had also experimented with a Gemma-style attention mechanism (sliding-window local + global). It showed better convergence compared to parallel residual architectures. However, I observed that due to Flash Attention limitations, particularly with the sliding-window local attention, the implementation becomes slower. As a result, within a limited training budget in terms of steps, the model doesn't reach lower loss values. If we can optimize this and make it faster, it has the potential to outperform current SOTA architectures on leaderboards. I had to pause the experiments due to a lack of compute credits. Will push my experimentation (on 1 H100) as a pull request and tag here again.
…add RUNBOOK.md

Two corrections to .claude/skills/parameter-golf/SKILL.md:
1. DO NOT upgrade torch. The template ships torch 2.8.0. Upgrading to 2.9.1 (which the skill previously recommended) kills the torch.compile speedup: observed 2026-04-17, a pod went from a 213 ms eager scout to 220 ms compiled production, meaning compile contributed zero. New section 3b under provisioning gotchas documents the rule and the symptoms.
2. Pretokenized shards section expanded. MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf is mandatory: pretokenized shards exist for every vocab variant, so no local tokenization is needed. This is the single biggest operational enabler. Also noted we don't use a Runpod network volume; the DC pin isn't worth it, and a fresh download on each pod is standard.

RUNBOOK.md added at the slot-LoRA record dir. The self-contained plan covers:
- Pod provisioning loop across DCs
- One-shot bootstrap that obeys the no-torch-upgrade rule
- Job 1 (slot-LoRA sweep) with sweep spec and expected output
- Job 2 (PR openai#1674 friend-check) with inspection-first workflow
- Shared-pod path
- Final cleanup
Parcae seems interesting, but does it work? Is it alright if I experiment with this? I still have a few credits.
FWIW, I did a few experiments with a similar arch (pre + loop + coda), but inspired by Geiping et al. (https://arxiv.org/abs/2502.05171). It didn't work better than the current SOTA, at least at that time, because it allowed for many more params, which, if you wanted to use them, ate up a lot of the training time. It was close to SOTA, though. It also inspired my MiniDepthRecurrence technique: #1204. Worth revisiting with Parcae IMO.
Sliding-window eval only (TTT_ENABLED=0 by default) on standard SP8192. Beats the merged non-casefold SOTA, PR openai#1493 (1.0810), by 0.00394 BPB without any test-time adaptation. Single seed 1337; compute-constrained non-record submission. The VM went down before the run log could be pushed, so it is not attached; metrics were observed during the session.

Architecture: PR openai#1674 (@mikeapedia) base (Parcae constrained loop injection, Gemma-style global/local attention, Gram Newton-Schulz) + PR openai#1530 (@samacqua) varlen attention + fused MLP triton kernel + AttnOutGate (PR openai#1667 @MarioPaerle) + SmearGate (@KellerJordan concept, @MarioPaerle reintroduction) + new layered local sliding windows (512 on early/loop layers, 1024 on post-loop layers, split at index 6). KV-tying on globals dropped vs PR openai#1674. TTT scaffolding (phased global-SGD + per-doc LoRA, from the PR openai#1693 lineage) remains in the file for experiments but is disabled by default for this submission.
Summary
Non-record research submission. Builds on PR #1648 (xIELU activation + per-layer QK gain convergence). No compute credits remained to verify on the competition eval infrastructure; sharing these ideas for the community to build on.
Four techniques on top of the PR #1586 base with xIELU + QK gain:
1. Parcae Constrained Loop Injection
Inspired by Parcae, this applies an SSM-inspired boundary condition at loop re-entry points. Instead of passing the hidden state through raw between depth-recurrence iterations, it applies a learned decay + injection:
Three per-layer learnable scalars (`loop_log_A`, `loop_delta`, `loop_B`) control how much the model retains from the previous loop pass vs. re-injects from the original residual stream. The softplus on `loop_delta` guarantees A_bar < 1 (stable decay). `loop_B` is initialized at 0.1 to encourage x0 re-injection from the start. Zero throughput overhead: these are per-layer scalars, not per-channel.
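As a sketch of the boundary update described above (the parameter names match the text, but the exact parameterization is an assumption, not the PR's code), one plausible reading is A_bar = exp(-softplus(loop_delta) * exp(loop_log_A)), which is guaranteed to lie in (0, 1):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParcaeLoopBoundary(nn.Module):
    """Hypothetical sketch of the SSM-inspired loop boundary condition.

    At loop re-entry, the hidden state h is decayed by A_bar < 1 and the
    original residual stream x0 is re-injected with gain loop_B.
    """
    def __init__(self):
        super().__init__()
        self.loop_log_A = nn.Parameter(torch.zeros(()))      # per-layer scalar
        self.loop_delta = nn.Parameter(torch.zeros(()))      # decay-rate control
        self.loop_B = nn.Parameter(torch.full((), 0.1))      # x0 re-injection gain, init 0.1

    def forward(self, h, x0):
        # softplus(delta) > 0 and exp(log_A) > 0, so A_bar = exp(-..) < 1 (stable decay)
        A_bar = torch.exp(-F.softplus(self.loop_delta) * torch.exp(self.loop_log_A))
        return A_bar * h + self.loop_B * x0
```

At the zero init above, A_bar = exp(-softplus(0)) = 0.5, so the boundary starts as an even blend with a small x0 injection on top.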
We also wired up eval/TTT to support more loop iterations at eval than at training (`eval_extra_loops`), enabling additional test-time compute scaling via `set_eval_loop_indices()`. `log_parcae_converged()` prints A_bar/B_bar statistics at the end of training for convergence tracking.
2. Gram NS for High-Aspect-Ratio Banks + NS Steps 5→4
Based on Gram Newton-Schulz from Dao AI Lab. Added `gram_newton_schulz5()`, which iterates on the smaller n×n Gram matrix (in float32) instead of the full rectangular matrix, dispatched automatically based on aspect ratio.
Cost analysis (without symmetric GEMM): standard NS costs n³(10α + 5) per iteration, Gram NS costs n³(4α + 20). Break-even is at α = 2.5, so only the MLP banks (α = 4) benefit, saving ~20% on those iterations.
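The break-even arithmetic can be checked directly (`ns_cost` is a throwaway helper for this check, not a function from the PR):

```python
def ns_cost(alpha):
    """Per-iteration flop counts in units of n^3, from the analysis above."""
    standard = 10 * alpha + 5   # standard NS on the full (alpha*n) x n matrix
    gram = 4 * alpha + 20       # Gram NS on the n x n Gram matrix
    return standard, gram
```

At α = 2.5 both sides cost 30n³; at α = 4 (MLP banks) Gram NS costs 36n³ vs 45n³, i.e. 20% less.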
Gram NS includes a restart at step 2 to kill spurious negative eigenvalues.
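To make the shape of the technique concrete, here is a minimal, hedged sketch of Gram-side orthogonalization. It is not the PR's `gram_newton_schulz5()` (it uses a plain coupled Newton-Schulz for G^{-1/2}, with no quintic polynomial and no step-2 restart), but it shows why all the iteration work lands on the small n×n side:

```python
import torch

def gram_orthogonalize(X, steps=10, eps=1e-7):
    # Orthogonalize a tall matrix X (m x n, m >= n) by iterating on the
    # n x n Gram matrix in float32, as in the Gram NS idea above.
    # Illustrative sketch only: plain coupled Newton-Schulz for G^{-1/2}.
    G = (X.T @ X).float()
    n = G.shape[0]
    I = torch.eye(n, dtype=G.dtype, device=G.device)
    nu = G.norm() + eps           # Frobenius norm: eigenvalues of Y land in (0, 1]
    Y = G / nu
    Z = I.clone()
    for _ in range(steps):
        T = 0.5 * (3.0 * I - Z @ Y)   # every matmul here is n x n
        Y = Y @ T                     # Y -> (G/nu)^{1/2}
        Z = T @ Z                     # Z -> (G/nu)^{-1/2}
    return X.float() @ (Z / nu.sqrt())  # X @ G^{-1/2} has orthonormal columns
```

For the MLP banks (α = m/n = 4), only the final mapping back touches the m×n matrix; every iteration matmul is n×n, which is where the ~20% per-iteration saving comes from.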
Also reduced `muon_backend_steps` from 5 to 4 (sufficient with the current architecture).
3. Gemma-style Global/Local Attention
Inspired by Gemma 4's interleaved attention pattern:
- Global layers: partial RoPE (`rope_dims` out of `head_dim`). Partial RoPE avoids high-frequency positional noise at long range.
- Local layers: sliding window (`local_window_size=512`) with full RoPE (all dims). Full positional precision within the window.

A per-layer `window_size` is passed to both `flash_attn_varlen_func` and `flash_attn_3_func`. This is a direct throughput improvement — local layers only attend within the window instead of over the full sequence.
4. KV-Tying on Global Attention Layers
On global attention layers, K and V share the same weight matrix (the `v_bank` entry is omitted; K weights serve double duty). This saves `num_global_layers × kv_dim × model_dim` parameters (3 × 256 × 512 = 393K params).
qkv_wtensor at forward time for a fusedF.linearcall.How we differ from Gemma 4's K=V tying:
Gemma does two things to compensate for K=V tying that we don't:
We're tying K=V with a relatively mild 2:1 GQA and standard key dims. That means our shared projection has less capacity to serve both purposes. With 4 KV heads at dim=64, each head has to produce a 64-dim vector that simultaneously works for similarity matching (after RoPE + RMS norm) and content carrying (raw). That's a tighter constraint than Gemma faces.
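A minimal sketch of the tying at forward time. The weight names (`q_w`, `k_w`) are hypothetical and the PR's actual bank layout differs, but the mechanism is the one described above: the `v_bank` entry is dropped and K's weight rows are concatenated twice into the fused projection.

```python
import torch
import torch.nn.functional as F

def fused_qkv_tied(x, q_w, k_w):
    """Sketch of KV-tying on a global layer: V reuses K's weight matrix.

    q_w: (q_dim, model_dim), k_w: (kv_dim, model_dim). The three blocks are
    concatenated into one qkv_w tensor so a single fused F.linear produces
    Q, K, and V in one GEMM, as described above.
    """
    qkv_w = torch.cat([q_w, k_w, k_w], dim=0)   # no v_bank entry: K serves double duty
    qkv = F.linear(x, qkv_w)
    q, k, v = qkv.split([q_w.shape[0], k_w.shape[0], k_w.shape[0]], dim=-1)
    return q, k, v
```

With the dimensions quoted above (kv_dim = 256, model_dim = 512, 3 global layers), the omitted V banks account for 3 × 256 × 512 = 393,216 parameters.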
Additional Changes
- `.contiguous()` fix in Muon's `all_gather_into_tensor` call, for torch 2.11 compatibility
Results