Non-record: xIELU Piecewise Quadratic Activation + Per-Layer QK Gain Convergence #1648
Open
mikeapedia wants to merge 1 commit into openai:main
Conversation
Author
@dexhunter sharing an experiment that showed promise but I didn't have the compute to finish testing.
mikeapedia added a commit to mikeapedia/parameter-golf-1 that referenced this pull request on Apr 16, 2026
Builds on PR openai#1648 (xIELU activation + per-layer QK gain). Adds four techniques for the community to explore:
1. Parcae constrained loop injection (SSM-inspired loop boundaries)
2. Gram NS for high-aspect-ratio MLP banks (α≥2.5 breakeven) + NS 5→4
3. Gemma-style global/local attention with sliding window
4. KV-tying on global attention layers

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This was referenced Apr 16, 2026
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request on Apr 20, 2026
Scanned 200 PRs from 2026-04-11..20. After exclusion filters, 3 candidates beat spec 011's expected Δ: openai#1682 (GradPower Muon p=0.9), openai#1648 (xIELU + per-layer QK gain), openai#1555 (Tap-In eval cache). Full artifacts at ~/competition-pr/pr-scan-2026-04-20/. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request on Apr 20, 2026
Bundles three orthogonal training-time levers into one retrain:
- tapered Muon WD (port openai#1729, originally spec 011)
- GradPower p=0.9 (port openai#1682)
- softer QK_GAIN init 5.0 → 2.5 (port openai#1648, simplified from per-layer convergence)

Code patch at exp/training-bundle (commit 8d54854). All env-gated with no-op defaults. Supersedes spec 011, which is kept as a design-doc reference.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request on Apr 20, 2026
Four new env-gated hyperparameters, all defaulting to no-op so spec 008 is byte-identical when the vars are unset:
- WD_TAPER_START_FRAC / WD_TAPER_FINAL_MULT (port openai#1729): linear Muon WD taper from 1.0 at start_step to final_mult at h.iterations. Applied in step_fn before optimizers.step. Adam/embed WD untouched per openai#1729.
- MUON_GRAD_POWER (port openai#1682): g = sign(g) * |g|^p, applied to Muon gradients just before the momentum buffer update. Covers both the sharded (shard path) and non-sharded paths.
- QK_GAIN_INIT (existing): already present; the default is unchanged, but setting QK_GAIN_INIT=2.5 at runtime gives uniformly softer attention per openai#1648's convergence finding.
- QK_GAIN_PER_LAYER (new): comma-separated list that overrides each block's attn.q_gain after block construction. Validated to match num_layers.

Also: one startup log line echoing the four values for post-hoc verification. Spec: research/specs/012-training-bundle.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request on Apr 26, 2026
044A: QK_GAIN_INIT=2.5 (down from 5.0) — a single env-var change vs the canonical baseline. Tests the direction (softer attention init) per PR openai#1648's cross-seed convergence evidence. Different lever class entirely from the loop-shape work, so it should NOT have the negative-interaction problem 041L hit.

044B: QK_GAIN_INIT=2.5 + 041I config (loop 4-5, NL=2, frac=0.25). Tests whether QK softening stacks with the strongest known loop variant. Two different lever classes (attention init + loop architecture) should combine more cleanly than two same-class levers.

Both READY, parallel-launchable on separate pods. ~$7 total. Decision matrix in 044B body for interpreting joint outcomes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
Non-record research submission. I'm out of compute credits but wanted to share these results with the community since I think they show potential. Built on top of PR #1586 by @dexhunter — two activation/attention techniques with converged per-layer coefficients, plus a modified fused Triton kernel:

- Replaces `leaky_relu(x, 0.5).square()` with a piecewise quadratic `torch.where(x > 0, ap*x² + bp*x, an*x² + bn*x)` using 4 hardcoded per-layer coefficients discovered via a convergence loop. Zero throughput overhead.
- PR #1586 applies `resid_mix` (x0 skip blending) only to the attention lane input. We apply it to both lanes so the MLP lane also gets the x0 residual signal.

Both learnable techniques use a convergence loop methodology: train with learnable per-layer scalars → harvest post-EMA values → hardcode as init → retrain with a different seed → average across seeds until stable.
xIELU Activation
The original xIELU paper uses `expm1(x)` in the negative branch, but this caused a 20% throughput drop even inside `torch.compile`. We replaced it with a pure quadratic — same gradient shape, no transcendentals, zero overhead.

Converged coefficients (averaged over 4 seeds):
Key findings:
Per-Layer QK Gain
Converged from uniform 5.0 init over 3 seeds (42, 1337, 2024):
Every layer dropped dramatically from 5.0 — the model wants much softer attention across the board (range 2.0–2.9). Later layers prefer slightly softer attention.
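For readers who want to try this direction, here is a minimal sketch of a per-layer learnable QK gain, assuming a QK-norm-style attention block where the normalized query is scaled by a single learnable scalar per layer; the module layout and the name `q_gain` are illustrative, not the PR's exact code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GainedCausalAttention(nn.Module):
    """Causal attention with one learnable QK gain scalar per layer (sketch)."""
    def __init__(self, dim: int, n_heads: int, qk_gain_init: float = 2.5):
        super().__init__()
        self.n_heads = n_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)
        # Per-layer scalar; initialize from the converged value for this layer
        # instead of the uniform 5.0.
        self.q_gain = nn.Parameter(torch.tensor(qk_gain_init))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(B, T, self.n_heads, -1).transpose(1, 2) for t in (q, k, v))
        # Normalize q/k, then scale by the learnable gain: a smaller gain
        # flattens the attention logits, i.e. softer attention.
        q = F.normalize(q, dim=-1) * self.q_gain
        k = F.normalize(k, dim=-1)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.proj(y.transpose(1, 2).reshape(B, T, C))
```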
Fused Triton Kernel Changes vs PR #1586
PR #1586's `linear_leaky_relu_square_kernel` fuses the matmul + `leaky_relu(h, 0.5).square()` into a single Triton kernel using TMA descriptors. We modified it to `linear_xielu_kernel` with the following changes.

Forward path — replaced the hardcoded activation:
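The original diff is not reproduced here; as a standalone sketch, the replacement activation looks roughly like the elementwise Triton kernel below. The real `linear_xielu_kernel` applies this in the epilogue of the fused matmul, and the coefficient values are placeholders:

```python
import triton
import triton.language as tl

@triton.jit
def xielu_activation_sketch(h_ptr, out_ptr, n_elements, ap, bp, an, bn,
                            BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements
    h = tl.load(h_ptr + offs, mask=mask)
    # Factored piecewise quadratic: select (a*h + b) per branch, then one
    # shared outer multiply by h.
    act = tl.where(h > 0, ap * h + bp, an * h + bn) * h
    tl.store(out_ptr + offs, act, mask=mask)
```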
The factored form `h * (ap*h + bp)` saves a multiply per element vs the expanded `ap*h*h + bp*h`: 1 multiply + 1 add inside the `where` plus 1 outer multiply (2 multiplies + 1 add total), vs 3 multiplies + 1 add for the expanded form.

Backward path — updated the fused gradient accordingly:
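As a plain-PyTorch reference of the gradient math (not the Triton kernel itself): each quadratic branch `a*h² + b*h` has derivative `2*a*h + b`, so the fused backward multiplies the incoming gradient by this local derivative:

```python
import torch

def xielu_backward_ref(grad_out: torch.Tensor, h: torch.Tensor,
                       ap: float, bp: float, an: float, bn: float) -> torch.Tensor:
    # d/dh [a*h^2 + b*h] = 2*a*h + b, chosen per branch on the pre-activation h.
    dact_dh = torch.where(h > 0, 2 * ap * h + bp, 2 * an * h + bn)
    return grad_out * dact_dh
```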
Additional kernel changes:

- New per-layer coefficient arguments (`ap`, `an`, `bp`, `bn`), passed from the hardcoded coefficient arrays
- `FusedLinearLeakyReLUSquareFunction` → `FusedXieluMLPFunction`, with matching arg threading

Symmetric resid_mix on Both Parallel Lanes
PR #1586 applies `resid_mix` (x0 skip-connection blending) only to the attention lane:

We apply the same mixing to both lanes:
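A minimal sketch of the symmetric version, assuming a parallel-lane block where `x` is the running residual stream, `x0` is the layer-0 input being blended back in, and the mixing weights are learnable per-layer scalars (names here are illustrative, not the PR's code):

```python
import torch

def lane_inputs(x: torch.Tensor, x0: torch.Tensor,
                mix_attn: torch.Tensor, mix_mlp: torch.Tensor):
    # torch.lerp(x, x0, w) == (1 - w) * x + w * x0. PR #1586 blends only the
    # attention-lane input; here the MLP lane gets the same treatment.
    attn_in = torch.lerp(x, x0, mix_attn)
    mlp_in = torch.lerp(x, x0, mix_mlp)
    return attn_in, mlp_in
```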
This gives the MLP lane access to the same residual signal, making the two lanes symmetric. The `resid_mix` weights are already learnable, so the model can still choose to suppress the x0 contribution if it's not helpful.

Methodology: Convergence loops
Both techniques use the same pattern:
1. Train with learnable per-layer scalars.
2. Harvest the post-EMA values.
3. Hardcode them as the new init.
4. Retrain with a different seed.
5. Average across seeds and repeat until the values stabilize.
This is general-purpose — applicable to any per-layer hyperparameter. The key insight is that per-channel learnable params are too expensive (16% throughput drop from 4×2048 softplus ops per layer), but per-layer scalars have zero measurable overhead.
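As a concrete sketch of the loop itself (the helper `train_and_harvest` stands in for a full training run that returns the post-EMA per-layer values; it is not a function in this repo):

```python
import statistics

def converge_per_layer_scalars(init_values, seeds, train_and_harvest, rounds=3):
    """Train with learnable per-layer scalars, harvest the post-EMA values,
    average them across seeds, and feed the averages back in as the next init."""
    values = list(init_values)
    for _ in range(rounds):
        harvested = [train_and_harvest(values, seed) for seed in seeds]
        values = [statistics.mean(layer_vals) for layer_vals in zip(*harvested)]
    return values
```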
Results
All results are post-EMA, pre-quantization val_bpb on 8×H100. We did not run full evaluation or TTT. The xIELU ablation was done on an earlier training base (PR #1529), so we report relative improvement rather than absolute numbers since the baselines differ:
Test plan
🤖 Generated with Claude Code