Non-record: xIELU Piecewise Quadratic Activation + Per-Layer QK Gain Convergence #1648

Open

mikeapedia wants to merge 1 commit into openai:main from mikeapedia:nonrecord/xielu-qkgain

Conversation


@mikeapedia mikeapedia commented Apr 15, 2026

Summary

Non-record research submission. I'm out of compute credits but wanted to share these results with the community since I think they show potential. Built on top of PR #1586 by @dexhunter: two activation/attention techniques with converged per-layer coefficients, a symmetric resid_mix change, and a modified fused Triton kernel:

  • xIELU activation (inspired by arXiv:2411.13010): Replaces leaky_relu(x, 0.5).square() with a piecewise quadratic torch.where(x > 0, ap*x² + bp*x, an*x² + bn*x) using 4 hardcoded per-layer coefficients discovered via a convergence loop. Zero throughput overhead.
  • Per-layer QK gain init: Converged from uniform 5.0 to per-layer values in the 2.0–3.0 range via 3-seed averaging. The model consistently prefers much softer attention than the 5.0 default.
  • Symmetric resid_mix on both parallel lanes: PR #1586 only applies resid_mix (x0 skip blending) to the attention lane input. We apply it to both lanes so the MLP lane also gets the x0 residual signal.

Both learnable techniques use a convergence loop methodology: train with learnable per-layer scalars → harvest post-EMA values → hardcode as init → retrain with a different seed → average across seeds until stable.

xIELU Activation

The original xIELU paper uses expm1(x) in the negative branch, but this caused a 20% throughput drop even inside torch.compile. We replaced it with a pure quadratic — same gradient shape, no transcendentals, zero overhead.
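
For reference, a minimal eager-mode sketch of the activation swap (the actual training run uses the fused Triton kernel described below; the function names here are illustrative only):

import torch

def xielu(x: torch.Tensor, ap: float, an: float, bp: float, bn: float) -> torch.Tensor:
    # Piecewise quadratic: ap*x^2 + bp*x for x > 0, an*x^2 + bn*x otherwise.
    # No transcendentals, so it lowers to the same kind of elementwise code
    # as the leaky_relu(x, 0.5).square() it replaces.
    return torch.where(x > 0, ap * x * x + bp * x, an * x * x + bn * x)

def leaky_relu_square(x: torch.Tensor) -> torch.Tensor:
    # Baseline activation from PR #1586.
    return torch.nn.functional.leaky_relu(x, 0.5).square()

# Sanity check: with ap=1, bp=0, an=0.25, bn=0 the two coincide,
# since (0.5*x)^2 = 0.25*x^2 on the negative side.
x = torch.randn(64)
assert torch.allclose(xielu(x, 1.0, 0.25, 0.0, 0.0), leaky_relu_square(x), atol=1e-6)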

Converged coefficients (averaged over 4 seeds):

XIELU_AP = [0.103, 0.196, 1.415, 1.196, 1.485, 1.546, 1.337, 1.727, 1.495, 0.988, 0.917]
XIELU_AN = [0.39, 0.578, 0.363, 0.491, 0.536, 0.548, 0.579, 0.983, 1.058, 0.935, 0.845]
XIELU_BP = [0.126, 0.07, 0.0, 0.0, 0.0, 0.002, 0.017, 0.067, 0.005, 0.058, 0.568]
XIELU_BN = [0.785, 0.638, 0.405, 0.377, 0.314, 0.289, 0.313, 0.571, 0.42, 0.286, 0.52]

Key findings:

  • beta_p collapsed to ~0 for layers 1–8 — the model wants pure x² on the positive side for middle layers
  • Early layers (0–1) are nearly linear — alpha_p ≈ 0.1–0.2 (very low curvature)
  • Deep layers want strong asymmetric curvature — alpha_p up to 1.73 vs alpha_n ≈ 0.5–1.0
  • Layer 10 retains a linear term — beta_p ≈ 0.57, unlike all other layers

Per-Layer QK Gain

Converged from uniform 5.0 init over 3 seeds (42, 1337, 2024):

QK_GAIN_INIT_PER_LAYER = [2.3495, 2.8818, 2.7627, 2.8148, 2.7893, 2.8762, 2.5657, 2.7206, 2.6426, 2.2737, 1.9741]

Every layer dropped dramatically from 5.0 — the model wants much softer attention across the board (range 2.0–2.9). Later layers prefer slightly softer attention.
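
A sketch of how these inits could seed a learnable per-layer gain on the queries; the module and attribute names, and the choice to scale only q, are assumptions rather than the exact code in this PR:

import torch
import torch.nn as nn

QK_GAIN_INIT_PER_LAYER = [2.3495, 2.8818, 2.7627, 2.8148, 2.7893, 2.8762,
                          2.5657, 2.7206, 2.6426, 2.2737, 1.9741]

class QKGain(nn.Module):
    # Learnable per-layer scalar applied to queries before the QK dot product.
    def __init__(self, layer_idx: int):
        super().__init__()
        self.q_gain = nn.Parameter(torch.tensor(QK_GAIN_INIT_PER_LAYER[layer_idx]))

    def forward(self, q: torch.Tensor) -> torch.Tensor:
        return q * self.q_gain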

Fused Triton Kernel Changes vs PR #1586

PR #1586's linear_leaky_relu_square_kernel fuses matmul + leaky_relu(h, 0.5).square() into a single Triton kernel using TMA descriptors. We modified it to linear_xielu_kernel with the following changes:

Forward path — replaced the hardcoded activation:

# PR #1586 original (unfactored):
aux = where(h > 0, h, 0.5*h)
aux = aux * aux                           # separate square step

# Ours (xIELU, factored):
aux = h * where(h > 0, ap*h + bp, an*h + bn)  # single fused expression

The factored form h * (ap*h + bp) saves a multiply per element vs the expanded ap*h*h + bp*h: 1 multiply + 1 add inside the where plus 1 outer multiply (2 multiplies + 1 add total), vs 3 multiplies + 1 add for the expanded form.

Backward path — updated the fused gradient accordingly:

# PR #1586 original:
grad *= where(h > 0, 2*h, 0.5*h)

# Ours (xIELU):
grad *= where(h > 0, 2*ap*h + bp, 2*an*h + bn)
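
As a cross-check that the hand-derived backward matches the factored forward, here is an eager-mode reference (matmul and all Triton specifics omitted; the class name is illustrative only):

import torch

class XieluActivationFn(torch.autograd.Function):
    # Eager reference for the kernel's activation math only (no matmul, no tiling).
    @staticmethod
    def forward(ctx, h, ap, an, bp, bn):
        ctx.save_for_backward(h)
        ctx.coeffs = (ap, an, bp, bn)
        # factored forward: h * where(h > 0, ap*h + bp, an*h + bn)
        return h * torch.where(h > 0, ap * h + bp, an * h + bn)

    @staticmethod
    def backward(ctx, grad_out):
        (h,) = ctx.saved_tensors
        ap, an, bp, bn = ctx.coeffs
        # fused gradient: d/dh [ap*h^2 + bp*h] = 2*ap*h + bp (same shape on the negative branch)
        grad_h = grad_out * torch.where(h > 0, 2 * ap * h + bp, 2 * an * h + bn)
        return grad_h, None, None, None, None

# gradcheck in double precision, sampling away from the kink at h = 0
h = torch.cat([torch.rand(8) + 0.1, -torch.rand(8) - 0.1]).double().requires_grad_()
assert torch.autograd.gradcheck(lambda t: XieluActivationFn.apply(t, 1.3, 0.5, 0.1, 0.3), (h,))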

Additional kernel changes:

  • 4 extra scalar args (ap, an, bp, bn) per layer, passed from the hardcoded coefficient arrays
  • Renamed FusedLinearLeakyReLUSquareFunction to FusedXieluMLPFunction with matching arg threading
  • No change to tile sizes, TMA descriptor layout, or persistent kernel structure

Symmetric resid_mix on Both Parallel Lanes

PR #1586 applies resid_mix (x0 skip connection blending) only to the attention lane:

# PR #1586:
attn_read = mix[0] * lane0 + mix[1] * x0   # attention gets x0 blending
mlp_read  = lane1                            # MLP reads lane1 directly

We apply the same mixing to both lanes:

# Ours:
attn_read = mix[0] * lane0 + mix[1] * x0   # attention gets x0 blending
mlp_read  = mix[0] * lane1 + mix[1] * x0   # MLP also gets x0 blending

This gives the MLP lane access to the same residual signal, making the two lanes symmetric. The resid_mix weights are already learnable, so the model can still choose to suppress the x0 contribution if it's not helpful.
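
A minimal sketch of the symmetric version (lane and variable names follow the snippets above; the initial mix values are an assumption):

import torch
import torch.nn as nn

class ResidMix(nn.Module):
    # Learnable 2-way blend of a lane with the x0 skip, shared by both parallel lanes.
    def __init__(self):
        super().__init__()
        self.mix = nn.Parameter(torch.tensor([1.0, 0.0]))  # pass-through init (assumed)

    def forward(self, lane0, lane1, x0):
        attn_read = self.mix[0] * lane0 + self.mix[1] * x0
        mlp_read  = self.mix[0] * lane1 + self.mix[1] * x0  # symmetric: MLP lane also sees x0
        return attn_read, mlp_read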

Methodology: Convergence loops

Both techniques use the same pattern:

  1. Add learnable per-layer scalars (zero throughput impact)
  2. Train and harvest post-EMA values
  3. Hardcode as init, retrain with different seed
  4. Average across seeds until max delta < 0.2
  5. Lock coefficients for zero-overhead deployment

This is general-purpose — applicable to any per-layer hyperparameter. The key insight is that per-channel learnable params are too expensive (16% throughput drop from 4×2048 softplus ops per layer), but per-layer scalars have zero measurable overhead.
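
A schematic of the loop itself (harvest_run stands in for a full training run that returns the post-EMA per-layer scalars; it and the stopping tolerance are placeholders):

import numpy as np

def converge(harvest_run, init, seeds, tol=0.2):
    # Convergence loop: hardcode the current values as init, retrain on several seeds,
    # harvest the post-EMA per-layer scalars, average, and repeat until stable.
    current = np.asarray(init, dtype=float)
    while True:
        harvested = np.stack([harvest_run(current, seed) for seed in seeds])  # (n_seeds, n_layers)
        averaged = harvested.mean(axis=0)
        if np.abs(averaged - current).max() < tol:
            return averaged  # stable: lock as hardcoded init for zero-overhead deployment
        current = averaged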

Results

All results are post-EMA, pre-quantization val_bpb on 8×H100. We did not run full evaluation or TTT. The xIELU ablation was done on an earlier training base (PR #1529), so we report relative improvement rather than absolute numbers since the baselines differ:

Technique                                   | Δ val_bpb | Notes
xIELU activation (hardcoded coefficients)   | −0.0029   | vs same-base baseline; zero throughput overhead
Per-layer q_gain init (converged from 5.0)  | −0.0009   | vs uniform 5.0 init; val_bpb stabilized at 1.0756–1.0765 across 3 seeds

⚠️ Not verified on competition eval infrastructure — no eval, no TTT, no quantization roundtrip. I ran out of compute credits. Sharing for the community to build on.

Test plan

  • Syntax check passes
  • Training runs complete without errors on 1×H100 and 8×H100
  • Zero throughput overhead vs baseline (hardcoded coefficients)
  • xIELU convergence stable across 4 seeds (max delta < 0.25)
  • QK gain convergence stable across 3 seeds (all values in 2.0–3.1 range)
  • Full competition eval: not run (no compute credits; someone else please verify!)

🤖 Generated with Claude Code

@mikeapedia (Author)

@dexhunter sharing an experiment that showed promise but I didn't have the compute to finish testing.

mikeapedia added a commit to mikeapedia/parameter-golf-1 that referenced this pull request Apr 16, 2026
Builds on PR openai#1648 (xIELU activation + per-layer QK gain). Adds four
techniques for the community to explore:

1. Parcae constrained loop injection (SSM-inspired loop boundaries)
2. Gram NS for high-aspect-ratio MLP banks (α≥2.5 breakeven) + NS 5→4
3. Gemma-style global/local attention with sliding window
4. KV-tying on global attention layers

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
Scanned 200 PRs from 2026-04-11..20. After exclusion filters, 3 candidates
beat spec 011's expected Δ: openai#1682 (GradPower Muon p=0.9), openai#1648 (xIELU +
per-layer QK gain), openai#1555 (Tap-In eval cache). Full artifacts at
~/competition-pr/pr-scan-2026-04-20/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
Bundles three orthogonal training-time levers into one retrain:
- tapered Muon WD (port openai#1729, originally spec 011)
- GradPower p=0.9 (port openai#1682)
- softer QK_GAIN init 5.0 → 2.5 (port openai#1648, simplified from per-layer
  convergence)

Code patch at exp/training-bundle (commit 8d54854). All env-gated with
no-op defaults.

Supersedes spec 011 which is kept as a design-doc reference.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 20, 2026
Four new env-gated hyperparameters, all default to no-op so spec 008 is
byte-identical when the vars are unset:

- WD_TAPER_START_FRAC / WD_TAPER_FINAL_MULT (port openai#1729): linear Muon WD
  taper from 1.0 at start_step to final_mult at h.iterations. Applied in
  step_fn before optimizers.step. Adam/embed WD untouched per openai#1729.
- MUON_GRAD_POWER (port openai#1682): g = sign(g) * |g|^p, applied to Muon
  gradients just before the momentum buffer update. Covers both sharded
  (shard path) and non-sharded paths.
- QK_GAIN_INIT (existing): already present, lowering default not changed;
  setting QK_GAIN_INIT=2.5 at runtime gives uniform softer attention per
  openai#1648's convergence finding.
- QK_GAIN_PER_LAYER (new): comma-sep list, overrides each block's
  attn.q_gain after block construction. Validated to match num_layers.

Also: one startup log line echoing the four values for post-hoc verification.

Spec: research/specs/012-training-bundle.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request Apr 26, 2026
044A: QK_GAIN_INIT=2.5 (down from 5.0) — single env-var change vs canonical
baseline. Tests the direction (softer attention init) per PR openai#1648's cross-
seed convergence evidence. Different lever class entirely from loop-shape
work, so should NOT have the negative-interaction problem 041L hit.

044B: QK_GAIN_INIT=2.5 + 041I config (loop 4-5, NL=2, frac=0.25). Tests
whether QK softening stacks with the strongest known loop variant. Two
different lever classes (attention init + loop architecture) should
combine more cleanly than two same-class levers.

Both READY, parallel-launchable on separate pods. ~$7 total. Decision
matrix in 044B body for interpreting joint outcomes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>