Non-record: xIELU Piecewise Quadratic Activation + Per-Layer QK Gain Convergence #1648
Open
mikeapedia wants to merge 1 commit into openai:main
Conversation
Author
@dexhunter sharing an experiment that showed promise but I didn't have the compute to finish testing.
mikeapedia added a commit to mikeapedia/parameter-golf-1 that referenced this pull request on Apr 16, 2026
Builds on PR openai#1648 (xIELU activation + per-layer QK gain). Adds four techniques for the community to explore:
1. Parcae constrained loop injection (SSM-inspired loop boundaries)
2. Gram NS for high-aspect-ratio MLP banks (α≥2.5 breakeven) + NS 5→4
3. Gemma-style global/local attention with sliding window
4. KV-tying on global attention layers

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This was referenced Apr 16, 2026
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request on Apr 20, 2026
Scanned 200 PRs from 2026-04-11..20. After exclusion filters, 3 candidates beat spec 011's expected Δ: openai#1682 (GradPower Muon p=0.9), openai#1648 (xIELU + per-layer QK gain), openai#1555 (Tap-In eval cache). Full artifacts at ~/competition-pr/pr-scan-2026-04-20/. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request on Apr 20, 2026
Bundles three orthogonal training-time levers into one retrain:
- tapered Muon WD (port openai#1729, originally spec 011)
- GradPower p=0.9 (port openai#1682)
- softer QK_GAIN init 5.0 → 2.5 (port openai#1648, simplified from per-layer convergence)

Code patch at exp/training-bundle (commit 8d54854). All env-gated with no-op defaults. Supersedes spec 011, which is kept as a design-doc reference.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request on Apr 20, 2026
Four new env-gated hyperparameters, all defaulting to no-op so spec 008 is byte-identical when the vars are unset:
- WD_TAPER_START_FRAC / WD_TAPER_FINAL_MULT (port openai#1729): linear Muon WD taper from 1.0 at start_step to final_mult at h.iterations. Applied in step_fn before optimizers.step. Adam/embed WD untouched per openai#1729.
- MUON_GRAD_POWER (port openai#1682): g = sign(g) * |g|^p, applied to Muon gradients just before the momentum buffer update. Covers both the sharded (shard path) and non-sharded paths.
- QK_GAIN_INIT (existing): already present; the default is unchanged, but setting QK_GAIN_INIT=2.5 at runtime gives uniformly softer attention per openai#1648's convergence finding.
- QK_GAIN_PER_LAYER (new): comma-separated list that overrides each block's attn.q_gain after block construction. Validated to match num_layers.

Also: one startup log line echoing the four values for post-hoc verification. Spec: research/specs/012-training-bundle.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leon2k2k2k added a commit to leon2k2k2k/parameter-golf that referenced this pull request on Apr 26, 2026
044A: QK_GAIN_INIT=2.5 (down from 5.0) — a single env-var change vs the canonical baseline. Tests the direction (softer attention init) per PR openai#1648's cross-seed convergence evidence. Different lever class entirely from the loop-shape work, so it should NOT have the negative-interaction problem 041L hit.

044B: QK_GAIN_INIT=2.5 + 041I config (loop 4-5, NL=2, frac=0.25). Tests whether QK softening stacks with the strongest known loop variant. Two different lever classes (attention init + loop architecture) should combine more cleanly than two same-class levers.

Both READY, parallel-launchable on separate pods. ~$7 total. Decision matrix in 044B body for interpreting joint outcomes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
Non-record research submission. I'm out of compute credits but wanted to share these results with the community since I think they show potential. Built on top of PR #1586 by @dexhunter — two activation/attention techniques with converged per-layer coefficients, plus a modified fused Triton kernel:

- Replaces `leaky_relu(x, 0.5).square()` with a piecewise quadratic `torch.where(x > 0, ap*x² + bp*x, an*x² + bn*x)` using 4 hardcoded per-layer coefficients discovered via a convergence loop. Zero throughput overhead.
- PR #1586 applies `resid_mix` (x0 skip blending) only to the attention lane input. We apply it to both lanes so the MLP lane also gets the x0 residual signal.

Both learnable techniques use a convergence loop methodology: train with learnable per-layer scalars → harvest post-EMA values → hardcode as init → retrain with a different seed → average across seeds until stable.
xIELU Activation
The original xIELU paper uses `expm1(x)` in the negative branch, but this caused a 20% throughput drop even inside `torch.compile`. We replaced it with a pure quadratic — same gradient shape, no transcendentals, zero overhead.

Converged coefficients (averaged over 4 seeds):
Key findings:
Per-Layer QK Gain
Converged from uniform 5.0 init over 3 seeds (42, 1337, 2024):
Every layer dropped dramatically from 5.0 — the model wants much softer attention across the board (range 2.0–2.9). Later layers prefer slightly softer attention.
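For readers who want to try this direction, here is a minimal sketch of a per-layer learnable QK gain, assuming a QK-norm-style attention block where the normalized query is scaled by a single learnable scalar per layer; the module layout and the name `q_gain` are illustrative, not the PR's exact code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GainedCausalAttention(nn.Module):
    """Causal attention with one learnable QK gain scalar per layer (sketch)."""
    def __init__(self, dim: int, n_heads: int, qk_gain_init: float = 2.5):
        super().__init__()
        self.n_heads = n_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)
        # Per-layer scalar; initialize from the converged value for this layer
        # instead of the uniform 5.0.
        self.q_gain = nn.Parameter(torch.tensor(qk_gain_init))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(B, T, self.n_heads, -1).transpose(1, 2) for t in (q, k, v))
        # Normalize q/k, then scale by the learnable gain: a smaller gain
        # flattens the attention logits, i.e. softer attention.
        q = F.normalize(q, dim=-1) * self.q_gain
        k = F.normalize(k, dim=-1)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.proj(y.transpose(1, 2).reshape(B, T, C))
```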
Fused Triton Kernel Changes vs PR #1586
PR #1586's `linear_leaky_relu_square_kernel` fuses the matmul + `leaky_relu(h, 0.5).square()` into a single Triton kernel using TMA descriptors. We modified it to `linear_xielu_kernel` with the following changes.

Forward path — replaced the hardcoded activation:
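The original diff is not reproduced here; as a standalone sketch, the replacement activation looks roughly like the elementwise Triton kernel below. The real `linear_xielu_kernel` applies this in the epilogue of the fused matmul, and the coefficient values are placeholders:

```python
import triton
import triton.language as tl

@triton.jit
def xielu_activation_sketch(h_ptr, out_ptr, n_elements, ap, bp, an, bn,
                            BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements
    h = tl.load(h_ptr + offs, mask=mask)
    # Factored piecewise quadratic: select (a*h + b) per branch, then one
    # shared outer multiply by h.
    act = tl.where(h > 0, ap * h + bp, an * h + bn) * h
    tl.store(out_ptr + offs, act, mask=mask)
```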
The factored form `h * (ap*h + bp)` saves a multiply per element vs the expanded `ap*h*h + bp*h`: 1 multiply + 1 add inside the `where` plus 1 outer multiply (2 multiplies + 1 add total), vs 3 multiplies + 1 add for the expanded form.

Backward path — updated the fused gradient accordingly:
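As a plain-PyTorch reference of the gradient math (not the Triton kernel itself): each quadratic branch `a*h² + b*h` has derivative `2*a*h + b`, so the fused backward multiplies the incoming gradient by this local derivative:

```python
import torch

def xielu_backward_ref(grad_out: torch.Tensor, h: torch.Tensor,
                       ap: float, bp: float, an: float, bn: float) -> torch.Tensor:
    # d/dh [a*h^2 + b*h] = 2*a*h + b, chosen per branch on the pre-activation h.
    dact_dh = torch.where(h > 0, 2 * ap * h + bp, 2 * an * h + bn)
    return grad_out * dact_dh
```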
Additional kernel changes:

- New per-layer coefficient arguments (`ap`, `an`, `bp`, `bn`), passed from the hardcoded coefficient arrays
- `FusedLinearLeakyReLUSquareFunction` → `FusedXieluMLPFunction`, with matching arg threading

Symmetric resid_mix on Both Parallel Lanes
PR #1586 applies `resid_mix` (x0 skip-connection blending) only to the attention lane:

We apply the same mixing to both lanes:
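A minimal sketch of the symmetric version, assuming a parallel-lane block where `x` is the running residual stream, `x0` is the layer-0 input being blended back in, and the mixing weights are learnable per-layer scalars (names here are illustrative, not the PR's code):

```python
import torch

def lane_inputs(x: torch.Tensor, x0: torch.Tensor,
                mix_attn: torch.Tensor, mix_mlp: torch.Tensor):
    # torch.lerp(x, x0, w) == (1 - w) * x + w * x0. PR #1586 blends only the
    # attention-lane input; here the MLP lane gets the same treatment.
    attn_in = torch.lerp(x, x0, mix_attn)
    mlp_in = torch.lerp(x, x0, mix_mlp)
    return attn_in, mlp_in
```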
This gives the MLP lane access to the same residual signal, making the two lanes symmetric. The `resid_mix` weights are already learnable, so the model can still choose to suppress the x0 contribution if it's not helpful.

Methodology: Convergence loops
Both techniques use the same pattern:
1. Train with learnable per-layer scalars.
2. Harvest the post-EMA values.
3. Hardcode them as the new init.
4. Retrain with a different seed.
5. Average across seeds and repeat until the values stabilize.
This is general-purpose — applicable to any per-layer hyperparameter. The key insight is that per-channel learnable params are too expensive (16% throughput drop from 4×2048 softplus ops per layer), but per-layer scalars have zero measurable overhead.
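As a concrete sketch of the loop itself (the helper `train_and_harvest` stands in for a full training run that returns the post-EMA per-layer values; it is not a function in this repo):

```python
import statistics

def converge_per_layer_scalars(init_values, seeds, train_and_harvest, rounds=3):
    """Train with learnable per-layer scalars, harvest the post-EMA values,
    average them across seeds, and feed the averages back in as the next init."""
    values = list(init_values)
    for _ in range(rounds):
        harvested = [train_and_harvest(values, seed) for seed in seeds]
        values = [statistics.mean(layer_vals) for layer_vals in zip(*harvested)]
    return values
```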
Results
All results are post-EMA, pre-quantization val_bpb on 8×H100. We did not run full evaluation or TTT. The xIELU ablation was done on an earlier training base (PR #1529), so we report relative improvement rather than absolute numbers since the baselines differ:
Test plan
🤖 Generated with Claude Code