PR #1648
Non-record: xIELU Piecewise Quadratic Activation + Per-Layer QK Gain Convergence
by mikeapedia
val_bpb
1.0756
Architecture
Transformer
Optimizer
—
Artifact Size
—
Training Techniques
Architecture
LeakyReLU
Replaced leaky_relu(x, 0.5).square() with a piecewise quadratic xIELU activation using hardcoded per-layer coefficients.
parameters: {"layers":11}
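A minimal sketch of the shape of this change: the baseline `leaky_relu(x, 0.5).square()` equals x² on the positive side and 0.25·x² on the negative side, so a piecewise quadratic xIELU generalizes those two coefficients into free per-layer values. The coefficient names and defaults below (`alpha_pos`, `alpha_neg`) are illustrative, not the PR's hardcoded constants.

```python
import numpy as np

def leaky_relu_sq(x, slope=0.5):
    # baseline: leaky_relu(x, 0.5)**2 -> x^2 for x > 0, 0.25 * x^2 for x <= 0
    return np.where(x > 0, x, slope * x) ** 2

def xielu_quadratic(x, alpha_pos=1.0, alpha_neg=0.25):
    # piecewise quadratic xIELU sketch: independent quadratic coefficients
    # on each side of zero (values here chosen to match the baseline;
    # the PR hardcodes different converged values per layer)
    return np.where(x > 0, alpha_pos * x * x, alpha_neg * x * x)

x = np.linspace(-2.0, 2.0, 9)
# with alpha_pos=1.0, alpha_neg=0.25 the sketch reproduces the baseline exactly
assert np.allclose(xielu_quadratic(x), leaky_relu_sq(x))
```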
attention modification
Initialized per-layer QK gains to converged, softer attention values instead of the default uniform gain.
parameters: {"layers":11}
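A sketch of why gain size matters, assuming a nanogpt-style setup where queries and keys are normalized and a gain scales the attention logits (the exact placement of the gain is an assumption here). A larger gain sharpens the softmax; the converged 2.0–3.0 values produce a "softer", flatter attention distribution than the uniform 5.0 init.

```python
import numpy as np

def rms_norm(v, eps=1e-8):
    # RMS-normalize along the last axis
    return v / np.sqrt(np.mean(v * v, axis=-1, keepdims=True) + eps)

def attn_weights(q, k, gain):
    # gain scales the normalized q/k dot products; larger gain -> sharper
    # softmax, smaller ("softer") gain -> flatter attention distribution
    logits = gain * rms_norm(q) @ rms_norm(k).T
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
q, k = rng.normal(size=(4, 16)), rng.normal(size=(4, 16))
sharp = attn_weights(q, k, gain=5.0)  # old uniform init
soft = attn_weights(q, k, gain=2.5)   # softer converged per-layer value
assert sharp.max() > soft.max()       # softer gain spreads attention mass
```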
Initialization
resid mix
Applied symmetric resid_mix on both attention and MLP parallel lanes so both receive x0 residual blending.
Weight Averaging
EMA
parameters: null
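The card lists EMA weight averaging with no parameters; a generic sketch of the standard update, with the decay value assumed rather than taken from the PR:

```python
import numpy as np

def ema_update(ema_params, params, decay=0.999):
    # exponential moving average of weights: ema <- decay*ema + (1-decay)*w
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]

w = [np.zeros(4)]
ema = [np.zeros(4)]
for _ in range(3):
    w = [p + 1.0 for p in w]           # stand-in for optimizer steps
    ema = ema_update(ema, w, decay=0.9)
# after steps w=1,2,3 with decay 0.9: 0.1, then 0.29, then 0.561
assert np.allclose(ema[0], 0.561)
```

At evaluation time the EMA copy of the weights is used in place of the raw weights.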
Novel Contributions
- xIELU piecewise quadratic activation with hardcoded per-layer coefficients
- Per-layer QK gain convergence from uniform 5.0 to softer 2.0–3.0 values
- Symmetric resid_mix applied to both parallel lanes
- Fused Triton kernel updated for xIELU with zero throughput overhead
- Convergence-loop methodology for harvesting and hardcoding stable per-layer scalars
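The convergence-loop methodology above presumably trains these scalars as learnable parameters, reads off their converged values, and hardcodes them for the final run. A toy illustration with a hypothetical quadratic loss, not the PR's actual training procedure:

```python
def harvest_scalar(target=2.5, lr=0.1, steps=200, init=5.0):
    # toy convergence loop: descend a learnable scalar on a hypothetical
    # quadratic loss, then round the converged value for hardcoding
    g = init
    for _ in range(steps):
        grad = 2.0 * (g - target)  # d/dg of (g - target)^2
        g -= lr * grad
    return round(g, 2)             # harvested value to hardcode

hardcoded_gain = harvest_scalar()  # starts at uniform 5.0, converges near 2.5
assert abs(hardcoded_gain - 2.5) < 1e-2
```

Hardcoding the harvested scalars removes the per-scalar parameters and their optimizer state from the final training run while keeping the converged behavior.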