PR #1648

open

Non-record: xIELU Piecewise Quadratic Activation + Per-Layer QK Gain Convergence

by mikeapedia
val_bpb
1.0756
Architecture
Transformer

Training Techniques

Architecture
LeakyReLU
Replaced the leaky_relu(x, 0.5).square() activation with a piecewise quadratic xIELU activation using hardcoded per-layer coefficients.
parameters: {"layers":11}
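The baseline activation is leaky_relu(x, 0.5).square(), i.e. x² on the positive side and 0.25·x² on the negative side. A minimal sketch of a piecewise quadratic generalization with per-layer coefficients (the names alpha_p/alpha_n and the exact functional form are assumptions; the PR's actual hardcoded values are not reproduced here):

```python
def xielu_quadratic(x, alpha_p=1.0, alpha_n=0.25):
    """Piecewise quadratic activation: alpha_p * x^2 for x >= 0,
    alpha_n * x^2 for x < 0. The baseline leaky_relu(x, 0.5).square()
    corresponds to alpha_p=1.0, alpha_n=0.25."""
    return alpha_p * x * x if x >= 0 else alpha_n * x * x

# Hypothetical per-layer coefficient table (illustrative values only;
# the PR hardcodes converged per-layer scalars for 11 layers).
PER_LAYER_COEFFS = {layer: (1.0, 0.25) for layer in range(11)}

def activate(x, layer):
    a_p, a_n = PER_LAYER_COEFFS[layer]
    return xielu_quadratic(x, a_p, a_n)
```

With the default coefficients this reproduces the baseline exactly; the per-layer table is where the harvested coefficients would be substituted.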
Attention Modification
Applied per-layer QK gain initialization with converged softer attention values instead of the default uniform gain.
parameters: {"layers":11}
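A QK gain scales the query–key logits before the softmax, so a smaller gain yields a flatter (softer) attention distribution. A sketch, assuming the gain multiplies the scaled dot product directly (the exact placement in the PR's attention code is an assumption):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention_weights(q, k_list, gain):
    # gain scales the q.k logits; lower gain -> higher-entropy attention
    d = len(q)
    logits = [gain * sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
              for k in k_list]
    return softmax(logits)

q = [1.0, 0.0]
ks = [[1.0, 0.0], [0.0, 1.0]]
sharp = attention_weights(q, ks, gain=5.0)  # default uniform init
soft = attention_weights(q, ks, gain=2.5)   # converged softer value
```

The converged 2.0–3.0 gains concentrate probability mass less aggressively than the uniform 5.0 initialization.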
Initialization
resid mix
Applied symmetric resid_mix on both attention and MLP parallel lanes so both receive x0 residual blending.
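A sketch of the symmetric variant, assuming resid_mix is a convex blend of the current residual x with the initial stream x0, applied to the shared input of both parallel lanes (the blend formula and parameter name lam are assumptions):

```python
def resid_mix(x, x0, lam):
    # blend the current residual x with the layer-0 stream x0
    return [(1.0 - lam) * xi + lam * x0i for xi, x0i in zip(x, x0)]

def parallel_block(x, x0, lam, attn, mlp):
    # symmetric: both the attention and MLP lanes see the same
    # x0-blended input, rather than only one lane receiving it
    mixed = resid_mix(x, x0, lam)
    return [xi + a + m for xi, a, m in zip(x, attn(mixed), mlp(mixed))]
```

The asymmetric baseline would feed `mixed` to one lane and the raw `x` to the other; the change here is simply using `mixed` for both.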
Weight Averaging
EMA
parameters: null
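EMA weight averaging maintains an exponential moving average of the model weights for evaluation. A minimal sketch of the standard update (the decay value is illustrative; the PR's parameters are null):

```python
def ema_update(ema, weights, decay=0.999):
    # standard EMA step: ema <- decay * ema + (1 - decay) * weights
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]
```

Applied once per optimizer step, the averaged weights track a smoothed trajectory of the raw weights and are typically what gets evaluated for val_bpb.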

Novel Contributions

  • xIELU piecewise quadratic activation with hardcoded per-layer coefficients
  • Per-layer QK gain convergence from uniform 5.0 to softer 2.0–3.0 values
  • Symmetric resid_mix applied to both parallel lanes
  • Fused Triton kernel updated for xIELU with zero throughput overhead
  • Convergence-loop methodology for harvesting and hardcoding stable per-layer scalars
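The convergence-loop methodology above can be sketched as a fixed-point iteration: train with the current hardcoded per-layer scalars made learnable, harvest where they converge, hardcode the new values, and repeat until they stop moving. The function names and tolerance are assumptions; `train_run` stands in for a full training run returning the converged per-layer values:

```python
def convergence_loop(train_run, init_scalars, tol=1e-2, max_rounds=10):
    """Harvest-and-hardcode loop (illustrative): repeatedly retrain with
    the current hardcoded per-layer scalars and adopt the values the
    learnable copies converge to, until they are stable within tol."""
    scalars = list(init_scalars)
    for _ in range(max_rounds):
        learned = train_run(scalars)  # converged per-layer values
        if max(abs(a - b) for a, b in zip(learned, scalars)) < tol:
            return learned
        scalars = learned
    return scalars
```

In this PR the same loop structure would apply to both the xIELU coefficients and the QK gains (e.g. starting the gains at a uniform 5.0 and stopping once they settle in the 2.0–3.0 range).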