PR #1648

open

Non-record: xIELU Piecewise Quadratic Activation + Per-Layer QK Gain Convergence

by mikeapedia
val_bpb
1.0756
Architecture
Transformer

Training Techniques

Architecture
LeakyReLU
Replaced the leaky_relu(x, 0.5).square() activation with a piecewise quadratic xIELU activation using hardcoded per-layer coefficients.
parameters: {"layers":11}
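The baseline activation is leaky_relu(x, 0.5).square(), i.e. x² on the positive side and 0.25·x² on the negative side. A minimal sketch of a piecewise quadratic generalization with per-layer coefficients (the names alpha_p/alpha_n and the exact functional form are assumptions; the PR's actual hardcoded values are not reproduced here):

```python
def xielu_quadratic(x, alpha_p=1.0, alpha_n=0.25):
    """Piecewise quadratic activation: alpha_p * x^2 for x >= 0,
    alpha_n * x^2 for x < 0. The baseline leaky_relu(x, 0.5).square()
    corresponds to alpha_p=1.0, alpha_n=0.25."""
    return alpha_p * x * x if x >= 0 else alpha_n * x * x

# Hypothetical per-layer coefficient table (illustrative values only;
# the PR hardcodes converged per-layer scalars for 11 layers).
PER_LAYER_COEFFS = {layer: (1.0, 0.25) for layer in range(11)}

def activate(x, layer):
    a_p, a_n = PER_LAYER_COEFFS[layer]
    return xielu_quadratic(x, a_p, a_n)
```

With the default coefficients this reproduces the baseline exactly; the per-layer table is where the harvested coefficients would be substituted.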
Attention Modification
Applied per-layer QK gain initialization with converged softer attention values instead of the default uniform gain.
parameters: {"layers":11}
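A QK gain scales the query–key logits before the softmax, so a smaller gain yields a flatter (softer) attention distribution. A sketch, assuming the gain multiplies the scaled dot product directly (the exact placement in the PR's attention code is an assumption):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention_weights(q, k_list, gain):
    # gain scales the q.k logits; lower gain -> higher-entropy attention
    d = len(q)
    logits = [gain * sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
              for k in k_list]
    return softmax(logits)

q = [1.0, 0.0]
ks = [[1.0, 0.0], [0.0, 1.0]]
sharp = attention_weights(q, ks, gain=5.0)  # default uniform init
soft = attention_weights(q, ks, gain=2.5)   # converged softer value
```

The converged 2.0–3.0 gains concentrate probability mass less aggressively than the uniform 5.0 initialization.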
Initialization
resid mix
Applied symmetric resid_mix on both attention and MLP parallel lanes so both receive x0 residual blending.
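A sketch of the symmetric variant, assuming resid_mix is a convex blend of the current residual x with the initial stream x0, applied to the shared input of both parallel lanes (the blend formula and parameter name lam are assumptions):

```python
def resid_mix(x, x0, lam):
    # blend the current residual x with the layer-0 stream x0
    return [(1.0 - lam) * xi + lam * x0i for xi, x0i in zip(x, x0)]

def parallel_block(x, x0, lam, attn, mlp):
    # symmetric: both the attention and MLP lanes see the same
    # x0-blended input, rather than only one lane receiving it
    mixed = resid_mix(x, x0, lam)
    return [xi + a + m for xi, a, m in zip(x, attn(mixed), mlp(mixed))]
```

The asymmetric baseline would feed `mixed` to one lane and the raw `x` to the other; the change here is simply using `mixed` for both.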
Weight Averaging
EMA
parameters: null
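EMA weight averaging maintains an exponential moving average of the model weights for evaluation. A minimal sketch of the standard update (the decay value is illustrative; the PR's parameters are null):

```python
def ema_update(ema, weights, decay=0.999):
    # standard EMA step: ema <- decay * ema + (1 - decay) * weights
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]
```

Applied once per optimizer step, the averaged weights track a smoothed trajectory of the raw weights and are typically what gets evaluated for val_bpb.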

Novel Contributions

  • xIELU piecewise quadratic activation with hardcoded per-layer coefficients
  • Per-layer QK gain convergence from uniform 5.0 to softer 2.0–3.0 values
  • Symmetric resid_mix applied to both parallel lanes
  • Fused Triton kernel updated for xIELU with zero throughput overhead
  • Convergence-loop methodology for harvesting and hardcoding stable per-layer scalars
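The convergence-loop methodology above can be sketched as a fixed-point iteration: train with the current hardcoded per-layer scalars made learnable, harvest where they converge, hardcode the new values, and repeat until they stop moving. The function names and tolerance are assumptions; `train_run` stands in for a full training run returning the converged per-layer values:

```python
def convergence_loop(train_run, init_scalars, tol=1e-2, max_rounds=10):
    """Harvest-and-hardcode loop (illustrative): repeatedly retrain with
    the current hardcoded per-layer scalars and adopt the values the
    learnable copies converge to, until they are stable within tol."""
    scalars = list(init_scalars)
    for _ in range(max_rounds):
        learned = train_run(scalars)  # converged per-layer values
        if max(abs(a - b) for a, b in zip(learned, scalars)) < tol:
            return learned
        scalars = learned
    return scalars
```

In this PR the same loop structure would apply to both the xIELU coefficients and the QK gains (e.g. starting the gains at a uniform 5.0 and stopping once they settle in the 2.0–3.0 range).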