PR #1427

open

LeakyReLU + XSA + PartialRoPE + FA3 submission — val_bpb 1.1991

val_bpb: 1.2092
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~14.39 MB

Training Techniques

Architecture
LeakyReLU
MLP activation changed from ReLU² to LeakyReLU(0.75)².
parameters: {"negative_slope":0.75}
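A minimal sketch of the squared-LeakyReLU activation, using the negative_slope of 0.75 from the parameters above (the function name is hypothetical):

```python
def leaky_relu_squared(x: float, negative_slope: float = 0.75) -> float:
    # LeakyReLU followed by squaring: replaces relu(x)**2, which zeroes all
    # negative inputs, with (negative_slope * x)**2 on the negative side
    y = x if x > 0 else negative_slope * x
    return y * y
```

Unlike ReLU², this keeps a nonzero gradient signal for negative pre-activations.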
Partial RoPE
Rotary embedding applied to only part of each head's dimensions.
parameters: {"dimensions":16,"head_dimensions":64}
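A rough sketch of partial RoPE with the 16-of-64 split from the parameters above. The pairing convention (first half / second half of the rotated slice) and frequency base are assumptions, not taken from the submission:

```python
import numpy as np

def partial_rope(x: np.ndarray, rot_dims: int = 16, base: float = 10000.0) -> np.ndarray:
    # x: (seq_len, head_dim); rotate only the first rot_dims dims, pass the rest through
    seq_len, head_dim = x.shape
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)   # (half,) rotation frequencies
    ang = np.arange(seq_len)[:, None] * inv_freq   # (seq_len, half) angles per position
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=-1)
```

The remaining 48 dimensions per head carry no positional rotation at all.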
XSA
XSA enabled only in the deepest layers.
parameters: {"layers":4}
FlashAttention-3
Standard SDPA replaced with FlashAttention-3 for attention computation.
parameters: null
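FlashAttention-3 computes the same scaled dot-product attention as the SDPA it replaces, just as a fused, tiled, IO-aware kernel. A plain reference of the math being accelerated (this is not the FA3 API, only the equivalent computation):

```python
import numpy as np

def sdpa(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    # reference scaled dot-product attention; FlashAttention-3 produces the
    # same result (up to fp error) without materializing the score matrix
    scale = 1.0 / np.sqrt(q.shape[-1])
    scores = q @ k.swapaxes(-1, -2) * scale
    scores -= scores.max(axis=-1, keepdims=True)   # numerically stable softmax
    p = np.exp(scores)
    p /= p.sum(axis=-1, keepdims=True)
    return p @ v
```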
Quantization
mixed int6
bits: 6
scope: model weights
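A hedged sketch of the int6 weight quantization. This shows only plain symmetric per-tensor rounding; the submission's "mixed" bit allocation and GPTQ-style error compensation are not reproduced here:

```python
import numpy as np

def quantize_int6(w: np.ndarray):
    # symmetric int6: representable integer range [-32, 31]
    scale = np.abs(w).max() / 31.0
    q = np.clip(np.round(w / scale), -32, 31).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    # reconstruct approximate fp32 weights from int6 codes
    return q.astype(np.float32) * scale
```

Round-trip error is bounded by half the scale per element.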
Compression
lzma
level: null
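The artifact is LZMA-compressed; a minimal sketch using the Python stdlib. The payload is a stand-in for the packed int6 weight bytes, and preset=9 is an assumption since the config lists level as null:

```python
import lzma

payload = bytes(range(256)) * 64           # stand-in for packed int6 weight bytes
packed = lzma.compress(payload, preset=9)  # preset is assumed; config level is null
recovered = lzma.decompress(packed)        # lossless round trip
```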
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"matrix_lr":null}
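Muon's defining step is orthogonalizing each weight matrix's momentum buffer via Newton-Schulz iteration before applying it. A rough sketch: the cubic iteration is shown for clarity (real Muon implementations use a tuned quintic), and the lr/momentum values are placeholders since the config lists them as null:

```python
import numpy as np

def newton_schulz_orthogonalize(g: np.ndarray, steps: int = 10, eps: float = 1e-7) -> np.ndarray:
    # approximate the nearest orthogonal matrix (polar factor) of g
    x = g / (np.linalg.norm(g) + eps)      # spectral norm <= Frobenius norm <= 1
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x    # cubic Newton-Schulz iteration
    return x

def muon_step(w, grad, buf, lr=0.02, momentum=0.95):
    # hypothetical minimal Muon update: momentum accumulation,
    # then an orthogonalized step direction
    buf = momentum * buf + grad
    return w - lr * newton_schulz_orthogonalize(buf), buf
```

Orthogonalizing the update equalizes the step size across the matrix's singular directions.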

Novel Contributions

  • LeakyReLU(0.75)² replaces ReLU² in the MLP.
  • Partial RoPE is used with 16 of 64 head dimensions.
  • XSA is applied only to the last 4 layers.
  • FlashAttention-3 is used instead of standard SDPA.
  • GPTQ-style mixed int6 export with LZMA compression and selective pruning.