PR #434 (closed)

10L XSA + LeakyReLU² + Partial RoPE (val_bpb=1.1370)

by parinzee
val_bpb: 1.1370
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.9 MB

Training Techniques

Architecture
XSA
Exclusive Self Attention on the last 4 layers; removes the self-value projection from the attention output.
parameters: {"layers":4}
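A minimal sketch of one plausible reading of XSA, in which each token is excluded from attending to its own position, so its own value vector never contributes to its output. The exact formulation in the PR may differ; everything below is an illustrative assumption.

```python
import numpy as np

def xsa_attention(q, k, v):
    """Causal attention with the diagonal masked out: token i attends only
    to j < i, so v[i] never contributes to output[i]. This is one plausible
    reading of "removes self-value projection"; the PR's exact form may differ."""
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    mask = np.tril(np.ones((T, T), dtype=bool), k=-1)  # strictly lower triangular
    scores = np.where(mask, scores, -np.inf)
    # stable softmax; the row with no valid positions (token 0) gets all-zero weights
    m = np.where(mask.any(axis=1, keepdims=True), scores.max(axis=1, keepdims=True), 0.0)
    w = np.exp(scores - m)
    denom = w.sum(axis=1, keepdims=True)
    w = np.where(denom > 0, w / np.maximum(denom, 1e-9), 0.0)
    return w @ v
```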
activation
Replaces ReLU² with LeakyReLU(0.5)².
parameters: {"negative_slope":0.5}
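Read literally, the activation squares the output of a LeakyReLU with slope 0.5. Note that squaring folds the negative branch to +0.25x²; this sketch follows the literal reading, and flags the sign-preserving alternative in a comment.

```python
import numpy as np

def leaky_relu_sq(x, negative_slope=0.5):
    """LeakyReLU(0.5) followed by squaring, read literally from the name.
    Squaring makes the negative branch positive (0.25 * x**2); if the PR
    instead preserves sign, replace `y * y` with `y * np.abs(y)`."""
    y = np.where(x > 0, x, negative_slope * x)
    return y * y
```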
Partial RoPE
Applies rotary position embeddings to only 25% of head dimensions.
parameters: {"head_dims_rotary":16,"head_dims_total":64,"fraction":0.25}
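With head_dims_total=64 and head_dims_rotary=16, only the first quarter of each head's channels is rotated. A sketch of that split; which 16 channels are rotated and the frequency base are assumptions.

```python
import numpy as np

def partial_rope(x, rotary_dims=16, base=10000.0):
    """Apply rotary position embeddings to the first `rotary_dims` channels
    of each head dimension, leaving the rest untouched. x: (T, head_dim)."""
    T, _ = x.shape
    rot, rest = x[:, :rotary_dims], x[:, rotary_dims:]
    half = rotary_dims // 2
    freqs = base ** (-np.arange(half) / half)           # per-pair frequencies
    ang = np.outer(np.arange(T), freqs)                 # (T, half) rotation angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = rot[:, :half], rot[:, half:]
    rot_out = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rot_out, rest], axis=1)
```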
tied embeddings
Uses tied input/output embeddings.
parameters: null
KV head count
Uses grouped-query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
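The 8-query/4-KV-head layout means each pair of query heads shares one K/V head. A sketch of the head-sharing bookkeeping (non-causal, single sequence; grouping by contiguous pairs is an assumption):

```python
import numpy as np

def gqa_attention(q, k, v, n_kv_heads=4):
    """Grouped-query attention: query head h uses KV head h // group.
    q: (n_heads, T, d); k, v: (n_kv_heads, T, d)."""
    n_heads, T, d = q.shape
    group = n_heads // n_kv_heads
    out = np.empty_like(q)
    for h in range(n_heads):
        g = h // group                                 # shared KV head index
        scores = q[h] @ k[g].T / np.sqrt(d)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        out[h] = w @ v[g]
    return out
```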
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"matrix_lr":0.025}
AdamW
weight_decay: 0.04
momentum: null
other_params: {"used_for":"embeddings/scalars"}
Weight Averaging
SWA
parameters: {"start_frac":0.4,"every":50}
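start_frac=0.4 with every=50 suggests checkpoints enter a running weight average from 40% of training onward, one every 50 steps. A sketch under that assumption; the exact trigger logic in the PR is not stated here.

```python
import numpy as np

def swa_steps(total_steps, start_frac=0.4, every=50):
    """Steps whose weights join the average: from 40% of training, every 50."""
    start = int(total_steps * start_frac)
    return [s for s in range(start, total_steps) if (s - start) % every == 0]

class WeightAverage:
    """Incremental mean of parameter snapshots (the standard SWA update)."""
    def __init__(self):
        self.n, self.avg = 0, None
    def update(self, params):
        self.n += 1
        if self.avg is None:
            self.avg = [np.array(p, dtype=float) for p in params]
        else:
            for a, p in zip(self.avg, params):
                a += (np.asarray(p) - a) / self.n
```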
Quantization
mixed int5/int6
bits: null
scope: MLP weights and attention weights
fp16
bits: 16
scope: tied embeddings
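A symmetric per-tensor sketch of what int5/int6 quantization could look like. The PR's actual scheme (grouping, scale storage, and which tensors get 5 vs 6 bits) is not specified in this summary.

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Symmetric per-tensor integer quantization: scale so the largest
    magnitude maps to the top of the signed range, then round."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax if w.size else 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float64) * scale
```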
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
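stride=64 suggests overlapping-window evaluation where each 64-token chunk is scored with the longest available left context inside a full-length window. A sketch of the span bookkeeping; the window length of 2048 is borrowed from the training sequence length and is an assumption.

```python
def sliding_eval_spans(n_tokens, window=2048, stride=64):
    """For each stride-sized chunk, return (ctx_start, end, score_start):
    tokens [score_start, end) are scored conditioned on [ctx_start, end)."""
    spans = []
    for score_start in range(0, n_tokens, stride):
        end = min(score_start + stride, n_tokens)
        ctx_start = max(0, end - window)               # longest left context that fits
        spans.append((ctx_start, end, score_start))
    return spans
```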
Initialization
OrthoInit
Orthogonal initialization with muP-scaled output projections.
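Orthogonal initialization is typically done via QR decomposition of a Gaussian matrix. A sketch with the muP-style scaling expressed as a gain argument; the PR's exact scaling rule for output projections is not stated here, so the gain convention is an assumption.

```python
import numpy as np

def ortho_init(shape, gain=1.0, rng=None):
    """Orthogonal init via QR of a Gaussian matrix. For a muP-scaled output
    projection one might pass e.g. gain = 1/width; that rule is an assumption."""
    rng = rng if rng is not None else np.random.default_rng()
    a = rng.standard_normal(shape)
    q, r = np.linalg.qr(a)
    q = q * np.sign(np.diag(r))  # fix column signs so the distribution is uniform
    return gain * q
```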
LR Schedule
warmdown
parameters: {"warmdown_steps":3000,"warmup_steps":20}
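warmup_steps=20 and warmdown_steps=3000 describe a trapezoidal schedule: a very short linear warmup, a flat phase, then a linear decay to zero over the final 3000 steps. A sketch; the peak LR and total step count below are placeholders.

```python
def lr_at(step, total_steps, peak_lr, warmup_steps=20, warmdown_steps=3000):
    """Trapezoidal LR: linear warmup -> constant -> linear warmdown to 0."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    if step >= total_steps - warmdown_steps:
        return peak_lr * (total_steps - step) / warmdown_steps
    return peak_lr
```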
Regularization
weight decay
parameters: {"weight_decay":0.04}
magnitude pruning
parameters: {"pruning_rate":0.08}
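pruning_rate=0.08 zeroes the smallest-magnitude 8% of weights. A per-tensor sketch; whether the PR applies the threshold per tensor or globally across the model is an assumption.

```python
import numpy as np

def magnitude_prune(w, rate=0.08):
    """Zero out the `rate` fraction of smallest-magnitude entries of a tensor.
    Ties at the threshold value are all pruned."""
    k = int(round(w.size * rate))
    if k == 0:
        return w.copy()
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]  # k-th smallest magnitude
    out = w.copy()
    out[np.abs(out) <= thresh] = 0.0
    return out
```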
Sequence Length
sequence_length
train_length: 2048
eval_length: null

Novel Contributions

  • Exclusive Self Attention (XSA) on the last 4 layers
  • LeakyReLU(0.5)² activation replacing ReLU²
  • Partial RoPE applied to 25% of head dimensions
  • Higher learning rates for matrix, scalar, and tied embedding parameters
  • Increased magnitude pruning to satisfy artifact size constraints