PR #434 (closed)

10L XSA + LeakyReLU² + Partial RoPE (val_bpb=1.1370)

by parinzee
val_bpb: 1.1370
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.9 MB

Training Techniques

Architecture
XSA
Exclusive Self Attention on the last 4 layers; removes the self-value projection from the attention output.
parameters: {"layers":4}
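A minimal sketch of one plausible reading of XSA, in which each token is excluded from attending to its own position, so its own value vector never contributes to its output. The exact formulation in the PR may differ; everything below is an illustrative assumption.

```python
import numpy as np

def xsa_attention(q, k, v):
    """Causal attention with the diagonal masked out: token i attends only
    to j < i, so v[i] never contributes to output[i]. This is one plausible
    reading of "removes self-value projection"; the PR's exact form may differ."""
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    mask = np.tril(np.ones((T, T), dtype=bool), k=-1)  # strictly lower triangular
    scores = np.where(mask, scores, -np.inf)
    # stable softmax; the row with no valid positions (token 0) gets all-zero weights
    m = np.where(mask.any(axis=1, keepdims=True), scores.max(axis=1, keepdims=True), 0.0)
    w = np.exp(scores - m)
    denom = w.sum(axis=1, keepdims=True)
    w = np.where(denom > 0, w / np.maximum(denom, 1e-9), 0.0)
    return w @ v
```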
activation
Replaces ReLU² with LeakyReLU(0.5)².
parameters: {"negative_slope":0.5}
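Read literally, the activation squares the output of a LeakyReLU with slope 0.5. Note that squaring folds the negative branch to +0.25x²; this sketch follows the literal reading, and flags the sign-preserving alternative in a comment.

```python
import numpy as np

def leaky_relu_sq(x, negative_slope=0.5):
    """LeakyReLU(0.5) followed by squaring, read literally from the name.
    Squaring makes the negative branch positive (0.25 * x**2); if the PR
    instead preserves sign, replace `y * y` with `y * np.abs(y)`."""
    y = np.where(x > 0, x, negative_slope * x)
    return y * y
```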
Partial RoPE
Applies rotary position embeddings to only 25% of head dimensions.
parameters: {"head_dims_rotary":16,"head_dims_total":64,"fraction":0.25}
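With head_dims_total=64 and head_dims_rotary=16, only the first quarter of each head's channels is rotated. A sketch of that split; which 16 channels are rotated and the frequency base are assumptions.

```python
import numpy as np

def partial_rope(x, rotary_dims=16, base=10000.0):
    """Apply rotary position embeddings to the first `rotary_dims` channels
    of each head dimension, leaving the rest untouched. x: (T, head_dim)."""
    T, _ = x.shape
    rot, rest = x[:, :rotary_dims], x[:, rotary_dims:]
    half = rotary_dims // 2
    freqs = base ** (-np.arange(half) / half)           # per-pair frequencies
    ang = np.outer(np.arange(T), freqs)                 # (T, half) rotation angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = rot[:, :half], rot[:, half:]
    rot_out = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rot_out, rest], axis=1)
```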
tied embeddings
Uses tied input/output embeddings.
parameters: null
KV head count
Uses grouped-query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
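The 8-query/4-KV-head layout means each pair of query heads shares one K/V head. A sketch of the head-sharing bookkeeping (non-causal, single sequence; grouping by contiguous pairs is an assumption):

```python
import numpy as np

def gqa_attention(q, k, v, n_kv_heads=4):
    """Grouped-query attention: query head h uses KV head h // group.
    q: (n_heads, T, d); k, v: (n_kv_heads, T, d)."""
    n_heads, T, d = q.shape
    group = n_heads // n_kv_heads
    out = np.empty_like(q)
    for h in range(n_heads):
        g = h // group                                 # shared KV head index
        scores = q[h] @ k[g].T / np.sqrt(d)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        out[h] = w @ v[g]
    return out
```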
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"matrix_lr":0.025}
AdamW
weight_decay: 0.04
momentum: null
other_params: {"used_for":"embeddings/scalars"}
Weight Averaging
SWA
parameters: {"start_frac":0.4,"every":50}
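start_frac=0.4 with every=50 suggests checkpoints enter a running weight average from 40% of training onward, one every 50 steps. A sketch under that assumption; the exact trigger logic in the PR is not stated here.

```python
import numpy as np

def swa_steps(total_steps, start_frac=0.4, every=50):
    """Steps whose weights join the average: from 40% of training, every 50."""
    start = int(total_steps * start_frac)
    return [s for s in range(start, total_steps) if (s - start) % every == 0]

class WeightAverage:
    """Incremental mean of parameter snapshots (the standard SWA update)."""
    def __init__(self):
        self.n, self.avg = 0, None
    def update(self, params):
        self.n += 1
        if self.avg is None:
            self.avg = [np.array(p, dtype=float) for p in params]
        else:
            for a, p in zip(self.avg, params):
                a += (np.asarray(p) - a) / self.n
```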
Quantization
mixed int5/int6
bits: null
scope: MLP weights and attention weights
fp16
bits: 16
scope: tied embeddings
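A symmetric per-tensor sketch of what int5/int6 quantization could look like. The PR's actual scheme (grouping, scale storage, and which tensors get 5 vs 6 bits) is not specified in this summary.

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Symmetric per-tensor integer quantization: scale so the largest
    magnitude maps to the top of the signed range, then round."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax if w.size else 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float64) * scale
```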
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
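stride=64 suggests overlapping-window evaluation where each 64-token chunk is scored with the longest available left context inside a full-length window. A sketch of the span bookkeeping; the window length of 2048 is borrowed from the training sequence length and is an assumption.

```python
def sliding_eval_spans(n_tokens, window=2048, stride=64):
    """For each stride-sized chunk, return (ctx_start, end, score_start):
    tokens [score_start, end) are scored conditioned on [ctx_start, end)."""
    spans = []
    for score_start in range(0, n_tokens, stride):
        end = min(score_start + stride, n_tokens)
        ctx_start = max(0, end - window)               # longest left context that fits
        spans.append((ctx_start, end, score_start))
    return spans
```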
Initialization
OrthoInit
Orthogonal initialization with muP-scaled output projections.
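Orthogonal initialization is typically done via QR decomposition of a Gaussian matrix. A sketch with the muP-style scaling expressed as a gain argument; the PR's exact scaling rule for output projections is not stated here, so the gain convention is an assumption.

```python
import numpy as np

def ortho_init(shape, gain=1.0, rng=None):
    """Orthogonal init via QR of a Gaussian matrix. For a muP-scaled output
    projection one might pass e.g. gain = 1/width; that rule is an assumption."""
    rng = rng if rng is not None else np.random.default_rng()
    a = rng.standard_normal(shape)
    q, r = np.linalg.qr(a)
    q = q * np.sign(np.diag(r))  # fix column signs so the distribution is uniform
    return gain * q
```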
LR Schedule
warmdown
parameters: {"warmdown_steps":3000,"warmup_steps":20}
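warmup_steps=20 and warmdown_steps=3000 describe a trapezoidal schedule: a very short linear warmup, a flat phase, then a linear decay to zero over the final 3000 steps. A sketch; the peak LR and total step count below are placeholders.

```python
def lr_at(step, total_steps, peak_lr, warmup_steps=20, warmdown_steps=3000):
    """Trapezoidal LR: linear warmup -> constant -> linear warmdown to 0."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    if step >= total_steps - warmdown_steps:
        return peak_lr * (total_steps - step) / warmdown_steps
    return peak_lr
```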
Regularization
weight decay
parameters: {"weight_decay":0.04}
magnitude pruning
parameters: {"pruning_rate":0.08}
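pruning_rate=0.08 zeroes the smallest-magnitude 8% of weights. A per-tensor sketch; whether the PR applies the threshold per tensor or globally across the model is an assumption.

```python
import numpy as np

def magnitude_prune(w, rate=0.08):
    """Zero out the `rate` fraction of smallest-magnitude entries of a tensor.
    Ties at the threshold value are all pruned."""
    k = int(round(w.size * rate))
    if k == 0:
        return w.copy()
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]  # k-th smallest magnitude
    out = w.copy()
    out[np.abs(out) <= thresh] = 0.0
    return out
```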
Sequence Length
sequence_length
train_length: 2048
eval_length: null

Novel Contributions

  • Exclusive Self Attention (XSA) on the last 4 layers
  • LeakyReLU(0.5)² activation replacing ReLU²
  • Partial RoPE applied to 25% of head dimensions
  • Higher learning rates for matrix, scalar, and tied embedding parameters
  • Increased magnitude pruning to satisfy artifact size constraints