PR #1427

open

LeakyReLU + XSA + PartialRoPE + FA3 submission — val_bpb 1.1991

val_bpb: 1.2092
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~14.39 MB

Training Techniques

Architecture
LeakyReLU
MLP activation changed from ReLU² to LeakyReLU(0.75)².
parameters: {"negative_slope":0.75}
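A minimal sketch of the squared-LeakyReLU activation, using the negative_slope of 0.75 from the parameters above (the function name is hypothetical):

```python
def leaky_relu_squared(x: float, negative_slope: float = 0.75) -> float:
    # LeakyReLU followed by squaring: replaces relu(x)**2, which zeroes all
    # negative inputs, with (negative_slope * x)**2 on the negative side
    y = x if x > 0 else negative_slope * x
    return y * y
```

Unlike ReLU², this keeps a nonzero gradient signal for negative pre-activations.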
Partial RoPE
Rotary embedding applied to only part of each head's dimensions.
parameters: {"dimensions":16,"head_dimensions":64}
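A rough sketch of partial RoPE with the 16-of-64 split from the parameters above. The pairing convention (first half / second half of the rotated slice) and frequency base are assumptions, not taken from the submission:

```python
import numpy as np

def partial_rope(x: np.ndarray, rot_dims: int = 16, base: float = 10000.0) -> np.ndarray:
    # x: (seq_len, head_dim); rotate only the first rot_dims dims, pass the rest through
    seq_len, head_dim = x.shape
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)   # (half,) rotation frequencies
    ang = np.arange(seq_len)[:, None] * inv_freq   # (seq_len, half) angles per position
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=-1)
```

The remaining 48 dimensions per head carry no positional rotation at all.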
XSA
XSA enabled only in the deepest layers.
parameters: {"layers":4}
FlashAttention-3
Standard SDPA replaced with FlashAttention-3 for attention computation.
parameters: null
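FlashAttention-3 computes the same scaled dot-product attention as the SDPA it replaces, just as a fused, tiled, IO-aware kernel. A plain reference of the math being accelerated (this is not the FA3 API, only the equivalent computation):

```python
import numpy as np

def sdpa(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    # reference scaled dot-product attention; FlashAttention-3 produces the
    # same result (up to fp error) without materializing the score matrix
    scale = 1.0 / np.sqrt(q.shape[-1])
    scores = q @ k.swapaxes(-1, -2) * scale
    scores -= scores.max(axis=-1, keepdims=True)   # numerically stable softmax
    p = np.exp(scores)
    p /= p.sum(axis=-1, keepdims=True)
    return p @ v
```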
Quantization
mixed int6
bits: 6
scope: model weights
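A hedged sketch of the int6 weight quantization. This shows only plain symmetric per-tensor rounding; the submission's "mixed" bit allocation and GPTQ-style error compensation are not reproduced here:

```python
import numpy as np

def quantize_int6(w: np.ndarray):
    # symmetric int6: representable integer range [-32, 31]
    scale = np.abs(w).max() / 31.0
    q = np.clip(np.round(w / scale), -32, 31).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    # reconstruct approximate fp32 weights from int6 codes
    return q.astype(np.float32) * scale
```

Round-trip error is bounded by half the scale per element.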
Compression
lzma
level: null
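The artifact is LZMA-compressed; a minimal sketch using the Python stdlib. The payload is a stand-in for the packed int6 weight bytes, and preset=9 is an assumption since the config lists level as null:

```python
import lzma

payload = bytes(range(256)) * 64           # stand-in for packed int6 weight bytes
packed = lzma.compress(payload, preset=9)  # preset is assumed; config level is null
recovered = lzma.decompress(packed)        # lossless round trip
```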
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"matrix_lr":null}
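Muon's defining step is orthogonalizing each weight matrix's momentum buffer via Newton-Schulz iteration before applying it. A rough sketch: the cubic iteration is shown for clarity (real Muon implementations use a tuned quintic), and the lr/momentum values are placeholders since the config lists them as null:

```python
import numpy as np

def newton_schulz_orthogonalize(g: np.ndarray, steps: int = 10, eps: float = 1e-7) -> np.ndarray:
    # approximate the nearest orthogonal matrix (polar factor) of g
    x = g / (np.linalg.norm(g) + eps)      # spectral norm <= Frobenius norm <= 1
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x    # cubic Newton-Schulz iteration
    return x

def muon_step(w, grad, buf, lr=0.02, momentum=0.95):
    # hypothetical minimal Muon update: momentum accumulation,
    # then an orthogonalized step direction
    buf = momentum * buf + grad
    return w - lr * newton_schulz_orthogonalize(buf), buf
```

Orthogonalizing the update equalizes the step size across the matrix's singular directions.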

Novel Contributions

  • LeakyReLU(0.75)² replaces ReLU² in the MLP.
  • Partial RoPE is used with 16 of 64 head dimensions.
  • XSA is applied only to the last 4 layers.
  • FlashAttention-3 is used instead of standard SDPA.
  • GPTQ-style mixed int6 export with LZMA compression and selective pruning.