PR #715 (open)

Record: XSA-all + LeakyReLU² + VR + GA + 7-gram cache (val_bpb=1.0337)

by Asukabot0
val_bpb: 1.0337
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.99 MB

Training Techniques

Quantization
int6
bits: 6
scope: all
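A minimal sketch of how int6 quantization of the artifact might work. The record only states 6 bits applied to all weights; the symmetric per-tensor scaling below is an assumption, not the PR's actual scheme.

```python
import numpy as np

def quantize_int6(w):
    # Symmetric quantization: map floats onto the signed int6 grid.
    # Using [-31, 31] keeps the scale symmetric (hypothetical choice).
    scale = np.max(np.abs(w)) / 31.0
    q = np.clip(np.round(w / scale), -32, 31).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover approximate float weights for inference/eval.
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.2, 0.03, 0.9], dtype=np.float32)
q, s = quantize_int6(w)
w_hat = dequantize(q, s)
```

The int6 values would then be bit-packed and passed through zstd (listed under Compression below) to reach the ~15.99 MB artifact size.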
Architecture
XSA
Exclusive Self-Attention applied to all 11 layers
parameters: {"layers":11}
LeakyReLU
LeakyReLU(0.5)^2 activation used in place of ReLU^2 to preserve negative gradient flow
parameters: {"negative_slope":0.5,"squared":true}
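As a sketch, the activation composes LeakyReLU with a square. The square makes all outputs non-negative, but unlike ReLU², the gradient for negative inputs is nonzero (d/dx of (0.5x)² is 0.5x), which is the "negative gradient flow" the description refers to.

```python
import numpy as np

def leaky_relu_squared(x, negative_slope=0.5):
    # LeakyReLU(0.5) followed by squaring. For x < 0 the output is
    # (0.5*x)^2, so the gradient path stays alive, unlike ReLU^2
    # which zeroes both value and gradient for negative inputs.
    y = np.where(x > 0, x, negative_slope * x)
    return y * y
```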
Value Residual
Layer 0 value output is mixed into subsequent layers via learned sigmoid gates
parameters: null
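A sketch of the value-residual mixing described above, assuming a per-layer scalar gate that forms a convex combination of the current layer's value projection and layer 0's (the exact parameterization is not given in the record):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mix_value_residual(v_layer, v0, gate_logit):
    # Learned sigmoid gate blends this layer's value output with the
    # value output cached from layer 0 (per-layer scalar gate assumed).
    g = sigmoid(gate_logit)
    return g * v_layer + (1.0 - g) * v0
```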
Gated Attention
Per-head sigmoid gates on attention output
parameters: null
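The per-head gating can be sketched as one learned scalar gate per attention head, applied to that head's output before the output projection (whether the gates are scalar or per-position is an assumption):

```python
import numpy as np

def gate_attention_output(attn_out, head_gate_logits):
    # attn_out: (heads, seq, head_dim); one learned gate logit per head.
    # Each head's output is scaled by sigmoid(gate) before the output
    # projection, letting the model softly switch heads on or off.
    g = 1.0 / (1.0 + np.exp(-head_gate_logits))  # (heads,)
    return attn_out * g[:, None, None]
```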
MLP3x
Transformer MLP uses 3x expansion
parameters: {"multiplier":3}
Partial RoPE
Rotary positional embeddings applied to a subset of dimensions
parameters: {"dimensions":"16/64"}
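Per the "16/64" parameter, only the first 16 of 64 head dimensions are rotated; the rest pass through untouched. A sketch (the choice of which 16 dimensions, and the frequency base, are assumptions):

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    # x: (seq, head_dim). Apply rotary embeddings to the first
    # `rot_dims` dimensions only; the remaining dims carry no
    # positional rotation.
    seq, _ = x.shape
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)
    ang = np.arange(seq)[:, None] * inv_freq[None, :]
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=1)
```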
BigramHash
BigramHash feature with 4096 buckets
parameters: {"buckets":4096}
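The idea is to hash each consecutive token pair into one of 4096 buckets, each indexing an extra learned embedding that supplements the unigram token embedding. The hash function below is purely illustrative; the record does not specify one.

```python
def bigram_hash(prev_tok, tok, buckets=4096):
    # Map a (previous token, current token) pair to a bucket id.
    # The multiplier/xor mix is a hypothetical hash, not the PR's.
    return ((prev_tok * 1000003) ^ tok) % buckets
```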
SmearGate
SmearGate component used in the architecture
parameters: null
U-Net skip connections
U-Net style skip connections added to the transformer
parameters: null
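A sketch of U-Net-style skips over the 11-layer stack: outputs of the first-half layers are saved and added to the inputs of the mirrored second-half layers, with the middle layer unpaired (the pairing and the use of a plain add, rather than learned skip weights, are assumptions):

```python
def forward_with_unet_skips(x, layers):
    # layers: list of callables (the transformer blocks).
    # First half pushes activations; second half pops the mirrored
    # skip and adds it, U-Net style. With 11 layers, layer i pairs
    # with layer 10 - i and layer 5 gets no skip.
    n = len(layers)
    half = n // 2
    saved = []
    for i, layer in enumerate(layers):
        if i >= n - half:
            x = x + saved.pop()   # mirrored skip from the first half
        x = layer(x)
        if i < half:
            saved.append(x)       # stash for the mirrored layer
    return x
```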
GQA
Grouped-query attention with 8 attention heads and 4 KV heads
parameters: {"heads":8,"kv_heads":4}
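With 8 query heads and 4 KV heads, each KV head is shared by 2 query heads, halving KV-cache size. The sharing can be sketched as a repeat along the head axis:

```python
import numpy as np

def expand_kv(kv, heads=8, kv_heads=4):
    # kv: (kv_heads, seq, head_dim). Each KV head serves
    # heads // kv_heads consecutive query heads.
    group = heads // kv_heads
    return np.repeat(kv, group, axis=0)  # (heads, seq, head_dim)
```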
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"momentum_warmup_start":0.92,"momentum_warmup_steps":1500,"matrix_lr":0.025,"scalar_lr":0.025,"tied_embed_lr":0.035}
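The momentum warmup parameters above can be read as a schedule from 0.92 to the final 0.99 over the first 1500 steps; the linear interpolation shape below is an assumption, since the record lists only the endpoints.

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    # Warm Muon's momentum up from `start` to `end` over
    # `warmup_steps` optimizer steps, then hold it constant.
    frac = min(step / warmup_steps, 1.0)
    return start + (end - start) * frac
```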
Weight Averaging
EMA
parameters: {"decay":0.997}
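The EMA update with decay 0.997 is the standard exponential moving average of weights, applied each step:

```python
def ema_update(avg, new, decay=0.997):
    # w_ema <- decay * w_ema + (1 - decay) * w
    # Evaluation uses w_ema instead of the raw training weights.
    return decay * avg + (1.0 - decay) * new
```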
Compression
zstd
level: null
Evaluation
sliding window eval
parameters: {"stride":64}
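A sketch of sliding-window evaluation with stride 64: the model is run on overlapping windows and each window scores only its last 64 tokens, so every scored token sees near-full left context. The window length of 1024 is an assumption; the record lists only the stride.

```python
def sliding_eval_spans(n_tokens, window=1024, stride=64):
    # Return (start, end, score_from) triples: tokens in
    # [score_from, end) are scored using context [start, end).
    spans = []
    pos = 0
    while pos < n_tokens:
        start = max(0, pos + stride - window)
        end = min(pos + stride, n_tokens)
        spans.append((start, end, pos))
        pos += stride
    return spans
```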
Other
other
7-gram backward-looking eval cache with fixed alpha mixing applied during evaluation
parameters: {"alpha":0.4,"order":7,"eval_time_only":true}
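A sketch of the eval-time cache: scanning left to right, record each 6-token context → next-token count seen so far in the eval stream, and when the current context has been seen before, mix the cache's empirical distribution into the model's probability with fixed alpha = 0.4. Fallback and smoothing details are assumptions.

```python
from collections import defaultdict

def eval_with_ngram_cache(tokens, model_probs, order=7, alpha=0.4):
    # tokens: token ids in eval order.
    # model_probs: model's probability of the true token at each step.
    # Returns mixed probabilities; the cache is updated only after
    # scoring each position (backward-looking, eval-time only).
    cache = defaultdict(lambda: defaultdict(int))
    mixed = []
    for i, tok in enumerate(tokens):
        ctx = tuple(tokens[max(0, i - order + 1):i])  # up to 6 tokens
        counts = cache[ctx]
        total = sum(counts.values())
        p_model = model_probs[i]
        if total > 0:
            p_cache = counts[tok] / total
            p = alpha * p_cache + (1 - alpha) * p_model
        else:
            p = p_model  # context unseen so far: pure model prob
        mixed.append(p)
        counts[tok] += 1
    return mixed
```

Since the cache only ever looks backward within the eval stream and mixing happens at evaluation time, training is unaffected.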
LR Schedule
warmdown
parameters: {"warmdown_iters":3000}
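"Warmdown" here presumably means holding the learning rate constant and then decaying it linearly to zero over the final 3000 iterations, the usual shape in nanogpt-style runs; the linear form is an assumption.

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_iters=3000):
    # Constant LR until the last `warmdown_iters` steps, then
    # linear decay to zero at the end of training.
    remaining = total_steps - step
    if remaining >= warmdown_iters:
        return base_lr
    return base_lr * remaining / warmdown_iters
```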

Novel Contributions

  • Exclusive Self-Attention applied to all 11 layers
  • LeakyReLU(0.5)^2 activation
  • Value Residual mixing from layer 0 into later layers
  • Per-head Gated Attention
  • 7-gram backward-looking evaluation cache with fixed alpha mixing
  • Int6 quantization with zstd compression