PR #715 (open)

Record: XSA-all + LeakyReLU² + VR + GA + 7-gram cache (val_bpb=1.0337)

by Asukabot0
val_bpb: 1.0337
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.99 MB

Training Techniques

Quantization
int6
bits: 6
scope: all
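A minimal sketch of how int6 quantization of the artifact might work. The record only states 6 bits applied to all weights; the symmetric per-tensor scaling below is an assumption, not the PR's actual scheme.

```python
import numpy as np

def quantize_int6(w):
    # Symmetric quantization: map floats onto the signed int6 grid.
    # Using [-31, 31] keeps the scale symmetric (hypothetical choice).
    scale = np.max(np.abs(w)) / 31.0
    q = np.clip(np.round(w / scale), -32, 31).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover approximate float weights for inference/eval.
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.2, 0.03, 0.9], dtype=np.float32)
q, s = quantize_int6(w)
w_hat = dequantize(q, s)
```

The int6 values would then be bit-packed and passed through zstd (listed under Compression below) to reach the ~15.99 MB artifact size.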
Architecture
XSA
Exclusive Self-Attention applied to all 11 layers
parameters: {"layers":11}
LeakyReLU
LeakyReLU(0.5)^2 activation used in place of ReLU^2 to preserve negative gradient flow
parameters: {"negative_slope":0.5,"squared":true}
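As a sketch, the activation composes LeakyReLU with a square. The square makes all outputs non-negative, but unlike ReLU², the gradient for negative inputs is nonzero (d/dx of (0.5x)² is 0.5x), which is the "negative gradient flow" the description refers to.

```python
import numpy as np

def leaky_relu_squared(x, negative_slope=0.5):
    # LeakyReLU(0.5) followed by squaring. For x < 0 the output is
    # (0.5*x)^2, so the gradient path stays alive, unlike ReLU^2
    # which zeroes both value and gradient for negative inputs.
    y = np.where(x > 0, x, negative_slope * x)
    return y * y
```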
Value Residual
Layer 0 value output is mixed into subsequent layers via learned sigmoid gates
parameters: null
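A sketch of the value-residual mixing described above, assuming a per-layer scalar gate that forms a convex combination of the current layer's value projection and layer 0's (the exact parameterization is not given in the record):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mix_value_residual(v_layer, v0, gate_logit):
    # Learned sigmoid gate blends this layer's value output with the
    # value output cached from layer 0 (per-layer scalar gate assumed).
    g = sigmoid(gate_logit)
    return g * v_layer + (1.0 - g) * v0
```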
Gated Attention
Per-head sigmoid gates on attention output
parameters: null
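The per-head gating can be sketched as one learned scalar gate per attention head, applied to that head's output before the output projection (whether the gates are scalar or per-position is an assumption):

```python
import numpy as np

def gate_attention_output(attn_out, head_gate_logits):
    # attn_out: (heads, seq, head_dim); one learned gate logit per head.
    # Each head's output is scaled by sigmoid(gate) before the output
    # projection, letting the model softly switch heads on or off.
    g = 1.0 / (1.0 + np.exp(-head_gate_logits))  # (heads,)
    return attn_out * g[:, None, None]
```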
MLP3x
Transformer MLP uses 3x expansion
parameters: {"multiplier":3}
Partial RoPE
Rotary positional embeddings applied to a subset of dimensions
parameters: {"dimensions":"16/64"}
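Per the "16/64" parameter, only the first 16 of 64 head dimensions are rotated; the rest pass through untouched. A sketch (the choice of which 16 dimensions, and the frequency base, are assumptions):

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    # x: (seq, head_dim). Apply rotary embeddings to the first
    # `rot_dims` dimensions only; the remaining dims carry no
    # positional rotation.
    seq, _ = x.shape
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)
    ang = np.arange(seq)[:, None] * inv_freq[None, :]
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=1)
```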
BigramHash
BigramHash feature with 4096 buckets
parameters: {"buckets":4096}
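The idea is to hash each consecutive token pair into one of 4096 buckets, each indexing an extra learned embedding that supplements the unigram token embedding. The hash function below is purely illustrative; the record does not specify one.

```python
def bigram_hash(prev_tok, tok, buckets=4096):
    # Map a (previous token, current token) pair to a bucket id.
    # The multiplier/xor mix is a hypothetical hash, not the PR's.
    return ((prev_tok * 1000003) ^ tok) % buckets
```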
SmearGate
SmearGate component used in the architecture
parameters: null
U-Net skip connections
U-Net style skip connections added to the transformer
parameters: null
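A sketch of U-Net-style skips over the 11-layer stack: outputs of the first-half layers are saved and added to the inputs of the mirrored second-half layers, with the middle layer unpaired (the pairing and the use of a plain add, rather than learned skip weights, are assumptions):

```python
def forward_with_unet_skips(x, layers):
    # layers: list of callables (the transformer blocks).
    # First half pushes activations; second half pops the mirrored
    # skip and adds it, U-Net style. With 11 layers, layer i pairs
    # with layer 10 - i and layer 5 gets no skip.
    n = len(layers)
    half = n // 2
    saved = []
    for i, layer in enumerate(layers):
        if i >= n - half:
            x = x + saved.pop()   # mirrored skip from the first half
        x = layer(x)
        if i < half:
            saved.append(x)       # stash for the mirrored layer
    return x
```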
GQA
Grouped-query attention with 8 attention heads and 4 KV heads
parameters: {"heads":8,"kv_heads":4}
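With 8 query heads and 4 KV heads, each KV head is shared by 2 query heads, halving KV-cache size. The sharing can be sketched as a repeat along the head axis:

```python
import numpy as np

def expand_kv(kv, heads=8, kv_heads=4):
    # kv: (kv_heads, seq, head_dim). Each KV head serves
    # heads // kv_heads consecutive query heads.
    group = heads // kv_heads
    return np.repeat(kv, group, axis=0)  # (heads, seq, head_dim)
```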
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"momentum_warmup_start":0.92,"momentum_warmup_steps":1500,"matrix_lr":0.025,"scalar_lr":0.025,"tied_embed_lr":0.035}
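The momentum warmup parameters above can be read as a schedule from 0.92 to the final 0.99 over the first 1500 steps; the linear interpolation shape below is an assumption, since the record lists only the endpoints.

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    # Warm Muon's momentum up from `start` to `end` over
    # `warmup_steps` optimizer steps, then hold it constant.
    frac = min(step / warmup_steps, 1.0)
    return start + (end - start) * frac
```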
Weight Averaging
EMA
parameters: {"decay":0.997}
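The EMA update with decay 0.997 is the standard exponential moving average of weights, applied each step:

```python
def ema_update(avg, new, decay=0.997):
    # w_ema <- decay * w_ema + (1 - decay) * w
    # Evaluation uses w_ema instead of the raw training weights.
    return decay * avg + (1.0 - decay) * new
```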
Compression
zstd
level: null
Evaluation
sliding window eval
parameters: {"stride":64}
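A sketch of sliding-window evaluation with stride 64: the model is run on overlapping windows and each window scores only its last 64 tokens, so every scored token sees near-full left context. The window length of 1024 is an assumption; the record lists only the stride.

```python
def sliding_eval_spans(n_tokens, window=1024, stride=64):
    # Return (start, end, score_from) triples: tokens in
    # [score_from, end) are scored using context [start, end).
    spans = []
    pos = 0
    while pos < n_tokens:
        start = max(0, pos + stride - window)
        end = min(pos + stride, n_tokens)
        spans.append((start, end, pos))
        pos += stride
    return spans
```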
Other
other
7-gram backward-looking eval cache with fixed alpha mixing applied during evaluation
parameters: {"alpha":0.4,"order":7,"eval_time_only":true}
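A sketch of the eval-time cache: scanning left to right, record each 6-token context → next-token count seen so far in the eval stream, and when the current context has been seen before, mix the cache's empirical distribution into the model's probability with fixed alpha = 0.4. Fallback and smoothing details are assumptions.

```python
from collections import defaultdict

def eval_with_ngram_cache(tokens, model_probs, order=7, alpha=0.4):
    # tokens: token ids in eval order.
    # model_probs: model's probability of the true token at each step.
    # Returns mixed probabilities; the cache is updated only after
    # scoring each position (backward-looking, eval-time only).
    cache = defaultdict(lambda: defaultdict(int))
    mixed = []
    for i, tok in enumerate(tokens):
        ctx = tuple(tokens[max(0, i - order + 1):i])  # up to 6 tokens
        counts = cache[ctx]
        total = sum(counts.values())
        p_model = model_probs[i]
        if total > 0:
            p_cache = counts[tok] / total
            p = alpha * p_cache + (1 - alpha) * p_model
        else:
            p = p_model  # context unseen so far: pure model prob
        mixed.append(p)
        counts[tok] += 1
    return mixed
```

Since the cache only ever looks backward within the eval stream and mixing happens at evaluation time, training is unaffected.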
LR Schedule
warmdown
parameters: {"warmdown_iters":3000}
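"Warmdown" here presumably means holding the learning rate constant and then decaying it linearly to zero over the final 3000 iterations, the usual shape in nanogpt-style runs; the linear form is an assumption.

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_iters=3000):
    # Constant LR until the last `warmdown_iters` steps, then
    # linear decay to zero at the end of training.
    remaining = total_steps - step
    if remaining >= warmdown_iters:
        return base_lr
    return base_lr * remaining / warmdown_iters
```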

Novel Contributions

  • Exclusive Self-Attention applied to all 11 layers
  • LeakyReLU(0.5)^2 activation
  • Value Residual mixing from layer 0 into later layers
  • Per-head Gated Attention
  • 7-gram backward-looking evaluation cache with fixed alpha mixing
  • Int6 quantization with zstd compression