PR #349

open

Record: 11L XSA + EMA + Int5-MLP (val_bpb=1.1399)

val_bpb: 1.1399
Architecture: Transformer
Optimizer: Muon
Artifact Size: under 16 MB

Training Techniques

Architecture
XSA
Exclusive Self-Attention applied to the last 4 of 11 layers.
parameters: {"layers":4,"total_layers":11}
SmearGate
Custom gating mechanism used in the architecture.
parameters: null
BigramHash
BigramHash feature module with 2048 buckets and 128-dim embeddings.
parameters: {"buckets":2048,"dim":128}
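The PR specifies BigramHash only by its bucket count and embedding width. A minimal sketch of what such a module could look like, assuming bigrams of adjacent token ids are hashed into 2048 buckets (the choice of hash function here is an assumption; any stable hash works):

```python
import hashlib

BUCKETS = 2048   # from parameters
DIM = 128        # embedding width, from parameters

def bigram_bucket(prev_id: int, cur_id: int) -> int:
    """Hash a (prev, cur) token-id bigram into one of BUCKETS buckets."""
    h = hashlib.blake2b(f"{prev_id},{cur_id}".encode(), digest_size=8)
    return int.from_bytes(h.digest(), "little") % BUCKETS

def bigram_buckets(token_ids: list[int]) -> list[int]:
    """Bucket index per position; position 0 pairs with a BOS id of 0 (assumed)."""
    out = []
    prev = 0
    for t in token_ids:
        out.append(bigram_bucket(prev, t))
        prev = t
    return out
```

Each bucket index would then select a row of a (2048, 128) embedding table that is added to the regular token embedding.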
U-Net skip connections
Skip connections inspired by U-Net added to the Transformer.
parameters: null
tied embeddings
Input and output embeddings are tied.
parameters: null
KV head count
Grouped-query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
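With 8 query heads sharing 4 KV heads, grouped-query attention pairs every two query heads with one KV head. A sketch of the head-to-KV-head mapping implied by those parameters:

```python
HEADS = 8       # query heads, from parameters
KV_HEADS = 4    # shared key/value heads, from parameters
GROUP = HEADS // KV_HEADS  # 2 query heads per KV head

def kv_head_for(q_head: int) -> int:
    """Index of the KV head that query head q_head attends with."""
    return q_head // GROUP
```

This halves the KV-cache size relative to full multi-head attention while keeping 8-way query diversity.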
Weight Averaging
EMA
Exponential moving average of the weights, kept in float32 on the GPU and updated every step.
parameters: {"decay":0.997,"update_frequency":"every step","device":"GPU","dtype":"float32"}
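Because the EMA shadow lives in float32 on the same device as the weights, the per-step update is a single in-place blend with no CPU transfer. A framework-agnostic sketch with the stated decay:

```python
DECAY = 0.997  # from parameters

def ema_update(ema: list[float], weights: list[float], decay: float = DECAY) -> None:
    """In-place EMA update: ema <- decay * ema + (1 - decay) * w."""
    for i, w in enumerate(weights):
        ema[i] = decay * ema[i] + (1.0 - decay) * w
```

In a real training loop this would be a fused tensor op (e.g. a lerp) over each parameter, run under no-grad after the optimizer step.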
Quantization
mixed int5/int6/int8
bits: null
scope: int5 for MLP weights, int6 for attention weights, int8 for embeddings; FP16 kept for small tensors
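One quantizer parameterized by bit width covers all three integer cases (int5 MLP, int6 attention, int8 embeddings). A sketch assuming symmetric per-tensor absmax scaling; the actual scaling scheme is not stated in the PR:

```python
def quantize(weights: list[float], bits: int) -> tuple[list[int], float]:
    """Quantize to signed `bits`-bit integers with an absmax scale."""
    qmax = 2 ** (bits - 1) - 1  # e.g. 15 for int5, 31 for int6, 127 for int8
    absmax = max((abs(w) for w in weights), default=0.0)
    scale = (absmax / qmax) if absmax > 0 else 1.0
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q: list[int], scale: float) -> list[float]:
    """Recover approximate float weights from quantized values."""
    return [x * scale for x in q]
```

The int5/int6 values would then be bit-packed before the zstd pass, since no standard dtype holds them natively.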
Compression
zstd
level: 22
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"matrix_lr":0.025,"scalar_lr":0.025}
AdamW
weight_decay: null
momentum: null
other_params: {"used_for":"embeddings","tied_embed_lr":0.035}
Evaluation
sliding window eval
parameters: {"stride":64}
LR Schedule
cosine warmdown
parameters: {"warmdown_steps":3000,"warmup_steps":20}
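A sketch of the schedule implied by the stated hyperparameters: 20 linear warmup steps, then a flat phase, then a 3000-step cosine warmdown to zero. The total step count and the flat middle phase are assumptions; the PR only fixes the warmup and warmdown lengths:

```python
import math

WARMUP = 20      # from parameters
WARMDOWN = 3000  # from parameters

def lr_multiplier(step: int, total_steps: int) -> float:
    """LR scale: linear warmup, flat middle, cosine warmdown to 0."""
    if step < WARMUP:
        return (step + 1) / WARMUP
    if step < total_steps - WARMDOWN:
        return 1.0
    t = (step - (total_steps - WARMDOWN)) / WARMDOWN  # 0 -> 1 over warmdown
    return 0.5 * (1.0 + math.cos(math.pi * t))
```

The multiplier would scale both the Muon matrix LR (0.025) and the AdamW embedding LR (0.035).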
Regularization
weight decay
parameters: {"value":0.04}
magnitude pruning
parameters: {"pruning_ratio":0.08}
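A sketch of 8% magnitude pruning: zero the smallest 8% of weights by absolute value. Whether the threshold is per-tensor or global is not stated, so per-tensor is assumed here:

```python
def magnitude_prune(weights: list[float], ratio: float = 0.08) -> list[float]:
    """Zero out the `ratio` fraction of weights with smallest |w|.

    Ties at the threshold may prune slightly more than the target count.
    """
    k = int(len(weights) * ratio)
    if k == 0:
        return list(weights)
    threshold = sorted(abs(w) for w in weights)[k - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]
```

The resulting runs of zeros in the quantized weights are what make the zstd-22 pass effective at squeezing the artifact under 16 MB.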
Sequence Length
sequence_length
train_length: 2048
eval_length: null

Novel Contributions

  • 11-layer Transformer with XSA applied to the last 4 layers
  • Continuous GPU float32 EMA updated every step without CPU transfers
  • Mixed int5 MLP / int6 attention / int8 embedding quantization
  • 8% magnitude pruning combined with zstd-22 compression
  • Sliding-window evaluation with stride 64
  • Muon optimizer with cosine warmdown schedule