PR #354

open

[Non-record] MLA + SmearGate + BigramHash + SWA — pre-quant 1.2838 bpb

by Skrisps26
val_bpb
1.2838
Architecture
Transformer
Optimizer
Muon
Artifact Size
14.449 MB

Training Techniques

Architecture
MLA
Multi-Head Latent Attention with reduced-rank KV projection to improve parameter efficiency.
parameters: {"kv_rank":128,"num_heads":8,"num_kv_heads":4}
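A minimal sketch of the MLA attention described above, using the PR's kv_rank=128, num_heads=8, and num_kv_heads=4. The model width, head_dim, weight names, and the omission of RoPE are illustrative assumptions, not from the PR: a shared down-projection compresses the KV stream to a rank-128 latent (the part a KV cache would store), and small per-head up-projections expand it back for attention.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, kv_rank = 256, 128            # kv_rank=128 as in the PR
num_heads, num_kv_heads, head_dim = 8, 4, 32
group = num_heads // num_kv_heads      # each KV head serves 2 query heads

W_dkv = rng.standard_normal((d_model, kv_rank)) / np.sqrt(d_model)   # shared compressor
W_uk = rng.standard_normal((kv_rank, num_kv_heads * head_dim)) / np.sqrt(kv_rank)
W_uv = rng.standard_normal((kv_rank, num_kv_heads * head_dim)) / np.sqrt(kv_rank)
W_q = rng.standard_normal((d_model, num_heads * head_dim)) / np.sqrt(d_model)

def mla(x):
    T = x.shape[0]
    c_kv = x @ W_dkv                                        # (T, kv_rank) latent
    q = (x @ W_q).reshape(T, num_heads, head_dim)
    k = (c_kv @ W_uk).reshape(T, num_kv_heads, head_dim)
    v = (c_kv @ W_uv).reshape(T, num_kv_heads, head_dim)
    k = np.repeat(k, group, axis=1)                         # GQA: share KV heads
    v = np.repeat(v, group, axis=1)
    scores = np.einsum("thd,shd->hts", q, k) / np.sqrt(head_dim)
    scores += np.triu(np.full((T, T), -np.inf), k=1)        # causal mask
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return np.einsum("hts,shd->thd", w, v).reshape(T, num_heads * head_dim)

x = rng.standard_normal((16, d_model))
print(mla(x).shape)  # (16, 256)
```

The parameter saving comes from caching and projecting the 128-dim latent instead of full per-head keys and values.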
SmearGate
SmearGate MLP using relu^2 gating.
parameters: {"mlp_mult":3}
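The PR does not spell out SmearGate's internals. One plausible reading, sketched below, gates a blend of each position with its predecessor (the "smear") before a relu² MLP with the stated 3x expansion; the smear interpretation and all weight names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, mlp_mult = 64, 3                 # mlp_mult=3 as in the PR
d_hidden = mlp_mult * d_model

W_in = rng.standard_normal((d_model, d_hidden)) / np.sqrt(d_model)
W_out = rng.standard_normal((d_hidden, d_model)) / np.sqrt(d_hidden)
w_gate = rng.standard_normal(d_model) / np.sqrt(d_model)   # per-position smear gate

def relu2(z):
    return np.maximum(z, 0.0) ** 2        # relu^2 activation

def smear_gate_mlp(x):
    # "Smear": blend each position toward its predecessor via a sigmoid gate
    # (an interpretation of the name; the PR does not spell this out).
    g = 1.0 / (1.0 + np.exp(-(x @ w_gate)))     # (T,) gate values in (0, 1)
    prev = np.roll(x, 1, axis=0)
    prev[0] = x[0]                               # nothing before position 0
    smeared = x + g[:, None] * (prev - x)
    return relu2(smeared @ W_in) @ W_out         # relu^2 MLP, 3x expansion

x = rng.standard_normal((8, d_model))
print(smear_gate_mlp(x).shape)  # (8, 64)
```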
BigramHash
BigramHash embeddings using hashed bigram buckets.
parameters: {"buckets":10240,"dim":128}
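A minimal sketch of hashed bigram embeddings with the PR's 10240 buckets and 128 dimensions. The hash mixing constant, padding at position 0, and initialization scale are assumptions; the idea is to give each (previous, current) token pair a learned vector without a full vocab² table.

```python
import numpy as np

BUCKETS, DIM = 10240, 128   # from the PR parameters

rng = np.random.default_rng(2)
bigram_table = rng.standard_normal((BUCKETS, DIM)).astype(np.float32) * 0.02

def bigram_bucket(prev_tok, tok):
    # Hash the (previous, current) token pair into one of 10240 buckets.
    # The multiplier is an arbitrary mixing prime, not from the PR.
    return (prev_tok * 1000003 + tok) % BUCKETS

def bigram_embed(tokens):
    prev = [tokens[0]] + list(tokens[:-1])   # pad position 0 with itself
    idx = [bigram_bucket(p, t) for p, t in zip(prev, tokens)]
    return bigram_table[idx]                 # (T, 128), added to token embeddings

print(bigram_embed([5, 17, 17, 9]).shape)  # (4, 128)
```

At 10240 × 128 parameters the table is cheap relative to the rest of the model, and hash collisions simply share a bucket.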
Weight Averaging
SWA (stochastic weight averaging)
parameters: {"start_frac":0.4,"every":50}
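With start_frac=0.4 and every=50, snapshots begin at 40% of training and the running equal-weight average is updated every 50 steps. A sketch with scalar parameters standing in for tensors:

```python
class SWA:
    """Running equal-weight average of checkpoints (start_frac=0.4, every=50)."""

    def __init__(self, total_steps, start_frac=0.4, every=50):
        self.start = int(start_frac * total_steps)
        self.every = every
        self.avg = None
        self.n = 0

    def update(self, step, params):
        if step < self.start or (step - self.start) % self.every != 0:
            return
        self.n += 1
        if self.avg is None:
            self.avg = dict(params)
        else:
            for k, v in params.items():
                self.avg[k] += (v - self.avg[k]) / self.n   # incremental mean

swa = SWA(total_steps=1000)
for step in range(1000):
    swa.update(step, {"w": float(step)})
print(swa.n, swa.avg["w"])  # 12 675.0
```

The averaged weights replace the final-step weights at evaluation time.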
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: null
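A sketch of a Muon-style update with the PR's momentum 0.99 and weight decay 0.04. The Newton-Schulz coefficients follow the public Muon implementation; the learning rate, plain (non-Nesterov) momentum, and decoupled weight decay are assumptions here.

```python
import numpy as np

def newton_schulz(G, steps=5):
    # Odd quintic iteration that pushes the singular values of G toward 1,
    # approximately orthogonalizing the momentum buffer.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(W, grad, buf, lr=0.02, momentum=0.99, weight_decay=0.04):
    # momentum=0.99 and weight_decay=0.04 as in the PR; lr is a placeholder.
    buf[:] = momentum * buf + grad           # heavy-ball momentum buffer
    update = newton_schulz(buf)
    return W - lr * (update + weight_decay * W)

rng = np.random.default_rng(3)
W = rng.standard_normal((32, 64))
buf = np.zeros_like(W)
W2 = muon_step(W, rng.standard_normal(W.shape), buf)
print(W2.shape)  # (32, 64)
```

Muon is typically applied only to 2-D hidden-layer weight matrices, with embeddings and scalars handled by a standard optimizer.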
Quantization
mixed int5/int6
bits: null
scope: MLP and attention
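The PR applies mixed int5/int6 quantization over MLP and attention weights. A minimal symmetric per-tensor scheme, assuming round-to-nearest and one scale per tensor (the rounding mode and granularity are assumptions); the zstd level-22 pass below then compresses the packed integers.

```python
import numpy as np

def quantize_symmetric(w, bits):
    # Symmetric per-tensor quantization to signed `bits`-bit integers.
    qmax = 2 ** (bits - 1) - 1            # 15 for int5, 31 for int6
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(4)
w = rng.standard_normal((64, 64)).astype(np.float32)
for bits in (5, 6):                       # mixed int5/int6 across tensors
    q, s = quantize_symmetric(w, bits)
    err = np.abs(dequantize(q, s) - w).max()
    print(bits, float(err) <= 0.5 * s + 1e-6)
```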
Compression
zstd
level: 22
Evaluation
sliding-window eval
parameters: {"stride":64}
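With stride=64 from the PR, each evaluation window scores only its final 64 tokens, so every token is scored exactly once with (near-)maximal preceding context. The context length below is a placeholder, not from the PR:

```python
def sliding_windows(n_tokens, context=512, stride=64):
    # Yields (window_start, first_scored_token, window_end) spans.
    # Tokens before first_scored_token are context only; stride=64 as in the PR.
    spans = []
    pos = 0
    while pos < n_tokens:
        start = max(0, pos + stride - context)
        end = min(pos + stride, n_tokens)
        spans.append((start, pos, end))
        pos = end
    return spans

print(sliding_windows(200, context=128, stride=64))
# [(0, 0, 64), (0, 64, 128), (64, 128, 192), (128, 192, 200)]
```

Summing the scored-token losses over all spans and dividing by total bytes yields the reported bits-per-byte.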

Novel Contributions

  • Combines MLA with kv_rank=128 for parameter-efficient attention
  • Introduces SmearGate MLP with relu^2 gating and mlp_mult=3
  • Uses BigramHash embeddings with 10240 buckets and 128-dimensional embeddings
  • Applies SWA from 40% of training onward, updating the average every 50 steps
  • Uses Muon optimizer with momentum 0.99 and weight decay 0.04
  • Employs mixed int5/int6 quantization with zstd-22 compression
  • Evaluates with sliding-window inference using stride 64