PR #354 (open, non-record)
MLA + SmearGate + BigramHash + SWA — pre-quant 1.2838 bpb
by Skrisps26
val_bpb: 1.2838
Architecture: Transformer
Optimizer: Muon
Artifact Size: 14.449 MB
Training Techniques
Architecture
MLA
Multi-Head Latent Attention with reduced-rank KV projection to improve parameter efficiency.
parameters: {"kv_rank":128,"num_heads":8,"num_kv_heads":4}
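A rough sketch of where the parameter savings come from: MLA factors the K/V projections through a shared low-rank latent instead of projecting directly from the model width. The model width below (d_model=1024) is an illustrative assumption, not taken from the PR; kv_rank, num_heads, and num_kv_heads are the PR's values.

```python
# Parameter count for the KV path, standard projection vs. a rank-128
# latent factorization (d_model is assumed; the rest are PR parameters).
d_model = 1024                    # assumed model width
kv_rank = 128                     # latent dimension from the PR
num_heads = 8
num_kv_heads = 4
head_dim = d_model // num_heads   # 128

# Standard attention: separate full-rank K and V projections.
standard_kv_params = 2 * d_model * (num_kv_heads * head_dim)   # 1,048,576

# MLA-style: down-project to the latent, then up-project to K and V.
mla_kv_params = d_model * kv_rank + kv_rank * 2 * (num_kv_heads * head_dim)  # 262,144
```

Under these assumed widths, the factored KV path is 4x smaller; the latent can also be cached in place of full K/V states.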
SmearGate
SmearGate MLP using relu^2 gating.
parameters: {"mlp_mult":3}
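The PR names the gate nonlinearity (relu^2) but not the exact wiring, so the gated-MLP structure below is a guess: a value path multiplied elementwise by a relu^2-activated gate path. Scalars stand in for the weight matrices; mlp_mult=3 would widen the hidden layer 3x in a real model.

```python
def relu2(x: float) -> float:
    """Squared ReLU: max(x, 0)**2, the gate nonlinearity named in the PR."""
    return max(x, 0.0) ** 2

def smeargate_mlp(x: float, w_in: float, w_gate: float, w_out: float) -> float:
    """Hypothetical gated-MLP sketch: value path * relu^2(gate path),
    then an output projection. Weights are floats for illustration."""
    hidden = (x * w_in) * relu2(x * w_gate)  # gated hidden activation
    return hidden * w_out
```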
BigramHash
BigramHash embeddings using hashed bigram buckets.
parameters: {"buckets":10240,"dim":128}
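The idea can be sketched as: hash each (previous token, current token) pair into one of 10,240 buckets, and use the bucket to index a 128-dim embedding table that is added to the ordinary token embedding. The mixing constant below is illustrative; the PR specifies only the bucket count and embedding dimension, not the hash function.

```python
def bigram_bucket(prev_token: int, token: int, buckets: int = 10240) -> int:
    """Map a (prev, current) token pair to one of `buckets` hash buckets.
    The multiplicative mix is an assumption, not the PR's actual hash."""
    h = (prev_token * 1000003 + token) & 0xFFFFFFFF  # simple 32-bit mix
    return h % buckets
```

Each bucket would then select a row of a (10240, 128) embedding matrix; collisions are tolerated, trading exactness for a fixed parameter budget.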
Weight Averaging
SWA (stochastic weight averaging)
parameters: {"start_frac":0.4,"every":50}
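A minimal sketch of the averaging step: with start_frac=0.4 and every=50, checkpoints from the last 60% of training are folded into a running equal-weight average every 50 steps. Parameters are flat lists of floats here for illustration.

```python
def swa_update(avg: list, current: list, n_averaged: int) -> list:
    """Fold one more checkpoint into a running equal-weight average.
    `n_averaged` is how many checkpoints `avg` already contains."""
    return [a + (c - a) / (n_averaged + 1) for a, c in zip(avg, current)]
```

The averaged weights, not the final raw weights, would then be what gets quantized and shipped.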
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: null
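For context, a scalar sketch of the momentum and decoupled weight-decay part of the update with the PR's settings. Muon additionally orthogonalizes the momentum matrix (via a Newton-Schulz iteration) before applying it; that step is omitted here, and the learning rate is an assumed placeholder.

```python
def momentum_wd_step(param: float, grad: float, buf: float,
                     lr: float = 0.02, momentum: float = 0.99,
                     weight_decay: float = 0.04) -> tuple:
    """One update with the PR's momentum/weight-decay values.
    Omits Muon's orthogonalization step; lr=0.02 is assumed."""
    buf = momentum * buf + grad                         # momentum accumulation
    param = param - lr * (buf + weight_decay * param)   # decoupled decay
    return param, buf
```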
Quantization
mixed int5/int6
bits: null
scope: MLP and attention
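A generic sketch of symmetric signed quantization to a given bit width; the PR says int5 for some tensors and int6 for others ("mixed int5/int6") but not which tensors get which, or whether scales are per-tensor or per-channel, so per-tensor is assumed here.

```python
def quantize(values: list, bits: int) -> tuple:
    """Symmetric per-tensor quantization to signed `bits`-bit integers
    (range e.g. -16..15 for int5, -32..31 for int6)."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax or 1.0  # avoid zero scale
    q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return q, scale

def dequantize(q: list, scale: float) -> list:
    return [x * scale for x in q]
```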
Compression
zstd
level: 22
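The pipeline then entropy-codes the quantized byte stream. The PR uses zstd at level 22; the stdlib `zlib` stands in below since zstd bindings may not be installed (zlib levels top out at 9), but the compress/decompress round trip is the same shape.

```python
import zlib

def pack(q_bytes: bytes, level: int = 9) -> bytes:
    """Compress the quantized weight stream (zlib stand-in for zstd-22)."""
    return zlib.compress(q_bytes, level)

def unpack(blob: bytes) -> bytes:
    return zlib.decompress(blob)
```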
Evaluation
sliding window eval
parameters: {"stride":64}
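Sliding-window evaluation with stride 64 can be sketched as: the window advances 64 tokens at a time, each forward pass sees up to `window` tokens of context, but only the final 64 tokens of each pass contribute to the loss, so every token is scored exactly once. The `window` size is an assumed parameter; the PR states only the stride.

```python
def sliding_window_spans(n_tokens: int, window: int, stride: int = 64) -> list:
    """Return (context_start, score_from, end) spans for sliding-window
    eval: tokens in [score_from, end) are scored, with context from
    context_start onward. Each token is scored exactly once."""
    spans = []
    pos = 0
    while pos < n_tokens:
        start = max(0, pos + stride - window)
        end = min(pos + stride, n_tokens)
        spans.append((start, pos, end))
        pos = end
    return spans
```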
Novel Contributions
- Combines MLA with kv_rank=128 for parameter-efficient attention
- Introduces SmearGate MLP with relu^2 gating and mlp_mult=3
- Uses BigramHash embeddings with 10240 buckets and 128-dimensional embeddings
- Applies stochastic weight averaging (SWA) from 40% of training onward, averaging every 50 steps
- Uses Muon optimizer with momentum 0.99 and weight decay 0.04
- Employs mixed int5/int6 quantization with zstd-22 compression
- Evaluates with sliding-window inference using stride 64