| val_bpb | Architecture | Optimizer | Artifact Size |
| --- | --- | --- | --- |
| 1.1400 | — | Muon | — |
Training Techniques
**Architecture: BigramHash.** Adds token-pair hashing for cheap local context.
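The record gives no parameters for BigramHash, so the following is only a minimal sketch of the idea: hash each (previous, current) token pair into a bucketed embedding table to give the model cheap bigram-level context. The table size, embedding width, and hash constant below are illustrative, not the record's values.

```python
import numpy as np

def bigram_hash_embedding(tokens, table, prime=1000003):
    """Look up an extra embedding keyed by a hash of each (prev, cur) token pair.

    tokens: 1-D array of token ids; table: (num_buckets, dim) embedding table.
    The mixing constant `prime` and bucket count are illustrative assumptions.
    """
    prev = np.concatenate([[0], tokens[:-1]])        # shifted-right ids; pad position 0
    idx = (prev * prime + tokens) % table.shape[0]   # cheap pair hash into buckets
    return table[idx]                                # (len(tokens), dim)

rng = np.random.default_rng(0)
table = rng.normal(size=(4096, 8))
emb = bigram_hash_embedding(np.array([5, 17, 17, 3]), table)
```

The output would typically be added to (or concatenated with) the ordinary token embedding before the first block.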
**Architecture: SmearGate.** Learns a gate to blend information between adjacent tokens.
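SmearGate's parameterization is not specified in the record; a minimal sketch, assuming a learned per-position scalar gate that convexly blends each token's activation with its predecessor's:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def smear_gate(x, w_gate):
    """Blend each token's vector with the previous token's via a learned gate.

    x: (seq, dim) activations; w_gate: (dim,) learned weights (assumed shape).
    The gate g in (0, 1) decides how much of the previous token to mix in.
    """
    prev = np.concatenate([x[:1], x[:-1]], axis=0)  # previous-token rows; row 0 reuses itself
    g = sigmoid(x @ w_gate)[:, None]                # one scalar gate per position
    return (1.0 - g) * x + g * prev                 # convex blend of current and previous

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))
out = smear_gate(x, rng.normal(size=4))
```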
**Initialization: OrthoInit.** Linear layers use orthogonal initialization.
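Orthogonal initialization is standard: draw a Gaussian matrix and keep the orthogonal factor of its QR decomposition. A self-contained sketch (the gain/scale convention is left at 1.0; the record does not specify one):

```python
import numpy as np

def orthogonal_init(shape, rng):
    """Orthogonal initialization for a linear layer's weight matrix.

    Takes the Q factor of a QR decomposition of a Gaussian matrix, with
    column signs fixed so the result is uniquely determined.
    """
    a = rng.normal(size=shape)
    q, r = np.linalg.qr(a)
    q *= np.sign(np.diag(r))   # sign fix: make the factorization canonical
    return q

w = orthogonal_init((8, 8), np.random.default_rng(0))
```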
**Quantization: STE QAT.** Quantization-aware training with a straight-through estimator; bits: 8, scope: all.
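In straight-through-estimator QAT, the forward pass fake-quantizes the weights while the backward pass treats the rounding as the identity, so gradients still reach the full-precision copy. A sketch of the forward-pass fake-quantization at 8 bits (symmetric per-tensor scaling is an assumption; the record only gives bits and scope):

```python
import numpy as np

def fake_quant(w, bits=8):
    """Simulate symmetric uniform quantization in the forward pass.

    With an STE, backward treats round() as identity, so the full-precision
    weights keep receiving gradients while forward sees quantized values.
    """
    qmax = 2 ** (bits - 1) - 1                       # 127 for 8 bits
    scale = np.abs(w).max() / qmax                   # per-tensor symmetric scale (assumed)
    return np.round(w / scale).clip(-qmax, qmax) * scale

w = np.linspace(-1.0, 1.0, 11)
q = fake_quant(w)
```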
**Weight Averaging: SWA.** Stochastic Weight Averaging, applied during the warmdown phase.
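SWA here means keeping a running average of the weights and evaluating with the average instead of the final iterate. The record only says it runs during warmdown; the trigger step and update cadence below are illustrative:

```python
import numpy as np

class WarmdownSWA:
    """Running mean of weights, accumulated only once warmdown begins.

    `warmdown_start` and per-step updates are assumptions; the record only
    specifies that SWA is active during the warmdown phase.
    """
    def __init__(self, warmdown_start):
        self.warmdown_start = warmdown_start
        self.avg = None
        self.count = 0

    def update(self, step, weights):
        if step < self.warmdown_start:
            return                                   # not yet in warmdown
        self.count += 1
        if self.avg is None:
            self.avg = weights.copy()
        else:
            self.avg += (weights - self.avg) / self.count  # incremental mean

swa = WarmdownSWA(warmdown_start=2)
for step, w in enumerate([0.0, 10.0, 1.0, 3.0]):
    swa.update(step, np.array([w]))
```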
**Optimizer: Muon.** Weight-decay support enabled; weight_decay and momentum values unspecified.
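Muon's defining step orthogonalizes each 2-D momentum-smoothed gradient before applying it. The sketch below uses a simple cubic Newton-Schulz iteration to convey the idea; production Muon implementations use a tuned higher-order polynomial, and the learning rate, momentum, and decay values here are placeholders, not the record's settings:

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=30):
    """Approximate the orthogonal polar factor of g via Newton-Schulz iteration.

    Cubic variant for clarity; tuned quintic polynomials are used in practice.
    """
    x = g / (np.linalg.norm(g) + 1e-7)      # normalize so the iteration converges
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x

def muon_step(w, grad, momentum_buf, lr=0.02, beta=0.95, weight_decay=0.0):
    """One illustrative Muon-style update with optional decoupled weight decay."""
    momentum_buf = beta * momentum_buf + grad
    update = newton_schulz_orthogonalize(momentum_buf)
    return w - lr * (update + weight_decay * w), momentum_buf

rng = np.random.default_rng(0)
g = rng.normal(size=(4, 4))
q = newton_schulz_orthogonalize(g)
```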
**Regularization: weight decay.**
**Other: magnitude pruning.** Zeros out the smallest 3% of weights post-training (prune_fraction: 0.03).
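Post-training magnitude pruning is straightforward: find the magnitude threshold below which 3% of weights fall and zero everything at or below it. A minimal sketch (ties at the threshold may prune marginally more than the exact fraction):

```python
import numpy as np

def magnitude_prune(w, prune_fraction=0.03):
    """Zero out the smallest-magnitude `prune_fraction` of weights."""
    k = int(round(w.size * prune_fraction))
    if k == 0:
        return w.copy()
    threshold = np.partition(np.abs(w).ravel(), k - 1)[k - 1]  # k-th smallest magnitude
    out = w.copy()
    out[np.abs(w) <= threshold] = 0.0       # ties at the threshold are also pruned
    return out

pruned = magnitude_prune(np.arange(1.0, 101.0), prune_fraction=0.03)
```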
**Compression: zstd.** Compression level 22 (maximum).
**Evaluation: sliding window eval.** Stride: 64.
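Sliding-window evaluation scores long sequences with overlapping windows, counting each token's loss only once, from the window where it has the most context. Only the stride (64) comes from the record; the window size of 256 and the per-window `logprobs_fn` interface are assumptions for the sketch:

```python
import numpy as np

def sliding_window_nll(logprobs_fn, tokens, window=256, stride=64):
    """Mean per-token NLL using overlapping windows of size `window`.

    `logprobs_fn(window_tokens)` is assumed to return per-token log-probs for
    its input. Each window advances by `stride`, and only the tokens not yet
    scored by an earlier window contribute, so every token is counted once.
    """
    nll_sum, count, prev_end = 0.0, 0, 0
    for begin in range(0, len(tokens), stride):
        end = min(begin + window, len(tokens))
        lp = logprobs_fn(tokens[begin:end])  # per-token log-probs, length end - begin
        target = end - prev_end              # tokens new to this window
        nll_sum += -lp[-target:].sum()
        count += target
        prev_end = end
        if end == len(tokens):
            break
    return nll_sum / count

# Dummy model assigning constant log-prob -1.0 to every token, for illustration.
result = sliding_window_nll(lambda t: -np.ones(len(t)), np.arange(300))
```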
Novel Contributions
- BigramHash embedding for cheap local context
- SmearGate for blending adjacent token information
- Orthogonal initialization for linear layers
- STE-based quantization-aware training
- Stochastic Weight Averaging during warmdown
- Muon optimizer with weight decay support
- Magnitude pruning of the smallest 3% of weights
- Maximum Zstandard compression for the artifact
- Sliding window evaluation with stride 64