val_bpb: 1.3660
Architecture: Transformer
Optimizer: Muon
Artifact Size: 12 MB
Training Techniques
Architecture
SmearGate
SmearGate for local context
parameters: null
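
The card does not define SmearGate beyond "local context". One plausible reading, assumed here rather than taken from the source, is a learned per-channel gate that mixes each token's vector with the previous token's. The function name `smear_gate`, the `gate_logits` parameter, and the exact blend formula are all hypothetical; this is a sketch of the idea, not the submission's implementation.

```python
import math


def smear_gate(tokens, gate_logits):
    """Blend each token vector with its predecessor (assumed SmearGate form).

    tokens: list of T vectors, each a list of D floats.
    gate_logits: list of D floats, standing in for learned parameters.
    """
    # Sigmoid squashes each logit into a mixing weight in (0, 1).
    gates = [1.0 / (1.0 + math.exp(-g)) for g in gate_logits]
    out = []
    for t, x in enumerate(tokens):
        if t == 0:
            # The first position has no predecessor, so it passes through.
            out.append(list(x))
            continue
        prev = tokens[t - 1]
        # Per channel: keep (1 - g) of the current token, add g of the previous.
        out.append([(1.0 - g) * xi + g * pi for g, xi, pi in zip(gates, x, prev)])
    return out
```

With all logits at 0 every gate is 0.5, so each output vector is the average of a token and its predecessor; strongly negative logits leave the input nearly unchanged.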
Optimizer
Muon
weight_decay: 0.02
momentum: null
other_params: null
Quantization
int6
bits: 6
scope: null
Compression
zstd
level: 22
Novel Contributions
- Use of SmearGate for local context
- Application of Muon optimizer with 0.02 weight decay
- Int6 quantization
- zstd compression at the maximum level (22)
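
The int6 entry can be illustrated with a minimal sketch. The card leaves the quantization scope as null, so the choices below are assumptions: symmetric per-tensor scaling, signed codes in [-31, 31], and four codes packed into every three bytes. All function names are hypothetical. The zstd step is omitted to keep the sketch dependency-free; in a pipeline like the one described, the packed bytes would be what gets handed to the compressor.

```python
def quantize_int6(weights):
    """Symmetric per-tensor quantization to signed 6-bit codes in [-31, 31]."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 31 if max_abs > 0 else 1.0
    codes = [max(-31, min(31, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_int6(codes, scale):
    """Map 6-bit codes back to approximate float weights."""
    return [c * scale for c in codes]


def pack_int6(codes):
    """Pack signed 6-bit codes into bytes (6 bits each, zero-padded at the end)."""
    bits, nbits, out = 0, 0, bytearray()
    for c in codes:
        bits = (bits << 6) | (c & 0x3F)  # low 6 bits, two's complement
        nbits += 6
        while nbits >= 8:
            nbits -= 8
            out.append((bits >> nbits) & 0xFF)
        bits &= (1 << nbits) - 1  # keep only the bits not yet written
    if nbits:
        out.append((bits << (8 - nbits)) & 0xFF)
    return bytes(out)


def unpack_int6(data, count):
    """Recover `count` signed 6-bit codes from bytes produced by pack_int6."""
    bits, nbits, codes = 0, 0, []
    for byte in data:
        bits = (bits << 8) | byte
        nbits += 8
        while nbits >= 6 and len(codes) < count:
            nbits -= 6
            u = (bits >> nbits) & 0x3F
            codes.append(u - 64 if u >= 32 else u)  # sign-extend from 6 bits
        bits &= (1 << nbits) - 1
    return codes
```

Under these assumptions the round-trip error per weight is at most half the scale, and storage before compression is 6 bits per weight plus one float for the scale.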