| val_bpb | Architecture | Optimizer | Artifact Size |
|---------|--------------|-----------|---------------|
| 1.1444  | Transformer  | Muon      | 15.90 MB      |
Training Techniques

Quantization
- STE QAT (bits: 5, scope: MLP)
- STE QAT (bits: 6, scope: attention)
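A minimal sketch of the STE fake-quantization step, assuming symmetric per-tensor quantization in PyTorch; the report does not show the training code, so `fake_quantize` and the call sites in the comments are illustrative:

```python
import torch

def fake_quantize(w: torch.Tensor, bits: int) -> torch.Tensor:
    """Symmetric per-tensor fake quantization with a straight-through estimator.

    Forward: round weights onto a signed `bits`-wide integer grid.
    Backward: gradients pass through unchanged (STE).
    """
    qmax = 2 ** (bits - 1) - 1                       # 15 for int5, 31 for int6
    scale = w.detach().abs().max().clamp(min=1e-8) / qmax
    q = (w / scale).round().clamp(-qmax, qmax) * scale
    # w + (q - w).detach() evaluates to q in the forward pass, but its
    # gradient w.r.t. w is the identity, so training "sees" quantized
    # weights while updates flow as usual.
    return w + (q - w).detach()

# Mixed precision per the table above (hypothetical call sites):
# mlp_w  = fake_quantize(mlp_weight, bits=5)   # scope: MLP
# attn_w = fake_quantize(attn_weight, bits=6)  # scope: attention
```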
Architecture
- BigramHash: increased the bigram hash bucket count to improve bigram coverage (buckets: 12288, bigram_dim: 128)
- SmearGate: the transformer block uses a 3x MLP with SmearGate (mlp_multiplier: 3); both modules are sketched below
Weight Averaging
- SWA (every_steps: 25)
Evaluation
- stride-based eval (stride: 32)
Compression
- zstd (level: 22)
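The final artifact is zstd-compressed at level 22, the codec's maximum. A sketch using the `zstandard` package; the serialization format is not specified, so `raw_weights` stands in for the packed, quantized parameters:

```python
import zstandard as zstd

def compress_artifact(raw_weights: bytes, level: int = 22) -> bytes:
    """zstd at level 22 trades compression time for the smallest artifact;
    decompression stays fast, which suits a write-once model file."""
    return zstd.ZstdCompressor(level=level).compress(raw_weights)
```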
Regularization
- magnitude pruning (sparsity: 0.05)
Novel Contributions
- Applied quantization-aware training (QAT) with straight-through-estimator (STE) fake quantization to reduce post-quantization degradation
- Used mixed-precision quantization: int5 for MLP weights, int6 for attention weights
- Expanded BigramHash from 10240 to 12288 buckets to improve bigram coverage
- Reduced the evaluation stride from 64 to 32
- Applied 5% magnitude pruning
- Applied stochastic weight averaging (SWA) every 25 training steps