val_bpb: 1.3039
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.4 MB
Training Techniques

- Quantization
  - Mixed int5/int6: 5-bit weights for the MLPs, 6-bit weights for attention
  - QAT with the straight-through estimator (STE), applied to all weights
- Compression
  - zstd at level 22
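The QAT scheme above can be sketched with fake quantization: the forward pass rounds weights onto the int5/int6 grid while keeping them in float, and under STE the backward pass treats the rounding as identity. Per-tensor symmetric scaling is an assumption here; the actual granularity is not stated in the summary.

```python
def fake_quant(w, bits):
    """Symmetric per-tensor fake quantization: quantize, then dequantize.

    The forward pass sees the quantization error while weights stay in
    float. Under STE, the backward pass treats this op as identity, so
    gradients flow through unchanged. Per-tensor scaling is an assumption.
    """
    qmax = 2 ** (bits - 1) - 1                 # 15 for int5, 31 for int6
    scale = (max(abs(x) for x in w) / qmax) or 1.0
    q = [max(-qmax - 1, min(qmax, round(x / scale))) for x in w]
    return [qi * scale for qi in q]

mlp_w = fake_quant([0.5, -1.0, 0.25], bits=5)   # int5 grid, as for MLP weights
attn_w = fake_quant([0.5, -1.0, 0.25], bits=6)  # int6 grid, as for attention weights
```

Running this from step 1 means the network trains against the exact grid it will be deployed on, which is the point of the early-QAT contribution listed below.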
Architecture

- GQA: grouped-query attention with fewer KV heads than query heads (heads: 8, kv_heads: 4)
- U-Net skip connections: U-Net-style skip connections added to the model
- Weight tying: input and output embeddings are tied
- MLP3x: 3x MLP expansion (expansion: 3)
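With 8 query heads and 4 KV heads, each KV head is shared by a fixed group of query heads, halving the KV projections relative to full multi-head attention. A minimal sketch of the standard GQA grouping rule (the routing function is generic GQA, not taken from the submission's code):

```python
def kv_head_for(q_head, n_heads=8, n_kv_heads=4):
    # Standard GQA grouping: consecutive query heads share one KV head.
    # With 8 query heads and 4 KV heads, each KV head serves a group of
    # 2 query heads, halving the K/V parameter and cache footprint.
    group_size = n_heads // n_kv_heads        # 2 in this configuration
    return q_head // group_size

# Query heads 0-1 attend through KV head 0, heads 2-3 through KV head 1, etc.
```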
Optimizer

- Muon (weight decay and momentum not reported)
- Adam used for embeddings and scalar parameters
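One common way to realize this split — Muon for matrix-shaped weights, Adam for embeddings and scalars — is to route parameters by name and dimensionality. The routing rule below is a hypothetical sketch inferred from the summary, not the submission's actual code:

```python
def optimizer_for(name, ndim):
    # Hypothetical routing rule: Muon updates 2-D weight matrices
    # (attention and MLP projections), while Adam handles embeddings,
    # norms, biases, and other scalar/vector parameters, matching the
    # "Adam for embeddings/scalars" note above.
    if ndim == 2 and "embed" not in name:
        return "muon"
    return "adam"
```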
Regularization

- Weight decay: 0.04
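If applied in decoupled (AdamW-style) form — an assumption about how the decay is applied; only the value 0.04 is reported above — each step shrinks weights toward zero independently of the gradient update:

```python
def decoupled_weight_decay(w, lr, wd=0.04):
    # Decoupled (AdamW-style) weight decay: multiply weights by
    # (1 - lr * wd) each step, separate from the gradient-based update.
    # wd=0.04 is the reported value; the decoupled form is an assumption.
    return [wi * (1.0 - lr * wd) for wi in w]
```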
Novel Contributions
- Mixed INT5/INT6 quantization with INT5 for MLP weights and INT6 for attention weights
- Quantization-aware training from step 1 using fake-quantized forward passes and STE
- Entropy-aware compression perspective showing QAT reduces weight entropy and improves compressibility
- Demonstrated that early QAT substantially outperforms late QAT for post-quantization quality
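The entropy-aware compression perspective can be illustrated with a small experiment, using stdlib zlib as a stand-in for zstd (an assumption made for portability): a byte stream drawn from a 32-symbol alphabet, like int5 weight codes, compresses far better than full-range bytes because a general-purpose entropy coder exploits the lower per-symbol entropy.

```python
import random
import zlib

random.seed(0)
n = 1 << 16
# Full-range bytes: ~8 bits/symbol of entropy, essentially incompressible.
raw = bytes(random.randrange(256) for _ in range(n))
# 32-symbol bytes (like int5 codes): ~5 bits/symbol, so an entropy coder
# can pack them well below their raw size.
codes = bytes(random.randrange(32) for _ in range(n))

raw_size = len(zlib.compress(raw, 9))
code_size = len(zlib.compress(codes, 9))
# The lower-entropy stream compresses substantially better, which is the
# mechanism behind QAT improving zstd compressibility of the artifact.
```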