| Metric | Value |
| --- | --- |
| val_bpb | 1.1456 |
| Architecture | Transformer |
| Optimizer | — |
| Artifact Size | 15.14 MB |
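The val_bpb metric above is bits per byte: mean cross-entropy converted from nats to bits, then rescaled by the tokens-to-bytes ratio of the eval text. A minimal sketch of that conversion; the loss, token count, and byte count below are made-up illustrative numbers, not the values behind the reported 1.1456:

```python
import math

# Hypothetical values for illustration only; the report does not state
# the raw loss or eval corpus size behind val_bpb = 1.1456.
mean_loss_nats = 0.9423   # mean cross-entropy per token, in nats
n_tokens = 250_000        # tokens in the eval split
n_bytes = 293_000         # UTF-8 bytes those tokens decode to

bits_per_token = mean_loss_nats / math.log(2)   # nats -> bits
val_bpb = bits_per_token * n_tokens / n_bytes   # bits per byte of raw text
print(round(val_bpb, 4))
```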
Training Techniques

- Quantization: mixed int6/int5/int4 (bits: 6; scope: MLP and attention)
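The mixed-bit-width quantization above can be sketched as symmetric round-to-nearest fake quantization. This is a minimal sketch assuming a per-tensor symmetric scheme; the report does not specify the exact quantization format:

```python
import numpy as np

def quantize_dequantize(w: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric round-to-nearest fake quantization to `bits` bits.
    Illustrative only: the actual scheme (per-tensor vs per-channel,
    symmetric vs asymmetric) is not stated in the report."""
    qmax = 2 ** (bits - 1) - 1                       # e.g. 31 for int6
    scale = max(float(np.abs(w).max()) / qmax, 1e-12)
    q = np.clip(np.round(w / scale), -qmax, qmax)    # integer grid
    return q * scale                                 # back to float

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)
for bits in (6, 5, 4):
    err = np.abs(quantize_dequantize(w, bits) - w).mean()
    print(bits, f"{err:.4f}")   # error grows as bit-width shrinks
```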
- Architecture: MoE, a 2-expert soft-routing mixture-of-experts replacing the dense MLP for parameter expansion (parameters: {"experts": 2, "expert_multiplier": 1.5})
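A 2-expert soft-routing MoE layer of the kind described here runs every token through both expert MLPs and blends their outputs with a softmax gate (no top-k routing). The sketch below uses made-up dimensions and initialization, not the actual run's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_experts = 64, 96, 2   # illustrative; expert_multiplier would scale d_ff

W_gate = rng.normal(scale=0.02, size=(d_model, n_experts))
W1 = rng.normal(scale=0.02, size=(n_experts, d_model, d_ff))
W2 = rng.normal(scale=0.02, size=(n_experts, d_ff, d_model))

def soft_moe_mlp(x: np.ndarray) -> np.ndarray:
    """x: (tokens, d_model). Soft routing: every token is processed by
    all experts and outputs are mixed by a per-token softmax gate."""
    logits = x @ W_gate                                    # (tokens, experts)
    g = np.exp(logits - logits.max(-1, keepdims=True))
    g /= g.sum(-1, keepdims=True)                          # gate weights sum to 1
    h = np.maximum(np.einsum("td,edf->etf", x, W1), 0.0)   # ReLU, per expert
    y = np.einsum("etf,efd->etd", h, W2)                   # (experts, tokens, d_model)
    return np.einsum("te,etd->td", g, y)                   # gate-weighted blend

out = soft_moe_mlp(rng.normal(size=(8, d_model)))
print(out.shape)
```

Soft routing keeps the layer fully differentiable (no load-balancing loss needed), at the cost of running every expert on every token.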
- Weight Averaging: SWA (parameters: {"checkpoints": 30})
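SWA over saved checkpoints reduces to a uniform average of parameter tensors. A minimal sketch, assuming checkpoints are plain dicts of arrays (the actual checkpoint format is not given in the report):

```python
import numpy as np

def average_checkpoints(checkpoints: list[dict]) -> dict:
    """Uniform SWA average of parameter dicts, accumulated in float64
    to avoid drift over many checkpoints."""
    avg = {k: np.zeros_like(v, dtype=np.float64) for k, v in checkpoints[0].items()}
    for ckpt in checkpoints:
        for k, v in ckpt.items():
            avg[k] += v
    return {k: (v / len(checkpoints)).astype(np.float32) for k, v in avg.items()}

rng = np.random.default_rng(0)
ckpts = [{"w": rng.normal(size=(4, 4)).astype(np.float32)} for _ in range(30)]
swa = average_checkpoints(ckpts)
```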
- Compression: zstd (level: 22)
- Evaluation: sliding-window eval (parameters: {"stride": 64})
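Sliding-window evaluation with stride 64 advances the context window 64 tokens at a time and scores only the newly exposed positions, so nearly every token is predicted with close to a full window of left context. A sketch with a dummy stand-in for the model's per-position loss; the window size of 2048 is taken from the train_length entry below, since eval_length is null:

```python
import numpy as np

def token_loss(context: np.ndarray) -> np.ndarray:
    """Stand-in for the model: returns a per-position loss in nats.
    A real implementation would run the network on `context`."""
    return np.full(len(context), 0.9)

def sliding_window_eval(tokens: np.ndarray, window: int = 2048, stride: int = 64) -> float:
    losses = []
    for start in range(0, len(tokens), stride):
        end = min(start + stride, len(tokens))
        ctx_start = max(0, end - window)              # up to `window` tokens of context
        per_pos = token_loss(tokens[ctx_start:end])
        losses.extend(per_pos[-(end - start):])        # score only the new tokens
    return float(np.mean(losses))

print(sliding_window_eval(np.zeros(5000, dtype=np.int64)))
```

The stride trades compute for fidelity: stride 64 reruns the model ~32x more often than non-overlapping 2048-token chunks, but avoids scoring tokens with truncated context.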
- Sequence Length: train_length 2048, eval_length null
- Other: quantization comparison across multiple attention/MLP bit-width configurations on the same trained dense model (parameters: {"configurations": ["attn6_mlp6", "attn6_mlp5", "attn6_mlp4", "attn5_mlp5", "attn5_mlp4"]})
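The listed configurations quantize attention and MLP weights of the same dense model at independent bit-widths. A hypothetical sweep over those configuration names, using random stand-in weights and mean absolute reconstruction error as an illustrative proxy metric (the report compares val_bpb, not weight error):

```python
import numpy as np

def fake_quant(w: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric round-to-nearest fake quantization (illustrative scheme)."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(float(np.abs(w).max()) / qmax, 1e-12)
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

rng = np.random.default_rng(0)
# Stand-in "model": one attention matrix and one MLP matrix.
model = {"attn": rng.normal(size=(128, 128)), "mlp": rng.normal(size=(128, 512))}

configs = ["attn6_mlp6", "attn6_mlp5", "attn6_mlp4", "attn5_mlp5", "attn5_mlp4"]
for name in configs:
    attn_bits, mlp_bits = int(name[4]), int(name[-1])   # parse "attnX_mlpY"
    err = sum(np.abs(fake_quant(model[k], b) - model[k]).mean()
              for k, b in (("attn", attn_bits), ("mlp", mlp_bits)))
    print(name, f"{err:.4f}")
```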
Novel Contributions
- Preliminary negative result for a 2-expert soft-routing MoE under the 16 MB artifact cap
- Leaderboard-relevant comparison of multi-bit post-training quantization on the same dense model
- Evidence that int5 MLP quantization is viable while int4 MLP quantization is destructive in this setup
- Partial MoE training log and checkpoint table documenting the observed degradation relative to dense control