val_bpb: 1.3540
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.15 MB
Training Techniques

Architecture
- BigramHash: specialist MoE with 128 bigram-gating specialists (FastClusterGating) for clustered token routing. parameters: {"specialists":128}
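The report does not spell out how FastClusterGating works; a minimal sketch of bigram-based specialist routing, assuming each position is hashed from its (previous token, current token) pair into one of the 128 specialists. Function names and the hash mixing constants here are hypothetical, not the submission's actual code:

```python
NUM_SPECIALISTS = 128  # matches the report's {"specialists":128}

def bigram_specialist(prev_token: int, cur_token: int) -> int:
    """Map a token bigram to a specialist index in [0, 128)."""
    # Simple multiplicative hash of the bigram; constants are illustrative,
    # any well-mixing integer hash would do.
    h = (prev_token * 1_000_003 + cur_token) * 2_654_435_761
    return (h >> 16) % NUM_SPECIALISTS

def route(tokens: list[int]) -> list[int]:
    """Assign each position (after the first) to a specialist
    based on the bigram ending at that position."""
    return [bigram_specialist(p, c) for p, c in zip(tokens, tokens[1:])]
```

Because routing is a fixed hash rather than a learned gate, it is cheap and deterministic, which is presumably the point of "Fast" in the name.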
- Transformer: 800-dimensional Transformer with 6 layers and 10 attention heads per layer. parameters: {"dimensions":800,"layers":6,"heads":10}
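The stated dimensions imply a per-head dimension of 800 / 10 = 80. A small config sketch (the class name is illustrative, not from the submission):

```python
from dataclasses import dataclass

@dataclass
class TransformerConfig:
    dim: int = 800     # model width, from the report
    layers: int = 6    # transformer blocks, from the report
    heads: int = 10    # attention heads per layer, from the report

    @property
    def head_dim(self) -> int:
        # Width must divide evenly across heads.
        assert self.dim % self.heads == 0
        return self.dim // self.heads
```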
Optimizer
- Muon: used for internal representations. weight_decay: null, momentum: null
- Adam: used for the cluster parameters. weight_decay: null, momentum: null
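The report does not say how parameters are assigned to each optimizer. One plausible split, assuming cluster/gating parameters are identifiable by name (the name-matching rule here is an assumption for illustration):

```python
def split_param_groups(named_params):
    """Partition (name, param) pairs into a Muon group for internal
    representations and an Adam group for cluster/gating parameters."""
    muon_group, adam_group = [], []
    for name, param in named_params:
        # Assumed convention: cluster/gating tensors carry these substrings.
        if "gate" in name or "cluster" in name:
            adam_group.append(param)
        else:
            muon_group.append(param)
    return muon_group, adam_group
```

Muon's orthogonalized updates suit dense 2-D weight matrices, while Adam is the safer default for the small, irregular gating tensors, which is one reason such a split is common.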
Weight Averaging
- SWA (Stochastic Weight Averaging). parameters: null
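SWA keeps a running average of weights sampled along the training trajectory and evaluates with the averaged weights. A minimal pure-Python sketch of the running average (a real run would average model tensors, typically only over the late phase of training):

```python
class SWAAverager:
    """Maintain an equal-weight running average of sampled weight vectors."""

    def __init__(self):
        self.n = 0      # number of snapshots averaged so far
        self.avg = None # current average, same shape as the weights

    def update(self, weights):
        # Incremental mean: avg += (w - avg) / n
        self.n += 1
        if self.avg is None:
            self.avg = list(weights)
        else:
            self.avg = [a + (w - a) / self.n for a, w in zip(self.avg, weights)]
```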
Quantization
- mixed int5/int6: 5-bit quantization for the MLP weights, 6-bit for the remaining parameters. bits: 5, scope: MLP/rest
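A symmetric k-bit quantizer illustrates the scheme: 5 bits for MLP weights (integer range -16..15) and 6 bits for the rest (-32..31). This sketch quantizes a flat list with a single per-tensor scale; the submission's exact grouping and rounding are not specified:

```python
def quantize(values, bits):
    """Symmetric k-bit quantization with one per-tensor scale."""
    qmax = 2 ** (bits - 1) - 1            # e.g. 15 for int5, 31 for int6
    scale = max(abs(v) for v in values) / qmax or 1.0  # avoid scale == 0
    q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from quantized integers."""
    return [v * scale for v in q]
```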
Compression
- zlib. level: null
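The quantized integers can then be serialized and zlib-compressed to produce the final artifact. A sketch using one int8 container byte per value (the real artifact presumably bit-packs to 5/6 bits; `level: null` is read here as zlib's default level):

```python
import struct
import zlib

def pack_and_compress(q_values, scale):
    """Serialize a scale (float32) plus int8 quantized values, then zlib-compress."""
    raw = struct.pack(f"<f{len(q_values)}b", scale, *q_values)
    return zlib.compress(raw)  # default compression level

def decompress_and_unpack(blob, n):
    """Inverse of pack_and_compress for n quantized values."""
    raw = zlib.decompress(blob)
    vals = struct.unpack(f"<f{n}b", raw)
    return list(vals[1:]), vals[0]
```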
Other
- Hard 600-second wallclock limit enforced in the training script. parameters: {"max_wallclock_seconds":600}
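A wallclock guard of this kind can be a monotonic-clock check consulted between training steps, stopping the loop before the 600-second budget is exceeded (the class name is hypothetical):

```python
import time

class WallclockGuard:
    """Track elapsed wallclock time against a hard budget."""

    def __init__(self, limit_seconds=600.0):
        self.limit = limit_seconds
        self.start = time.monotonic()  # monotonic: immune to clock adjustments

    def expired(self):
        """True once the budget is used up; checked between training steps."""
        return time.monotonic() - self.start >= self.limit
```

A training loop would typically check `guard.expired()` each step and break out early, leaving a margin for saving the artifact.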
Sequence Length
- train_length: 1024, eval_length: null
Novel Contributions
- 128-cluster specialist MoE architecture with FastClusterGating
- Hybrid 5-bit MLP / 6-bit rest quantization scheme
- Stochastic Weight Averaging for improved generalization
- Muon optimizer for internal representations with Adam for clusters
- Hard 600-second wallclock compliance guard
- Artifact size kept under the 16 MB limit