val_bpb: 1.1605
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.28 MB
Training Techniques

Quantization: int6
- bits: 6
- scope: all large 2D weight matrices
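A minimal sketch of the quantizer, assuming symmetric per-row scaling (the report specifies 6 bits over all large 2D weight matrices and per-row quantization; the exact rounding and clipping details below are illustrative):

```python
def quantize_row_int6(row):
    """Symmetric per-row quantization to 6-bit integers in [-31, 31].

    Assumed scheme: one float scale per row, values rounded to the
    nearest representable level. Only the bit width and per-row scope
    come from the report; everything else is illustrative.
    """
    qmax = 2 ** (6 - 1) - 1                      # 31 levels on each side of zero
    scale = max(abs(v) for v in row) / qmax or 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in row]
    return q, scale

def dequantize_row(q, scale):
    return [v * scale for v in q]

# Quantize a toy 2D weight matrix row by row.
weights = [[0.12, -0.5, 0.31], [1.0, -0.02, 0.77]]
quantized = [quantize_row_int6(row) for row in weights]
restored = [dequantize_row(q, s) for q, s in quantized]
```

Per-row scales keep the quantization error of each row proportional to that row's own magnitude, which matters when weight magnitudes vary widely across rows.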
Architecture

MLP3x
- Expanded the MLP hidden size from the baseline 1024 to 1536 (a 1.5x increase), enabled by the int6 artifact savings.
- parameters: {"MLP_HIDDEN": 1536}
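The artifact-budget arithmetic behind the wider MLP can be checked directly. This sketch uses a placeholder model width of 512 (not stated in the report) and compares an fp16 baseline against the int6 export:

```python
def mlp_artifact_bits(model_dim, hidden, bits_per_weight):
    """Artifact cost of one MLP block, assuming the usual two weight
    matrices (model_dim -> hidden and hidden -> model_dim).
    model_dim=512 below is an illustrative placeholder."""
    return 2 * model_dim * hidden * bits_per_weight

baseline = mlp_artifact_bits(512, 1024, 16)   # fp16, hidden size 1024
widened  = mlp_artifact_bits(512, 1536, 6)    # int6, hidden size 1536
# int6 at hidden 1536 is still smaller than fp16 at hidden 1024,
# so the wider MLP fits inside the quantization savings.
```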
MTP auxiliary head
- Added a training-only multi-token prediction head that predicts token i+2 from hidden state i; it is excluded from the exported artifact.
- parameters: {"num_heads": 1, "loss_weight": 0.01}
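The target alignment and loss mixing for the auxiliary head can be sketched in plain Python. This is a toy illustration of the bookkeeping only; the actual head is a learned projection over the transformer's hidden states:

```python
def mtp_targets(tokens, offset=2):
    """Pair each position i with the token `offset` steps ahead.

    For the auxiliary head, position i is trained to predict token
    i+offset, so the final `offset` positions have no target.
    """
    return list(zip(tokens[:-offset], tokens[offset:]))

def combined_loss(main_loss, mtp_loss, loss_weight=0.01):
    # The auxiliary loss is down-weighted (0.01 per the report) and only
    # shapes training; the head itself never reaches the exported artifact.
    return main_loss + loss_weight * mtp_loss

pairs = mtp_targets([5, 9, 2, 7, 4])
# each (input_token, target_token) pair skips one position:
# (5, 2), (9, 7), (2, 4)
```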
Tied embeddings
- Kept the tied embedding matrix in fp16 during export instead of quantizing it.
- parameters: {"fp16_export": 1}
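One way to picture the export rule is a per-tensor dtype dispatch. The tensor names and the "large" threshold below are assumptions; only the int6-for-large-2D-matrices and fp16-for-tied-embeddings decisions come from the report:

```python
def export_dtype(name, shape):
    """Choose an export format per tensor (illustrative rule).

    Large 2D weight matrices go to int6; the tied embedding matrix is
    kept in fp16 to avoid quantization error on the shared
    input/output embedding; everything else stays fp16.
    """
    if name == "tied_embedding":              # assumed tensor name
        return "fp16"
    if len(shape) == 2 and min(shape) >= 256: # "large" threshold assumed
        return "int6"
    return "fp16"                             # norms, biases, small tensors
```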
Optimizer: Muon
- weight_decay: none
- momentum: 0.99
- other_params: {"matrix_lr": 0.02, "scalar_lr": 0.02, "tied_embed_lr": 0.03, "muon_momentum_warmup_steps": 1500, "muon_momentum_warmup_start": 0.92}
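The momentum warmup implied by `muon_momentum_warmup_steps` and `muon_momentum_warmup_start` can be sketched as a simple schedule. A linear ramp is assumed; the report gives only the endpoints (0.92 to 0.99) and the step count (1500):

```python
def muon_momentum(step, warmup_steps=1500, start=0.92, end=0.99):
    """Ramp Muon's momentum from `start` to `end` over `warmup_steps`,
    then hold at `end`. The linear shape is an assumption."""
    if step >= warmup_steps:
        return end
    return start + (end - start) * step / warmup_steps
```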
Compression: zstd
- level: 22
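The pack-then-compress step looks roughly like the following. The actual artifact uses zstd at level 22 (e.g. via the `zstandard` package); zlib stands in here so the sketch stays stdlib-only, and the byte layout (scales first, then values) is illustrative:

```python
import struct
import zlib

def pack_and_compress(q_rows, scales):
    """Serialize per-row scales and int6 values, then compress.

    Values are stored one per byte for simplicity; a real exporter
    would bit-pack four 6-bit values into three bytes. zlib is a
    stand-in for zstd level 22 used in the actual pipeline.
    """
    payload = b"".join(struct.pack("f", s) for s in scales)
    payload += bytes((v + 32) & 0x3F for row in q_rows for v in row)
    return zlib.compress(payload, 9)
```

Entropy coding works well after quantization because the int6 values are drawn from a narrow, peaked distribution, which is why compression is paired with quantization in the artifact-size budget.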
Evaluation: sliding window eval
- parameters: {"stride": 512}
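The window bookkeeping for strided evaluation can be sketched as follows: each window sees up to a full 4096-token context, but only the tokens not already scored by the previous window contribute to the loss, so every token is scored exactly once with near-full left context. The report gives only the stride (512); the exact bookkeeping is an assumption:

```python
def sliding_windows(n_tokens, context=4096, stride=512):
    """Return (window_start, first_scored, window_end) spans.

    Windows advance by `stride`; tokens in [first_scored, window_end)
    are scored, so scored spans tile [0, n_tokens) with no overlap.
    """
    spans = []
    start, prev_end = 0, 0
    while prev_end < n_tokens:
        end = min(start + context, n_tokens)
        spans.append((start, prev_end, end))
        prev_end = end
        start += stride
    return spans
```

With stride 512 and context 4096, every scored token after the first window has at least 3584 tokens of left context, which is what makes the evaluation "near-full-context".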
Sequence Length
- train_length: 4096
- eval_length: 4096
LR Schedule: warmdown
- parameters: {"warmdown_steps": 3000}
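A warmdown schedule of this kind is typically flat until the final stretch, then decays linearly to zero. That trapezoidal shape is assumed here; the report specifies only the 3000-step warmdown length:

```python
def lr_scale(step, total_steps, warmdown_steps=3000):
    """Multiplier on the base learning rate: 1.0 until the last
    `warmdown_steps`, then linear decay to 0.0 (shape assumed)."""
    remaining = total_steps - step
    if remaining >= warmdown_steps:
        return 1.0
    return max(0.0, remaining / warmdown_steps)
```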
Other
- Co-optimized training dynamics (lower learning rate, higher momentum, longer warmdown) to improve int6 quantization behavior.
- parameters: {"matrix_lr": 0.02, "muon_momentum": 0.99, "warmdown_iters": 3000}
Novel Contributions
- Int6 per-row quantization with zstd-22 compression to reduce artifact size
- 3x wider MLP enabled by quantization savings
- Training-only MTP auxiliary head excluded from the artifact
- FP16 tied embedding passthrough to avoid quantization error on shared embeddings
- Sliding window evaluation with stride 512 for near-full-context scoring
- Long-context training at sequence length 4096
- Training dynamics tuned for better int6 quantization behavior