val_bpb: 1.2154
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.91MB
Training Techniques
Quantization
int8
bits: 8
scope: all
Architecture
tied embeddings
Keeps tok_emb.weight tied and stores it in fp16 during int8 export to reduce quantization damage.
parameters: null
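A minimal sketch of what this export could look like, assuming symmetric per-tensor int8 quantization and NumPy arrays. The function and key names (`quantize_int8`, `export_checkpoint`, `"tok_emb.weight"` as the tied-embedding key) are illustrative, not the project's actual API:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: w ~= scale * q."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def export_checkpoint(params, tied_embedding_key="tok_emb.weight"):
    """Quantize every weight to int8 except the tied embedding.

    The embedding doubles as the output head, so quantization error
    there hurts twice; it is kept in fp16 instead (per the record above).
    """
    out = {}
    for name, w in params.items():
        if name == tied_embedding_key:
            out[name] = w.astype(np.float16)
        else:
            out[name] = quantize_int8(w)
    return out
```

Keeping only the embedding in fp16 costs little: at 15.91MB total artifact size, one fp16 tensor adds far less than quantizing it would save.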
RoPE
Uses NTK-RoPE extrapolation; the best-performing evaluation length (1408) is shorter than the maximum context.
parameters: null
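A sketch of NTK-aware RoPE frequency scaling under the usual base-rescaling formulation. The training length is `null` in the record, so the `train_len` value in any call is an assumption; only the eval length of 1408 comes from the record:

```python
import numpy as np

def ntk_rope_freqs(head_dim, train_len, eval_len, base=10000.0):
    """NTK-aware RoPE: stretch the rotary base so eval-length positions
    map into the frequency range seen during training."""
    scale = max(eval_len / train_len, 1.0)
    # Standard NTK exponent: base' = base * scale^(d / (d - 2))
    ntk_base = base * scale ** (head_dim / (head_dim - 2))
    inv_freq = 1.0 / ntk_base ** (np.arange(0, head_dim, 2) / head_dim)
    t = np.arange(eval_len)
    return np.outer(t, inv_freq)  # rotation angles, shape (eval_len, head_dim // 2)
```

The record's finding that eval length 1408 beats the maximum context is consistent with NTK extrapolation degrading gradually as `scale` grows.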
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"backend_steps":5}
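Muon's core step orthogonalizes each 2D gradient via Newton-Schulz iteration; a plausible reading of `backend_steps: 5` is the number of those iterations, though the record does not say so explicitly. A NumPy sketch of the standard quintic iteration:

```python
import numpy as np

def newton_schulz_orth(G, steps=5):
    """Approximately orthogonalize G (the heart of a Muon update).

    `steps` is assumed to correspond to backend_steps in the record;
    coefficients are the commonly used quintic-iteration constants.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)  # normalize so singular values < 1
    transposed = G.shape[0] > G.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X  # push singular values toward 1
    return X.T if transposed else X
```

Fewer iterations give a rougher orthogonalization; the record's observation that 5 beats 7 under aggressive warmdown suggests the extra smoothing is not always worth it late in training.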
Evaluation
long context eval
parameters: {"context_length":1408}
Sequence Length
sequence_length
train_length: null
eval_length: 1408
LR Schedule
warmdown
parameters: {"warmdown_iters":20000}
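The record lists a warmdown schedule with `warmdown_iters: 20000` and the contributions below call it "always-decaying"; one plausible reading is a schedule with a mild decay in the main phase instead of a flat plateau, followed by a linear warmdown to zero. A sketch under that assumption (`base_decay` and `total_iters` are hypothetical, not from the record):

```python
def lr_scale(step, total_iters, warmdown_iters=20000, base_decay=0.1):
    """Always-decaying schedule sketch: gentle linear decay during the
    main phase, then a steeper linear warmdown to zero over the final
    `warmdown_iters` steps. Returns a multiplier on the base LR."""
    main_end = total_iters - warmdown_iters
    if step < main_end:
        return 1.0 - base_decay * step / main_end
    # Warmdown phase: continues from (1 - base_decay) down to 0.
    return (1.0 - base_decay) * (total_iters - step) / warmdown_iters
```

The two segments meet continuously at `main_end`, so the warmdown starts from wherever the main-phase decay left off.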
Other
other
Uses FP16 tied embeddings during int8 export and reduces MLP hidden size to 992 to offset the added memory cost.
parameters: {"mlp_hidden":992}
Novel Contributions
- Always-decaying LR schedule with WARMDOWN_ITERS=20000 to reduce post-training quantization penalty.
- Keeping tied embeddings in fp16 during int8 export to preserve accuracy.
- Using NTK-RoPE extrapolation at eval length 1408 as the best setting for well-trained models.
- Finding an optimizer-warmdown interaction where MUON_BACKEND_STEPS=5 outperforms 7 under aggressive warmdown.