val_bpb: 1.1764
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.88 MB
Training Techniques

Optimizer: Muon
  weight_decay: null
  momentum: 0.99
  other_params: {"matrix_lr": 0.02, "scalar_lr": 0.02, "tied_embed_lr": 0.03, "momentum_warmup_start": 0.92, "momentum_warmup_steps": 1500}
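The momentum warmup implied by `momentum_warmup_start=0.92` and `momentum_warmup_steps=1500` can be sketched as a schedule function. The linear ramp shape is an assumption; the entry only records the start value, the final momentum (0.99), and the step count.

```python
def muon_momentum(step: int,
                  start: float = 0.92,
                  final: float = 0.99,
                  warmup_steps: int = 1500) -> float:
    """Ramp momentum linearly from `start` to `final` over `warmup_steps`,
    then hold it constant. Linear interpolation is an assumption here;
    only the endpoints and step count come from the entry."""
    if step >= warmup_steps:
        return final
    return start + (step / warmup_steps) * (final - start)
```

Each optimizer step would read its momentum from this schedule rather than using the fixed 0.99 from step zero.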
Sequence Length: sequence_length
  train_length: 2048
  eval_length: 2048
Evaluation: sliding window eval
  parameters: {"stride": 512, "context_length": 2048}
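A minimal sketch of how the `stride=512`, `context_length=2048` windows could be laid out. It assumes the common convention that the first window scores all of its tokens and every later window scores only its final `stride` tokens, so each scored token sees at least `context - stride` tokens of left context; the entry itself records only the two parameter values.

```python
def sliding_windows(n_tokens: int, context: int = 2048, stride: int = 512):
    """Yield (window_start, window_end, score_start) triples.
    The model sees tokens [window_start, window_end), but loss is
    accumulated only on [score_start, window_end). Assumes
    n_tokens >= context and n_tokens aligned to the stride."""
    windows = []
    start = 0
    while True:
        end = min(start + context, n_tokens)
        score_start = start if start == 0 else start + (context - stride)
        windows.append((start, end, score_start))
        if end == n_tokens:
            break
        start += stride
    return windows
```

With this layout the scored spans tile the sequence exactly once, so the sliding-window bpb is comparable across runs trained at different sequence lengths.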
LR Schedule: warmdown
  parameters: {"warmdown_iters": 3000}
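The warmdown schedule can be sketched as a constant learning rate followed by a linear decay to zero over the final `warmdown_iters` steps. The constant-then-linear shape is an assumption; the entry records only `warmdown_iters=3000`.

```python
def warmdown_lr(step: int, total_steps: int, base_lr: float,
                warmdown_iters: int = 3000) -> float:
    """Hold base_lr constant, then decay linearly to zero over the last
    `warmdown_iters` steps. The shape is an assumption; only
    warmdown_iters=3000 comes from the entry."""
    if step < total_steps - warmdown_iters:
        return base_lr
    remaining = total_steps - step
    return base_lr * remaining / warmdown_iters
```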
Regularization: gradient clipping
  parameters: {"norm": 0.3}
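Standard global-norm gradient clipping with the entry's `norm=0.3`, sketched on a flat list of floats for illustration (a real run would clip the concatenated gradients of all parameters, e.g. via a framework utility).

```python
import math

def clip_grad_norm(grads, max_norm: float = 0.3):
    """Scale all gradients so their global L2 norm is at most max_norm.
    `grads` is a flat list of floats standing in for all parameter
    gradients; max_norm=0.3 is the value recorded in the entry."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads)
    scale = max_norm / total_norm
    return [g * scale for g in grads]
```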
Architecture: tied embeddings
  Uses tied embedding parameters with a separate learning rate.
  parameters: null
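A minimal sketch of weight tying, assuming the standard scheme in which a single matrix W (vocab_size x d_model) serves both as the input embedding table and as the output projection. Under that assumption, the separate `tied_embed_lr=0.03` from the optimizer config would apply to this one parameter while other matrices use `matrix_lr=0.02`; the param-group names below are illustrative placeholders.

```python
# One shared matrix: vocab_size=3, d_model=2 (toy values for illustration).
W = [[0.1, 0.2],   # embedding row for token 0
     [0.3, 0.4],   # token 1
     [0.5, 0.6]]   # token 2

def embed(token_id):
    """Input side: look up the token's row of the shared matrix W."""
    return W[token_id]

def logits(hidden):
    """Output side: project the hidden state against the same matrix W."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in W]

# Hypothetical optimizer param groups reflecting the entry's learning rates.
param_groups = [
    {"params": ["W"], "lr": 0.03},               # tied_embed_lr
    {"params": ["other_matrices"], "lr": 0.02},  # matrix_lr
]
```

Tying halves the embedding-related parameter count, which matters for the small artifact size reported above; the separate learning rate lets the shared matrix be tuned independently of the other weight matrices.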
Novel Contributions
- Training at 2048 tokens performs identically to 4096 tokens under sliding-window evaluation, so shorter training sequences are preferable within the time budget.
- A narrow gradient-clipping sweet spot was found for long-sequence training, with a max-norm of 0.3 outperforming the other values tested.
- Batch size 786,432 tokens was identified as the best tradeoff for training at 2048-token sequences.
- Quantization-aware warmdown from an earlier PR reduces the post-quantization penalty, but only at higher base learning rates.