val_bpb: 1.2098
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,872,012 bytes
Training Techniques
Sequence Length
  train_length: 2048
  eval_length: null
Optimizer: Muon
  weight_decay: null
  momentum: 0.985
  other_params: {"warmup_from": 0.9, "warmup_steps": 500}
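The `warmup_from`/`warmup_steps` parameters above describe warming Muon's momentum from 0.9 to its final value of 0.985 over the first 500 steps. A minimal sketch of that schedule, assuming a linear ramp (the function name and the linear shape are assumptions; the entry does not specify the interpolation):

```python
def muon_momentum(step, warmup_from=0.9, target=0.985, warmup_steps=500):
    """Linearly warm Muon's momentum from warmup_from to target
    over the first warmup_steps optimizer steps, then hold it."""
    if step >= warmup_steps:
        return target
    frac = step / warmup_steps
    return warmup_from + frac * (target - warmup_from)
```

The returned value would be passed to the optimizer each step in place of a fixed momentum constant.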
LR Schedule: warmdown
  parameters: {"warmdown_steps": 3000}
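A warmdown schedule holds the learning rate constant and then decays it over the final steps of training. A minimal sketch, assuming a linear decay to zero and a hypothetical `total_steps` (neither the decay shape nor the total step count is given in the entry):

```python
def lr_scale(step, total_steps, warmdown_steps=3000):
    """Multiplier on the base learning rate: 1.0 until the final
    warmdown_steps, then a linear ramp down to 0.0 at total_steps."""
    warmdown_start = total_steps - warmdown_steps
    if step < warmdown_start:
        return 1.0
    return max(0.0, (total_steps - step) / warmdown_steps)
```

This multiplier would typically be applied via a lambda-style LR scheduler around the base learning rate.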
Weight Averaging: EMA
  parameters: {"decay": 0.997}
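EMA weight averaging maintains a shadow copy of the parameters updated as an exponential moving average after each training step; the shadow weights are then used for evaluation. A minimal sketch over flat parameter lists (the function name is an assumption):

```python
def ema_update(avg_params, params, decay=0.997):
    """One EMA step: avg <- decay * avg + (1 - decay) * current.
    With decay 0.997, the average tracks roughly the last ~333 steps."""
    return [decay * a + (1 - decay) * p for a, p in zip(avg_params, params)]
```

In practice the shadow copy is initialized from the model weights and updated in place after every optimizer step.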
Quantization: GPTQ-lite
  bits: 8
  scope: per-row
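Per-row int8 quantization stores one scale per weight-matrix row, so each row's full int8 range can be used independently. The sketch below shows only the symmetric absmax round-trip, not the error-compensating update that GPTQ-style methods add on top (function names and the absmax scaling choice are assumptions):

```python
def quantize_per_row(matrix):
    """Symmetric per-row int8 quantization: scale = row absmax / 127,
    stored alongside the rounded int8 values."""
    q_rows, scales = [], []
    for row in matrix:
        scale = max(abs(v) for v in row) / 127 or 1.0  # guard all-zero rows
        q_rows.append([round(v / scale) for v in row])
        scales.append(scale)
    return q_rows, scales

def dequantize_per_row(q_rows, scales):
    """Recover approximate float weights: value = int8 * row scale."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]
```

The worst-case per-element error of this round trip is half a quantization step, i.e. `scale / 2` for that row.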
Architecture

Weight tying
  Tied input and output embeddings.
  parameters: null
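With tied embeddings, the token-embedding matrix and the output (unembedding) projection are the same tensor, halving the embedding parameter count. A minimal dependency-free sketch (the class and its initialization are illustrative assumptions):

```python
class TiedEmbedding:
    """One weight matrix serves both roles:
    - embed: token id -> vector (row lookup)
    - logits: hidden vector -> per-token scores (dot with every row)"""

    def __init__(self, vocab_size, dim):
        # Placeholder deterministic init; real models use random init.
        self.weight = [[0.01 * (i + j) for j in range(dim)]
                       for i in range(vocab_size)]

    def embed(self, token_id):
        return self.weight[token_id]

    def logits(self, hidden):
        return [sum(w * h for w, h in zip(row, hidden)) for row in self.weight]
```

Because `embed` and `logits` read the same rows, any gradient update to one role immediately affects the other.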
KV head count
  Grouped-query attention with fewer KV heads than query heads.
  parameters: {"num_heads": 8, "num_kv_heads": 4}
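With 8 query heads and 4 KV heads, consecutive query heads share a KV head in groups of two, halving the KV-cache size. A minimal sketch of the head mapping (the function name is an assumption; this shows only the grouping, not the attention computation):

```python
def kv_head_for_query(q_head, num_heads=8, num_kv_heads=4):
    """Map a query-head index to the KV head it attends with.
    num_heads must be a multiple of num_kv_heads."""
    group_size = num_heads // num_kv_heads
    return q_head // group_size
```

Each KV head's keys and values are computed once and reused by its whole group of query heads.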
Novel Contributions
- Longer training context length
- Muon momentum warmup
- Extended warmdown schedule
- EMA weight averaging
- Per-row GPTQ-lite int8 quantization
- Wallclock-aware training schedule
- Tied embeddings
- Reduced KV head count