- val_bpb: 1.1925
- Architecture: Transformer
- Optimizer: Muon
- Artifact Size: 15,934,552 bytes
Training Techniques
Quantization: int8
- bits: 8
- scope: all
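A minimal sketch of what symmetric per-row int8 quantization of weight matrices looks like; the exact routine used here is not given in the card, so treat this as an illustration, not the submission's implementation.

```python
import numpy as np

def quantize_int8_per_row(w: np.ndarray):
    """Symmetric per-row int8 quantization: one scale per row,
    chosen so the row's largest-magnitude entry maps to 127."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero rows
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8)).astype(np.float32)
q, s = quantize_int8_per_row(w)
w_hat = dequantize_int8(q, s)
print(np.abs(w - w_hat).max())  # at most half a quantization step per row
```

Rounding error is bounded by half a step (scale / 2) per row, which is why the choice of scale, refined below under "Other", matters for fidelity.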
Architecture: tied embeddings
Input and output embeddings are tied.
- parameters: null
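Tying means one shared matrix serves as both the input embedding table and the output projection. A small numpy sketch of the idea (dimensions are illustrative, not the submission's):

```python
import numpy as np

vocab_size, d_model = 1000, 64
rng = np.random.default_rng(0)

# One shared matrix: its rows are the token embeddings, and its
# transpose is the output projection (no separate unembedding matrix).
W_emb = (rng.normal(size=(vocab_size, d_model)) / np.sqrt(d_model)).astype(np.float32)

def embed(token_ids):
    return W_emb[token_ids]        # input side: row lookup

def logits(hidden):
    return hidden @ W_emb.T        # output side: project onto the same rows

h = embed(np.array([1, 2, 3]))     # stand-in for transformer hidden states
out = logits(h)
print(out.shape)                   # one logit per vocabulary entry
```

Tying removes the separate vocab_size × d_model unembedding matrix, which is a meaningful saving in a ~16 MB artifact.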
Optimizer: Muon
- weight_decay: null
- momentum: 0.99
- other_params: {"momentum_warmup_start": 0.92, "momentum_warmup_steps": 1500}
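The parameters imply momentum ramps from 0.92 to 0.99 over the first 1,500 steps. A sketch of one plausible schedule; the card only records the endpoints and the step count, so the linear shape is an assumption:

```python
def momentum_at(step: int,
                start: float = 0.92,
                final: float = 0.99,
                warmup_steps: int = 1500) -> float:
    """Ramp momentum linearly from `start` to `final` over
    `warmup_steps`, then hold it constant. The linear shape is an
    assumption; only the endpoints and step count are given."""
    if step >= warmup_steps:
        return final
    return start + (step / warmup_steps) * (final - start)

print(momentum_at(0), momentum_at(750), momentum_at(1500))
```

Starting with lower momentum reduces the influence of noisy early gradients before settling at the long-run value of 0.99.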
Compression: zlib
- level: null
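A sketch of the compression step using Python's stdlib zlib; "level: null" is read here as the library default, which is an assumption. Quantized int8 weights compress well because the byte values cluster near zero:

```python
import zlib
import numpy as np

# Synthetic stand-in for int8-quantized weights: values concentrated
# near zero, so zlib's entropy coding has plenty to exploit.
rng = np.random.default_rng(0)
q = np.clip(np.round(rng.normal(scale=20, size=100_000)),
            -127, 127).astype(np.int8)

raw = q.tobytes()
packed = zlib.compress(raw)      # default level, matching "level: null"
restored = np.frombuffer(zlib.decompress(packed), dtype=np.int8)

print(len(raw), len(packed))     # compressed payload is smaller
```

The round trip is lossless, so compression only shrinks the artifact; it never changes the deployed weights.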
Sequence Length
- train_length: 4096
- eval_length: null
LR Schedule: warmdown
- parameters: {"warmdown_steps": 3000}
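A sketch of a warmdown schedule: hold the base learning rate, then decay over the final 3,000 steps. The card records only the step count, so the linear-to-zero shape and the surrounding constants are assumptions:

```python
def lr_at(step: int, total_steps: int, base_lr: float,
          warmdown_steps: int = 3000) -> float:
    """Hold base_lr, then decay linearly to 0 over the last
    `warmdown_steps` steps. The shape is an assumption; the card
    only records warmdown_steps=3000."""
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    remaining = total_steps - step
    return base_lr * remaining / warmdown_steps

total = 10_000  # hypothetical run length for illustration
print(lr_at(0, total, 0.01), lr_at(8_500, total, 0.01), lr_at(total, total, 0.01))
```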
Other
- Tighter int8 clipping percentile to retain more of the weight distribution tail.
  parameters: {"int8_clip_percentile": 99.99995}
- Higher-precision per-row quantization scales using float32 instead of float16.
  parameters: {"int8_per_row_scale_dtype": "float32"}
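A sketch combining the two tweaks above: each row's scale is set from the 99.99995th percentile of |w| rather than the absolute max, and scales are kept in float32 rather than float16. The exact procedure is inferred from the parameter names, so treat this as an illustration:

```python
import numpy as np

def quantize_int8_clipped(w: np.ndarray,
                          clip_percentile: float = 99.99995,
                          scale_dtype=np.float32):
    """Per-row int8 quantization with percentile clipping.
    Mapping the `clip_percentile` of |w| (instead of the absolute max)
    to 127 keeps a single outlier from inflating the scale; a
    percentile this close to 100 sacrifices almost none of the tail.
    Scales are stored in `scale_dtype` (float32 here, not float16)."""
    clip = np.percentile(np.abs(w), clip_percentile, axis=1, keepdims=True)
    scale = (clip / 127.0).astype(scale_dtype)
    scale = np.where(scale == 0, scale_dtype(1.0), scale)
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(2)
w = rng.normal(size=(2, 1000)).astype(np.float32)
q, s = quantize_int8_clipped(w)
print(q.dtype, s.dtype)
```

With a max-based scale, one extreme weight widens the quantization step for its whole row; clipping at a near-100 percentile trades a tiny amount of clipping error on that outlier for finer resolution everywhere else, and float32 scales avoid the rounding a float16 scale would add on top.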
Novel Contributions
- Tighter int8 clipping percentile (99.99995) to preserve more tail weights
- Higher-precision per-row int8 scales using float32
- Muon optimizer tuning with momentum 0.99 and momentum warmup
- Extended warmdown schedule