val_bpb: 1.2012
Architecture: Transformer
Optimizer: NorMuon
Artifact Size: 14.3 MB
Training Techniques

Quantization: STE QAT
- bits: 6
- scope: all
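A minimal sketch of how int6 quantization-aware training with a straight-through estimator (STE) is commonly implemented in PyTorch; the function name and per-tensor scaling scheme are illustrative assumptions, not the submission's actual code.

```python
import torch

def fake_quant_ste(w: torch.Tensor, bits: int = 6) -> torch.Tensor:
    # Symmetric per-tensor fake quantization. The forward pass uses the
    # int6 grid; the backward pass treats quantization as the identity
    # (straight-through estimator).
    qmax = 2 ** (bits - 1) - 1                    # 31 for 6-bit signed
    scale = w.abs().max().clamp(min=1e-8) / qmax  # map |w|max onto qmax
    w_q = (w / scale).round().clamp(-qmax, qmax) * scale
    # Detach the rounding error so gradients flow to w unchanged.
    return w + (w_q - w).detach()
```

With scope set to all, every weight tensor would pass through such a function on each training forward; at export, the rounded integer values are what get stored (and later zstd-compressed).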
Architecture: untied embeddings
Uses separate input and output embedding matrices instead of weight tying.
- parameters: {"tie_embeddings": 0}
Optimizer: Muon (NorMuon variant)
- weight_decay: null
- momentum: null
- tuned_learning_rates: {"input_embeddings": 0.6, "output_head": 0.008}
Compression: zstd
- level: 22
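A minimal sketch of the artifact-compression step using the zstandard Python bindings; the file names are placeholders, not paths from the record.

```python
import zstandard as zstd

def compress_artifact(src: str, dst: str) -> None:
    # Level 22 is zstd's maximum: slowest to compress, smallest output.
    # Int6-quantized weights compress well because each value spans a
    # narrow range of bit patterns.
    cctx = zstd.ZstdCompressor(level=22)
    with open(src, "rb") as f_in, open(dst, "wb") as f_out:
        f_out.write(cctx.compress(f_in.read()))

compress_artifact("model_int6.bin", "model_int6.bin.zst")  # placeholder names
```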
Sequence Length
- train_length: 1024
- eval_length: null
Other
Uses a larger SP4096 SentencePiece BPE tokenizer trained on FineWeb to lower tokens per byte and improve text compression.
- parameters: {"vocab_size": 4096, "tokens_per_byte": 0.306}
Novel Contributions
- SP4096 tokenizer with improved text compression over sp1024
- Int6 STE QAT combined with zstd-22 artifact compression
- NorMuon optimizer with tuned learning rates
- Untied embeddings to improve BPB