val_bpb: 1.4222
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,576,677 bytes
Training Techniques

- Quantization: mixed int5/int6 with QAT
  - bits: mixed (5 and 6; see scope)
  - scope: int5 for MLP weights, int6 for attention/bigram-sensitive weights
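A minimal sketch of the fake-quantization step behind a QAT setup like this (the function name and the symmetric per-tensor rounding scheme are our assumptions, not the submission's actual code): weights are rounded to a 5- or 6-bit grid in the forward pass so the training loss sees the quantization error.

```python
def fake_quantize(w, bits):
    """Symmetric per-tensor fake quantization: snap weights to a signed
    `bits`-bit integer grid, then map back to floats so the quantization
    error is visible to the loss during training (QAT)."""
    qmax = 2 ** (bits - 1) - 1              # 15 for int5, 31 for int6
    scale = max(abs(x) for x in w) / qmax or 1.0
    return [round(x / scale) * scale for x in w]

# int5 for MLP weights, int6 for attention/bigram-sensitive weights
mlp_w  = fake_quantize([0.31, -0.07, 0.88], bits=5)
attn_w = fake_quantize([0.31, -0.07, 0.88], bits=6)
```

Per the contributions list below, QAT is only enabled for the final fraction of training (QAT_FINAL_FRAC=0.15), so this rounding would be a no-op for the first 85% of steps.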
- Architecture: BigramHash embedding added to the model
  - parameters: {"BIGRAM_VOCAB_SIZE":10240,"BIGRAM_DIM":128}
- Architecture: 3x MLP expansion
  - parameters: {"MLP_MULT":3,"NUM_LAYERS":10}
- Weight Averaging: EMA
  - parameters: {"decay":0.9999}
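The EMA mechanics are standard and can be sketched in a few lines (the class name is ours; per the contributions list below, the averaged copy is what gets exported, not the raw training weights):

```python
class EMA:
    """Exponential moving average of model weights. With decay=0.9999 the
    shadow copy moves only 0.01% toward the live weights per update, so it
    averages over roughly the last ~10k steps."""
    def __init__(self, weights, decay=0.9999):
        self.decay = decay
        self.shadow = list(weights)

    def update(self, weights):
        d = self.decay
        self.shadow = [d * s + (1 - d) * w
                       for s, w in zip(self.shadow, weights)]

ema = EMA([0.0, 1.0], decay=0.9999)
ema.update([1.0, 1.0])
```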
- Optimizer: Muon
  - weight_decay: 0.04
  - momentum: 0.99
  - other_params: {"MATRIX_LR":0.02,"SCALAR_LR":0.04,"TIED_EMBED_LR":0.04}
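Muon applies SGD-momentum to each 2D weight matrix and then approximately orthogonalizes the update via a quintic Newton-Schulz iteration. A rough pure-Python sketch follows; the (a, b, c) coefficients are those of the public Muon reference implementation, while the helper names and the exact decoupled-weight-decay placement are our assumptions about this run.

```python
def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(x * y for x, y in zip(row, col)) for col in Bt] for row in A]

def newton_schulz_orth(G, steps=5):
    """Approximately orthogonalize G (push its singular values toward 1)
    with the quintic iteration X <- aX + (b*A + c*A^2) X, A = X X^T."""
    a, b, c = 3.4445, -4.7750, 2.0315
    norm = sum(x * x for row in G for x in row) ** 0.5 + 1e-7
    X = [[x / norm for x in row] for row in G]
    for _ in range(steps):
        A = matmul(X, transpose(X))
        AA = matmul(A, A)
        B = [[b * p + c * q for p, q in zip(r1, r2)] for r1, r2 in zip(A, AA)]
        X = [[a * p + q for p, q in zip(r1, r2)]
             for r1, r2 in zip(X, matmul(B, X))]
    return X

def muon_step(W, grad, buf, lr=0.02, momentum=0.99, weight_decay=0.04):
    """One Muon update on a 2D weight matrix: momentum accumulation,
    orthogonalization, then the step with decoupled weight decay.
    Defaults mirror MATRIX_LR, momentum, and weight_decay above."""
    for i in range(len(buf)):
        for j in range(len(buf[0])):
            buf[i][j] = momentum * buf[i][j] + grad[i][j]
    O = newton_schulz_orth(buf)
    for i in range(len(W)):
        for j in range(len(W[0])):
            W[i][j] = W[i][j] * (1 - lr * weight_decay) - lr * O[i][j]
```

The separate SCALAR_LR and TIED_EMBED_LR groups above suggest non-matrix parameters are handled by a different rule (typically a plain momentum/Adam-style update), which this sketch does not cover.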
- Regularization: weight decay
  - parameters: {"weight_decay":0.04}
- Sequence Length
  - train_length: 2048
  - eval_length: null
- LR Schedule: warmdown
  - parameters: {"warmdown_iters":160}
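The shape of the warmdown schedule is not spelled out in this card; a common reading (assumed, not confirmed here) is a constant learning rate followed by a linear decay to zero over the final `warmdown_iters` steps:

```python
def lr_scale(step, total_steps, warmdown_iters=160):
    """Multiplier on the base LR: 1.0 for most of training, then a
    linear 'warmdown' to 0 over the last `warmdown_iters` steps."""
    steps_left = total_steps - step
    if steps_left >= warmdown_iters:
        return 1.0
    return steps_left / warmdown_iters

# e.g. effective matrix LR at a given step: MATRIX_LR * lr_scale(step, total)
```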
Novel Contributions
- Mixed quantization using int5 for MLP weights and int6 for attention/bigram-sensitive weights
- Use of EMA (Exponential Moving Average) for export-time weights with a high decay (0.9999)
- Final-fraction QAT (Quantization Aware Training) with QAT_FINAL_FRAC=0.15
- Incorporation of BigramHash embedding with large vocab size and dimension
- 3x MLP expansion in a 10-layer Transformer model
- Use of the Muon optimizer with per-group learning rates (matrix 0.02; scalar and tied-embedding 0.04) and momentum tuned to 0.99
- Compression of final artifact under 16MB using int8+zlib
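The int8+zlib packing in the last bullet could look like the following sketch (symmetric per-tensor int8 with a stored float32 scale is our assumption about the format; the byte layout is illustrative):

```python
import struct
import zlib

def pack_int8_zlib(weights):
    """Quantize floats to symmetric int8 and zlib-compress the bytes.
    The scale is stored up front so the artifact can be decoded."""
    qmax = 127
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = bytes(round(w / scale) & 0xFF for w in weights)   # two's complement
    return struct.pack("<f", scale) + zlib.compress(q, level=9)

def unpack_int8_zlib(blob, n):
    scale = struct.unpack("<f", blob[:4])[0]
    raw = zlib.decompress(blob[4:])
    # reinterpret unsigned bytes as signed int8
    q = [b - 256 if b > 127 else b for b in raw[:n]]
    return [x * scale for x in q]
```

zlib on top of int8 exploits the skewed distribution of quantized weight values; combined with the int5/int6 QAT above, that is presumably how the artifact lands at 15,576,677 bytes, under the 16MB budget.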