Summary
- val_bpb: 1.4120
- Architecture: Transformer
- Optimizer: Muon
- Artifact Size: 7.06 MB
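The headline metric, val_bpb (validation bits per byte), is a tokenizer-independent rescaling of per-token cross-entropy from nats to bits per UTF-8 byte of the underlying text. A minimal sketch of the conversion (the function name and the 1-token-per-byte example ratio are illustrative assumptions, not taken from this run):

```python
import math

def bits_per_byte(ce_loss_nats: float, n_tokens: int, n_bytes: int) -> float:
    """Convert mean per-token cross-entropy (in nats) to bits per byte.

    Total nats = ce_loss_nats * n_tokens; divide by ln(2) to get bits,
    then by the number of UTF-8 bytes the evaluated tokens decode to.
    """
    return ce_loss_nats * n_tokens / (math.log(2) * n_bytes)

# Illustration: a mean CE of ~0.9788 nats/token at 1 token per byte
# corresponds to roughly 1.412 bpb.
print(round(bits_per_byte(0.9788, 1000, 1000), 3))  # → 1.412
```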
Training Techniques
- Quantization: int8 (bits: 8, scope: all)
- Architecture: untied embeddings (embedding and output weights are not shared; TIE_EMBEDDINGS=0); number of key-value heads set to 2 (NUM_KV_HEADS=2)
- Optimizer: Muon (weight_decay: 0.04; momentum and other parameters unspecified)
- Evaluation: stride-based eval (stride: 64, eval_batch_seqs: 256)
- Test-Time Training: TTT (no parameters recorded)
- LR Schedule: warmdown (warmdown_iters: 300)
- Regularization: weight decay (weight_decay: 0.04)
Novel Contributions
- Widening the model from width 384 to 448 at 8 layers outperforms a deeper 9-layer, width-384 model
- Test-time training (TTT) provides modest improvements, but width scaling is the dominant factor
- Compact model scaling under a 16 MB artifact-size limit via int8 quantization and zlib compression
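The compact-model contribution combines int8 weight quantization with zlib compression of the serialized weights to stay under the artifact-size limit. A minimal sketch of symmetric per-tensor int8 quantization plus zlib, using NumPy (the symmetric scaling scheme and width-448 example tensor are assumptions, not necessarily this run's exact recipe):

```python
import zlib
import numpy as np

def quantize_int8(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor int8 quantization: w ≈ q * scale."""
    scale = float(np.abs(w).max()) / 127.0
    if scale == 0.0:
        scale = 1.0  # all-zero tensor: any scale reconstructs exactly
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def compressed_size(w: np.ndarray) -> int:
    """Bytes occupied by the int8 tensor after zlib, as counted
    against an artifact-size budget."""
    q, _ = quantize_int8(w)
    return len(zlib.compress(q.tobytes(), 9))

# Example on a hypothetical width-448 weight matrix:
rng = np.random.default_rng(0)
w = rng.standard_normal((448, 448)).astype(np.float32)
q, scale = quantize_int8(w)
dequant = q.astype(np.float32) * scale  # reconstruction for inference
print(q.dtype, dequant.shape, compressed_size(w))
```

Per-tensor rounding error is bounded by half the quantization step (scale / 2), which is what keeps the bpb degradation small relative to the 4x size reduction over float32.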