val_bpb: 1.1636
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,753,699 bytes
Training Techniques
Architecture
BigramHash
Adds bigram hash features to token embeddings.
parameters: {"dimensions":256,"vocab_size":4096}
XSA
Cross-sequence attention applied to all layers.
parameters: {"layers":5}
weight tying
The input embedding matrix and the output (unembedding) projection share one weight matrix.
parameters: null
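A minimal sketch of the standard weight-tying scheme, assuming the usual setup where the unembedding reuses the input embedding matrix. The dimensions (4096 vocabulary, 512 model width) are taken from the other entries; the class itself is illustrative.

```python
import torch.nn as nn

class TiedLMHead(nn.Module):
    def __init__(self, vocab_size=4096, model_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, model_dim)
        self.lm_head = nn.Linear(model_dim, vocab_size, bias=False)
        self.lm_head.weight = self.embed.weight  # one shared parameter tensor

    def forward(self, hidden):       # hidden: (batch, seq, model_dim)
        return self.lm_head(hidden)  # logits over the 4096-token vocabulary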
KV head count
Grouped-query attention: groups of query heads share a smaller set of key/value heads.
parameters: {"num_heads":8,"num_kv_heads":4}
Partial RoPE
Applies RoPE to only part of the head dimension.
parameters: {"dims":32}
MLP6
Uses 6x MLP expansion in a 5-layer model.
parameters: {"layers":5,"mlp_mult":6,"model_dim":512}
Quantization
int8
bits: 8
scope: all
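A hedged sketch of post-training int8 quantization over all weights, using simple symmetric per-tensor scaling. The actual scaling scheme (per-tensor vs. per-channel, and how scales are stored in the 15,753,699-byte artifact) is not stated in the source.

```python
import torch

def quantize_int8(w):
    scale = w.abs().max() / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q, scale):
    return q.to(torch.float32) * scale

w = torch.randn(512, 512)
q, s = quantize_int8(w)
print((dequantize_int8(q, s) - w).abs().max())  # worst-case quantization error
```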
Compression
brotli
level: 11
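A small sketch of packing the int8 weight bytes with Brotli at quality 11, the maximum level listed above. The serialization layout of the real artifact is unknown; this only shows the round trip. (Random bytes compress poorly; trained weights typically compress better.)

```python
import brotli
import numpy as np

weights_int8 = np.random.randint(-127, 128, size=(512, 512), dtype=np.int8)
raw = weights_int8.tobytes()
packed = brotli.compress(raw, quality=11)
print(len(raw), "->", len(packed), "bytes")

restored = np.frombuffer(brotli.decompress(packed), dtype=np.int8).reshape(512, 512)
assert (restored == weights_int8).all()
```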
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmup_steps":250,"warmdown_iters":1400}
Regularization
logit softcap
parameters: {"value":30}
Optimizer
Muon
weight_decay: 0.055
momentum: null
other_params: {"beta2":0.98,"matrix_lr":0.04,"scalar_lr":0.03,"tied_embed_lr":0.03}
Novel Contributions
- 4096-entry SentencePiece tokenizer for more efficient tokenization
- 5-layer architecture with a wide 6x MLP expansion, tuned for a short training budget
- BigramHash embeddings with Kaiming initialization
- Cross-sequence attention applied to all 5 layers
- Brotli (quality 11) compression of int8-quantized weights