val_bpb: 1.1636
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,753,699 bytes
Training Techniques
Architecture
BigramHash
Adds bigram hash features to token embeddings.
parameters: {"dim":256,"vocab_size":4096}
XSA
Applies cross-sequence attention in all 5 layers.
parameters: {"layers":5}
weight tying
The input embedding and output projection weights are tied.
parameters: null
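Weight tying itself is standard; a sketch of how the embedding and output head would share one matrix (module names are illustrative, vocab and width taken from the parameters in this card):

```python
import torch.nn as nn

vocab_size, model_dim = 4096, 512  # values from the configuration above

embed = nn.Embedding(vocab_size, model_dim)
lm_head = nn.Linear(model_dim, vocab_size, bias=False)
lm_head.weight = embed.weight  # one shared parameter for input and output
```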
Partial RoPE
Uses partial rotary positional embeddings.
parameters: {"dims":32}
KV head count
Uses grouped key/value heads.
parameters: {"num_heads":8,"num_kv_heads":4}
MLP6
Uses a 6x MLP expansion in a 5-layer model.
parameters: {"layers":5,"mlp_mult":6,"model_dim":512}
Quantization
int8
bits: 8
scope: all
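A minimal sketch of symmetric per-tensor int8 weight quantization; the exact scheme (per-tensor vs. per-channel scales, rounding, clipping) is an assumption rather than the submission's actual code.

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Symmetric per-tensor int8 quantization: returns int8 weights and a float scale."""
    scale = w.abs().max().clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale
```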
Compression
brotli
level: 11
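The serialized int8 weights would then be compressed with Brotli at quality 11; with the Python `brotli` package this is a one-liner (the file names are illustrative):

```python
import brotli

with open("weights_int8.bin", "rb") as f:
    raw = f.read()

compressed = brotli.compress(raw, quality=11)  # maximum Brotli quality
with open("weights_int8.bin.br", "wb") as f:
    f.write(compressed)
```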
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmup_steps":250,"warmdown_iters":1400}
Regularization
logit softcap
parameters: {"value":30}
Initialization
Kaiming init
Kaiming initialization is required for the BigramHash embeddings.
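A sketch of initializing the BigramHash bucket table with Kaiming (He) initialization, assuming a PyTorch `nn.Embedding` table as in the sketch further above:

```python
import torch.nn as nn

table = nn.Embedding(4096, 256)            # BigramHash bucket table from the parameters above
nn.init.kaiming_normal_(table.weight)      # He initialization instead of the default normal
```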
Novel Contributions
- 4096-vocabulary SentencePiece tokenizer
- Wider 5-layer model with 6x MLP expansion
- BigramHash embeddings with Kaiming initialization
- Cross-sequence attention on all 5 layers
- Brotli-11 compression for int8 weights