| val_bpb | Architecture | Optimizer | Artifact Size |
|---|---|---|---|
| 1.3527 | Transformer | Muon | 17.7 MB |
Training Techniques
Architecture
- weight tying: Tied embeddings are enabled. (parameters: null)
- GQA: Uses 8 query heads and 2 KV heads. (parameters: {"num_heads":8,"num_kv_heads":2})
- shared attention: Interior transformer layers share attention modules in adjacent pairs. (parameters: {"shared_pairs":4})
- SwiGLU: Replaces the baseline MLP with a PRP-based SwiGLU block. (parameters: {"mlp_mult":4})
- PRP: Parametrized random projection MLP with a fixed random projection buffer and low-dimensional trainable controls. (parameters: {"rank":32})
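The attention settings above can be sketched in a few lines of NumPy. This is an illustration of the technique, not the model's actual code: with 8 query heads and 2 KV heads, each group of 4 query heads attends with the same K/V head, and "shared attention" amounts to the second layer of an adjacent pair reusing the first layer's projection weights.

```python
import numpy as np

def gqa_attention(x, wq, wk, wv, num_heads=8, num_kv_heads=2):
    """Grouped-query attention: num_heads query heads share num_kv_heads KV heads."""
    T, d = x.shape
    hd = d // num_heads                          # per-head dimension
    q = (x @ wq).reshape(T, num_heads, hd)
    k = (x @ wk).reshape(T, num_kv_heads, hd)
    v = (x @ wv).reshape(T, num_kv_heads, hd)
    group = num_heads // num_kv_heads            # 4 query heads per KV head
    out = np.empty_like(q)
    for h in range(num_heads):
        kv = h // group                          # KV head this query head reads
        scores = q[:, h] @ k[:, kv].T / np.sqrt(hd)
        scores += np.triu(np.full((T, T), -np.inf), k=1)  # causal mask
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        out[:, h] = w @ v[:, kv]
    return out.reshape(T, d)

rng = np.random.default_rng(0)
d = 64
x = rng.standard_normal((10, d))
wq = rng.standard_normal((d, d)) * 0.02
wk = rng.standard_normal((d, d // 4)) * 0.02    # KV projections are 4x narrower
wv = rng.standard_normal((d, d // 4)) * 0.02
y = gqa_attention(x, wq, wk, wv)
# "Shared attention" across an adjacent pair is just weight reuse:
# the paired layer calls gqa_attention with the same (wq, wk, wv) arrays.
print(y.shape)  # (10, 64)
```

Because the KV projections are a quarter of the query width, GQA shrinks both the KV cache and the K/V weight matrices relative to full multi-head attention.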
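The PRP-based SwiGLU block can be sketched under one plausible reading of the description: each projection keeps a frozen random base matrix (a buffer, never trained) and trains only a rank-32 correction. The `PRPLinear` parametrization below is an assumption for illustration, not the documented implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

class PRPLinear:
    """Hypothetical PRP layer: frozen random projection plus a rank-r trainable
    correction (U @ V). Only U and V would be trained; `base` is a fixed buffer.
    This exact parametrization is an assumption, not the model's documented code."""
    def __init__(self, d_in, d_out, rank=32):
        self.base = rng.standard_normal((d_in, d_out)) / np.sqrt(d_in)  # frozen buffer
        self.U = np.zeros((d_in, rank))                                 # trainable controls
        self.V = rng.standard_normal((rank, d_out)) / np.sqrt(rank)     # trainable controls
    def __call__(self, x):
        return x @ self.base + (x @ self.U) @ self.V

def swiglu_mlp(x, d, mlp_mult=4, rank=32):
    """SwiGLU built from PRP layers: down(silu(gate(x)) * up(x))."""
    h = d * mlp_mult
    gate, up, down = PRPLinear(d, h, rank), PRPLinear(d, h, rank), PRPLinear(h, d, rank)
    g = gate(x)
    silu = g / (1.0 + np.exp(-g))    # SiLU(g) = g * sigmoid(g)
    return down(silu * up(x))

x = rng.standard_normal((4, 64))
y = swiglu_mlp(x, d=64)
print(y.shape)  # (4, 64)
```

Under this reading, the trainable parameter count per projection drops from `d_in * d_out` to `rank * (d_in + d_out)`, which is consistent with the small artifact size reported above.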
Optimizer
- Muon
  - weight_decay: null
  - momentum: 0.95
  - other_params: {"matrix_lr":0.03,"scalar_lr":0.03,"prp_lr":0.06,"tied_embed_lr":0.03}
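The four learning rates in `other_params` imply four optimizer parameter groups (matrix weights, scalar/vector parameters, PRP controls, and the tied embedding). A minimal sketch of assembling such groups follows; the name-based classification rules are illustrative assumptions, not the repo's logic.

```python
import numpy as np

def build_param_groups(named_params, cfg):
    """Split parameters into the four LR groups implied by the config.
    The `in name` / ndim predicates here are illustrative assumptions."""
    groups = {"matrix": [], "scalar": [], "prp": [], "tied_embed": []}
    for name, p in named_params:
        if "prp" in name:                     # PRP low-dimensional controls
            groups["prp"].append(p)
        elif "embed" in name:                 # tied embedding/unembedding matrix
            groups["tied_embed"].append(p)
        elif getattr(p, "ndim", 0) >= 2:      # weight matrices (Muon-style updates)
            groups["matrix"].append(p)
        else:                                 # gains, biases, other scalars/vectors
            groups["scalar"].append(p)
    return [
        {"params": groups["matrix"],     "lr": cfg["matrix_lr"],     "momentum": 0.95},
        {"params": groups["scalar"],     "lr": cfg["scalar_lr"],     "momentum": 0.95},
        {"params": groups["prp"],        "lr": cfg["prp_lr"],        "momentum": 0.95},
        {"params": groups["tied_embed"], "lr": cfg["tied_embed_lr"], "momentum": 0.95},
    ]

cfg = {"matrix_lr": 0.03, "scalar_lr": 0.03, "prp_lr": 0.06, "tied_embed_lr": 0.03}
params = [("embed.weight", np.zeros((8, 4))),
          ("mlp.prp_u",    np.zeros((4, 2))),
          ("norm.gain",    np.zeros(4)),
          ("attn.wq",      np.zeros((4, 4)))]
pg = build_param_groups(params, cfg)
print([g["lr"] for g in pg])  # [0.03, 0.03, 0.06, 0.03]
```

Note the PRP controls get 2x the learning rate of everything else (0.06 vs. 0.03), matching the separate-optimizer-group contribution listed below.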
LR Schedule
- cosine decay (parameters: {"warmup_steps":2,"main_scale":0.5,"min_scale":0.05,"gamma":0.8})
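A minimal sketch of the schedule, assuming a linear warmup followed by a cosine decay from `main_scale` down to `min_scale`. The config's `gamma` is not modeled, since its role (e.g. a per-cycle or exponential factor) is not specified here.

```python
import math

def lr_scale(step, total_steps, warmup_steps=2, main_scale=0.5, min_scale=0.05):
    """Linear warmup, then cosine decay from main_scale to min_scale.
    (`gamma` from the config is intentionally omitted; its role is unspecified.)"""
    if step < warmup_steps:
        return main_scale * (step + 1) / warmup_steps          # linear warmup
    t = (step - warmup_steps) / max(1, total_steps - 1 - warmup_steps)
    return min_scale + (main_scale - min_scale) * 0.5 * (1 + math.cos(math.pi * t))

print(lr_scale(0, 100))              # 0.25  (mid-warmup)
print(lr_scale(2, 100))              # 0.5   (peak, start of cosine)
print(round(lr_scale(99, 100), 4))   # 0.05  (floor at the end)
```

The multiplier would scale each group's base LR, so e.g. the matrix group peaks at 0.03 * 0.5.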
Sequence Length
- train_length: 2048
- eval_length: null
Compression
- zlib (level: null)
Novel Contributions
- 21-layer Transformer with shared attention across adjacent interior layer pairs
- PRP-based SwiGLU MLP with low-dimensional trainable controls
- 8k SentencePiece tokenizer and 2048-token training context
- Separate optimizer group for PRP vector controls
- Shared-aware int8 export that deduplicates repeated storage before counting bytes
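The last bullet can be sketched as follows: arrays referenced by more than one layer (shared attention pairs, tied embeddings) are stored once, keyed by object identity, and bytes are counted after zlib compression (a null `level` presumably meaning zlib's default). The helper below is an illustration, not the repo's exporter, and the quantization rule is a naive placeholder.

```python
import zlib
import numpy as np

def export_int8(named_tensors):
    """Shared-aware export sketch: identical (shared) arrays are stored once.
    Dedup is by object identity, so two layers pointing at the same array
    contribute a single compressed payload; later names become aliases."""
    seen = {}       # id(array) -> name of first occurrence
    blobs = {}      # name -> zlib-compressed int8 bytes
    aliases = {}    # name -> name of the payload it shares
    for name, arr in named_tensors:
        key = seen.get(id(arr))
        if key is None:
            seen[id(arr)] = name
            q = np.clip(np.round(arr * 127), -127, 127).astype(np.int8)  # naive int8 quant
            blobs[name] = zlib.compress(q.tobytes())   # default compression level
        else:
            aliases[name] = key
    total = sum(len(b) for b in blobs.values())        # count bytes AFTER dedup
    return blobs, aliases, total

rng = np.random.default_rng(0)
shared = rng.standard_normal((64, 64))
tensors = [("layer0.attn", shared),                    # same array object twice
           ("layer1.attn", shared),
           ("layer2.attn", rng.standard_normal((64, 64)))]
blobs, aliases, nbytes = export_int8(tensors)
print(len(blobs), aliases)  # 2 {'layer1.attn': 'layer0.attn'}
```

Counting bytes after deduplication is what makes the shared-attention pairs and tied embeddings "free" in the reported 17.7 MB artifact size.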