val_bpb: 1.2091
Architecture: Transformer
Optimizer: Muon
Artifact Size: 13.2 MB
Training Techniques
Architecture
tied embeddings
Keeps tok_emb.weight in fp16 instead of int8 to avoid quantization degradation in the tied input/output embeddings.
parameters: null
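A minimal sketch of the passthrough, assuming a symmetric per-tensor int8 scheme; the export path and everything apart from the tok_emb.weight name are assumptions:

```python
import torch
import torch.nn as nn

def quantize_int8(w: torch.Tensor):
    # Symmetric per-tensor int8: store int8 values plus one float scale.
    scale = w.abs().max().clamp(min=1e-8) / 127.0
    q = (w / scale).round().clamp(-127, 127).to(torch.int8)
    return q, scale

def export_weights(model: nn.Module, passthrough=("tok_emb.weight",)):
    # Quantize everything to int8 except the tied embedding matrix, which
    # stays in fp16: the same tensor serves as input embedding and output
    # projection, so int8 error would degrade it twice.
    artifact = {}
    for name, w in model.state_dict().items():
        if name in passthrough:
            artifact[name] = w.to(torch.float16)   # fp16 passthrough
        else:
            artifact[name] = quantize_int8(w)      # (int8 tensor, scale)
    return artifact
```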
RoPE
Uses a larger RoPE base (50,000 rather than the common default of 10,000) to improve performance.
parameters: {"base":50000}
KV head count
Uses grouped-query attention: 4 KV heads shared across 8 query heads, shrinking the K/V projections and the KV cache.
parameters: {"num_heads":8,"num_kv_heads":4}
SwiGLU MLP
Uses a wider SwiGLU feed-forward block with multiplier 2 (hidden width 1152 for dim 576).
parameters: {"layers":7,"dim":576,"mlp_mult":2}
Optimizer
Muon
weight_decay: 0.02
momentum: null
other_params: {"beta2":0.99}
Regularization
weight decay
parameters: {"value":0.02}
Evaluation
sliding window eval
parameters: {"stride":64}
LR Schedule
warmdown
parameters: {"warmdown_frac":0.6}
Quantization
fp16
bits: 16
scope: embeddings
Other
Uses a wallclock-based warmdown spanning 60% of training together with a larger batch/LR configuration.
parameters: {"train_batch_tokens":262144,"matrix_lr":0.03,"scalar_lr":0.03,"tied_embed_lr":0.04}
Novel Contributions
- Wider Transformer model with dim=576 and 7 layers using SwiGLU MLPs
- Muon optimizer with decoupled weight decay 0.02
- FP16 embedding passthrough to reduce tied-embedding quantization degradation
- Sliding window evaluation with stride 64 for improved validation BPB
- Wallclock-based warmdown at 60%
- RoPE base 50K, beta2=0.99, and tuned batch/LR settings