val_bpb: 1.1933
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.77 MB
Training Techniques
Architecture
MLP3x
Uses a 3x MLP expansion width (hidden=1536) instead of the baseline 2x width.
parameters: {"layers":7,"width":512,"hidden":1536,"attention_heads":8,"kv_heads":4}
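As a rough sanity check on the parameter budget, here is a sketch of the non-embedding parameter count for this config. It assumes a plain two-matrix (non-gated) MLP with no biases, and excludes embeddings and norm gains, so it is an estimate rather than the exact count:

```python
# Per-layer parameter count for the 7x512 config (assumptions: plain
# two-matrix MLP, no biases; embeddings and norms excluded).
layers, width, hidden = 7, 512, 1536
heads, kv_heads = 8, 4
head_dim = width // heads                   # 64

q_proj = width * heads * head_dim           # query projection
kv_proj = 2 * width * kv_heads * head_dim   # grouped K and V projections
o_proj = heads * head_dim * width           # attention output projection
mlp = 2 * width * hidden                    # up + down projections

total = layers * (q_proj + kv_proj + o_proj + mlp)
print(total)  # 16515072 non-embedding parameters, ~16.5M
```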
tied embeddings
Input and output embeddings are tied.
parameters: null
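Tying here means the LM head reuses the input embedding matrix, so those parameters are stored (and quantized) only once. A minimal illustrative sketch, with a hypothetical class and toy sizes:

```python
class TiedLM:
    """Toy model skeleton showing input/output embedding tying."""
    def __init__(self, vocab_size, width):
        # One matrix serves as both the input embedding and the LM head.
        self.embed = [[0.0] * width for _ in range(vocab_size)]
        self.lm_head = self.embed  # tied: same underlying object

m = TiedLM(vocab_size=256, width=8)
assert m.lm_head is m.embed
```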
KV head count
Uses grouped-query attention with fewer KV heads than attention heads.
parameters: {"attention_heads":8,"kv_heads":4}
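With 8 query heads over 4 KV heads, each KV head is shared by 2 query heads. The query-to-KV head mapping can be sketched as:

```python
heads, kv_heads = 8, 4
group_size = heads // kv_heads                 # query heads per KV head
kv_of = [q // group_size for q in range(heads)]
print(kv_of)  # [0, 0, 1, 1, 2, 2, 3, 3]
```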
extra RMSNorm
Adds an extra RMSNorm before attention and MLP output projections.
parameters: null
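RMSNorm itself is standard; a minimal pure-Python version (learnable gain omitted), independent of where the extra copies are placed:

```python
import math

def rmsnorm(x, eps=1e-6):
    # Scale a vector to approximately unit RMS.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

y = rmsnorm([3.0, 4.0])
```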
Sequence Length
sequence_length
train_length: 4096
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_iters":3000}
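A warmdown with `warmdown_iters=3000` presumably means the LR is held constant and then decayed linearly to zero over the final 3000 steps; a sketch under that assumption:

```python
def lr_scale(step, total_iters, warmdown_iters=3000):
    # Constant LR until the warmdown window, then linear decay to zero.
    decay_start = total_iters - warmdown_iters
    if step < decay_start:
        return 1.0
    return max(0.0, (total_iters - step) / warmdown_iters)
```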
Other
other
Uses a lower tied-embedding learning rate to smooth the weight distribution and reduce the quantization gap.
parameters: {"tied_embed_lr":0.01}
other
Uses a lower matrix learning rate for smoother Muon updates.
parameters: {"matrix_lr":0.03}
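Taken together, the two learning-rate tweaks amount to separate optimizer parameter groups. A hypothetical grouping (the group names and the Muon/AdamW split for embeddings are assumptions, not stated above):

```python
param_groups = [
    # Hidden-layer matrices, updated by Muon at the lowered matrix LR.
    {"params": "matrices", "optimizer": "Muon", "lr": 0.03},
    # Tied embedding matrix at a lowered LR, keeping its weights smooth
    # and hence easier to quantize.
    {"params": "tied_embedding", "optimizer": "AdamW", "lr": 0.01},
]
```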
other
Uses a reduced logit softcap to tighten the output distribution and aid quantization.
parameters: {"logit_softcap":15}
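Logit softcapping bounds logits with a scaled tanh; with the reduced cap of 15, small logits pass through almost unchanged while large ones saturate near ±15:

```python
import math

def softcap(logit, cap=15.0):
    # Smoothly squashes a logit into the open interval (-cap, cap).
    return cap * math.tanh(logit / cap)
```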
other
Adjusts the qk gain initialization (set to 1).
parameters: {"qk_gain_init":1}
Quantization
int8
bits: 8
scope: all
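The scope "all" suggests every weight tensor is quantized. A sketch of standard symmetric per-tensor int8 quantization, mapping the max-absolute value to 127:

```python
def quantize_int8(values):
    # Symmetric per-tensor quantization: map max-abs to 127.
    # The `or 1.0` guards against an all-zero tensor (scale would be 0).
    scale = max(abs(v) for v in values) / 127 or 1.0
    return [round(v / scale) for v in values], scale

def dequantize(quants, scale):
    return [q * scale for q in quants]

q, s = quantize_int8([-1.0, 0.0, 0.5, 1.0])
```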
Compression
zlib
level: null
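The int8 weight bytes are then zlib-compressed (the level is unspecified above; level 9 below is just an illustration). Smoother weight distributions compress better, which is the stated motivation for the lowered embedding LR:

```python
import zlib

payload = bytes(range(256)) * 64        # stand-in for int8 weight bytes
blob = zlib.compress(payload, level=9)  # compression level is an assumption
assert zlib.decompress(blob) == payload
assert len(blob) < len(payload)         # repetitive data compresses well
```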
Novel Contributions
- 7-layer, 512-dim transformer with 3x MLP width at roughly the same parameter count as the baseline
- Training at sequence length 4096 for improved per-step quality
- Lower tied embedding learning rate to create smoother weights and dramatically reduce quantization gap
- Carefully tuned learning rates for matrix parameters and embeddings
- Reduced logit softcap to improve both training and quantization
- Longer warmdown schedule for better generalization
- Standard int8 quantization with zlib compression and no QAT