val_bpb: 1.1704
Architecture: Transformer
Optimizer: Muon
Artifact Size: 13.5 MB
Training Techniques
Quantization: int6 (bits: 6, scope: all)
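The int6 quantization above can be sketched as symmetric per-tensor quantization with codes in [-31, 31]; the NumPy round-trip below is an illustrative assumption, not the run's actual code:

```python
import numpy as np

def quantize_int6(w: np.ndarray):
    # Symmetric per-tensor 6-bit quantization: codes in [-31, 31].
    # The per-tensor scale and clipping range are assumptions for illustration.
    scale = np.maximum(np.abs(w).max() / 31.0, 1e-12)  # guard against all-zero tensors
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize_int6(q: np.ndarray, scale: float) -> np.ndarray:
    # Reconstruct approximate weights from codes and scale.
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int6(w)
w_hat = dequantize_int6(q, s)
```

The rounding error per weight is bounded by half a quantization step, which is what frees artifact budget relative to int8 at a small accuracy cost.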
Architecture: MLP3x (hidden: 1536)
Widened the MLP from 2x to 3x the model width, using the artifact budget freed by int6 quantization.
RoPE: base 50000 (set via environment variable)
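The base-50000 setting changes RoPE's inverse frequencies. A minimal sketch of how such an override might look, assuming a hypothetical `ROPE_BASE` environment variable name and the standard RoPE frequency formula:

```python
import os
import numpy as np

# Hypothetical env-var override; the variable name ROPE_BASE is an assumption.
base = float(os.environ.get("ROPE_BASE", 50000.0))

def rope_inv_freq(head_dim: int, base: float) -> np.ndarray:
    # Standard RoPE inverse frequencies: base**(-2i/d) for i in [0, d/2).
    return base ** (-np.arange(0, head_dim, 2) / head_dim)

inv_freq = rope_inv_freq(64, base)
```

A larger base slows the rotation of the low-frequency dimensions, which is the usual motivation for raising it above the common default of 10000.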
KV head count: KV heads repeated manually to match the query head count for GQA compatibility, instead of using enable_gqa
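The manual KV repeat can be sketched as an expand-and-reshape over the head dimension (equivalent to a `repeat_interleave` along heads); shown here in NumPy with assumed tensor layout `(batch, n_kv_heads, seq, head_dim)`:

```python
import numpy as np

def repeat_kv(x: np.ndarray, n_rep: int) -> np.ndarray:
    # Repeat each KV head n_rep times along the head axis so that
    # (batch, n_kv_heads, seq, head_dim) -> (batch, n_kv_heads * n_rep, seq, head_dim).
    # The broadcast avoids copying until the final reshape.
    b, n_kv, s, d = x.shape
    expanded = np.broadcast_to(x[:, :, None], (b, n_kv, n_rep, s, d))
    return expanded.reshape(b, n_kv * n_rep, s, d)

k = np.random.randn(2, 4, 8, 16)   # 4 KV heads
k_rep = repeat_kv(k, 4)            # repeated to match 16 query heads
```

After the repeat, a standard attention kernel sees matching query and KV head counts, so no GQA-aware attention path is needed.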
Initialization: OrthoInit
Orthogonal weight initialization for all linear layers except zero-init layers.
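Orthogonal initialization is typically built by QR-decomposing a Gaussian matrix; the NumPy helper below is a sketch of that standard construction (similar in spirit to `torch.nn.init.orthogonal_`), with the zero-init exception assumed to be handled by the caller:

```python
import numpy as np

def orthogonal_init(shape: tuple, rng: np.random.Generator) -> np.ndarray:
    # Build an orthogonal matrix via QR of a Gaussian sample.
    # Skipping zero-init layers is the caller's responsibility here.
    rows, cols = shape
    flat = rng.standard_normal((max(rows, cols), min(rows, cols)))
    q, r = np.linalg.qr(flat)
    q = q * np.sign(np.diag(r))  # fix signs so the factorization is unique
    return q if rows >= cols else q.T

rng = np.random.default_rng(0)
w = orthogonal_init((8, 8), rng)
```

The sign correction on the diagonal of R makes the sampled matrix uniformly distributed over orthogonal matrices rather than biased by QR's sign convention.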
Compression: zstd (level: 22)
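The compression step corresponds to a zstd invocation along these lines; the artifact filename is a placeholder, and levels above 19 require the `--ultra` flag:

```sh
# Compress the packaged artifact at zstd's maximum standard level (22).
# model.bin is a hypothetical filename, not the run's actual artifact name.
zstd --ultra -22 model.bin -o model.bin.zst
```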
Evaluation: sliding window eval (stride: 64)
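Sliding-window evaluation with a stride scores only the tokens past the previous window's end, so every token is evaluated exactly once with near-maximal left context. A sketch of the window bookkeeping, with the window length as an assumed parameter:

```python
def sliding_windows(n_tokens: int, window: int, stride: int = 64):
    # Yield (begin, end, n_scored): the model reads tokens[begin:end],
    # but only the last n_scored positions of each window contribute
    # to the loss, so tokens are never scored twice.
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        yield begin, end, end - prev_end
        prev_end = end
        if end == n_tokens:
            break

windows = list(sliding_windows(300, window=128, stride=64))
```

A smaller stride gives each scored token more left context at the cost of more forward passes; stride 64 trades off between the two.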
LR Schedule: cosine warmdown, lr multiplier 0.5 * (1 + cos(pi * progress))
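The warmdown formula above decays the learning rate smoothly from its peak to zero. A minimal sketch, assuming no warmup phase and a peak learning rate `lr_max`:

```python
import math

def cosine_warmdown(step: int, total_steps: int, lr_max: float) -> float:
    # Cosine decay using the documented multiplier 0.5 * (1 + cos(pi * progress)):
    # lr_max at step 0, lr_max/2 at the midpoint, 0 at the final step.
    progress = step / total_steps
    return lr_max * 0.5 * (1.0 + math.cos(math.pi * progress))
```

Compared to linear decay, the cosine shape holds the rate higher through the middle of training and tapers more gently at the end.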
Regularization: weight decay (optimizer: Muon WD)
Novel Contributions
- Switched from int8 to int6 quantization to free artifact budget
- Used the freed budget to widen the MLP from 2x to 3x
- Replaced linear LR decay with cosine warmdown
- Applied orthogonal initialization to linear layers
- Used zstd level 22 for artifact compression
- Implemented manual KV head repeat for GQA compatibility