## Summary

| Metric | Value |
|---|---|
| val_bpb | 1.1704 |
| Architecture | Transformer |
| Optimizer | Muon |
| Artifact size | 13.5 MB |
## Training Techniques
### Quantization: int6
- bits: 6
- scope: all
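The source does not spell out the quantization scheme, so here is a minimal sketch of symmetric per-tensor 6-bit quantization (values mapped to integers in [-31, 31]); the function names, per-tensor scale, and rounding choice are assumptions, and the real artifact likely also bit-packs the values to realize the 6-bit storage savings.

```python
import numpy as np

def quantize_int6(w: np.ndarray):
    """Symmetric quantization to 6 bits: integers in [-31, 31] plus one scale.

    Per-tensor scale is an assumption; the real code may use per-channel scales.
    """
    m = float(np.abs(w).max())
    scale = m / 31.0 if m > 0 else 1.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize_int6(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.array([-1.0, -0.5, 0.0, 0.5, 1.0], dtype=np.float32)
q, s = quantize_int6(w)
w_hat = dequantize_int6(q, s)
```

At 6 bits per weight instead of 8, the quantized payload shrinks by 25%, which is the artifact budget the MLP widening below spends.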
### Architecture: MLP3x
Widened the MLP from 2x to 3x, using artifact space freed by int6 quantization.
- parameters: `{"hidden": 1536}`
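A minimal sketch of the widened MLP block. The model width of 512 is an inference from `hidden: 1536` being 3x; the ReLU activation and the zero-initialized output projection are assumptions, not confirmed by the source.

```python
import numpy as np

d_model = 512           # assumed model width (1536 = 3 * 512)
d_hidden = 3 * d_model  # widened from the usual 2x to 3x

rng = np.random.default_rng(0)
W_in = (rng.standard_normal((d_model, d_hidden)) * 0.02).astype(np.float32)
W_out = np.zeros((d_hidden, d_model), dtype=np.float32)  # assumed zero-init projection

def mlp(x: np.ndarray) -> np.ndarray:
    # ReLU is a placeholder; the actual activation is unspecified in the source.
    h = np.maximum(x @ W_in, 0.0)
    return h @ W_out

x = rng.standard_normal((4, d_model)).astype(np.float32)
y = mlp(x)
```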
### LR Schedule: cosine warmdown
- formula: `0.5 * (1 + cos(pi * progress))`
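The listed formula as a learning-rate multiplier, where `progress` runs from 0 at the start of training to 1 at the end:

```python
import math

def cosine_warmdown(progress: float) -> float:
    """LR multiplier per the listed formula: 1.0 at progress=0, 0.0 at progress=1."""
    return 0.5 * (1.0 + math.cos(math.pi * progress))
```

Unlike linear decay, the cosine curve holds the rate near its peak early on and flattens again near zero at the end.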
### Initialization: OrthoInit
Orthogonal weight initialization for all linear layers except those that are zero-initialized.
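A sketch of orthogonal initialization via QR decomposition of a Gaussian matrix, the same idea as `torch.nn.init.orthogonal_`; the function name and gain handling here are illustrative, not the author's code.

```python
import numpy as np

def orthogonal_init(shape, gain=1.0, seed=0):
    """Semi-orthogonal matrix of the given shape via QR of a Gaussian draw."""
    rows, cols = shape
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((max(rows, cols), min(rows, cols)))
    q, r = np.linalg.qr(a)
    q = q * np.sign(np.diag(r))  # fix signs so the distribution is uniform
    if rows < cols:
        q = q.T
    return (gain * q[:rows, :cols]).astype(np.float32)

W = orthogonal_init((256, 512))
```

For a wide matrix like this, the rows come out orthonormal, so `W @ W.T` is (numerically) the identity.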
### Compression: zstd
- level: 22
### Evaluation: sliding window eval
- stride: 64
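A sketch of how stride-64 sliding-window evaluation typically partitions a token sequence: windows advance by the stride, and only tokens not already scored by a previous window contribute to the loss, so each token is counted exactly once. The window size of 128 is illustrative; the source only specifies the stride.

```python
def sliding_window_spans(n_tokens: int, window: int, stride: int):
    """Return (begin, end, n_scored) spans covering the sequence once."""
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, end - prev_end))  # score only the new tokens
        prev_end = end
        if end == n_tokens:
            break
    return spans

spans = sliding_window_spans(200, 128, 64)
```

Each window still conditions on up to `window - stride` tokens of overlap context, which is why this tends to report a lower (more favorable) bpb than disjoint chunking.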
### Other: RoPE base
RoPE base set to 50,000 via environment variable.
- parameters: `{"rope_base": 50000}`
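The RoPE base only enters the model through the per-pair rotation frequencies. A minimal sketch of that computation with the listed base; the head dimension of 64 is an assumption.

```python
def rope_inv_freq(head_dim: int, base: float = 50000.0):
    """Inverse frequencies base**(-2i/d) for each of the head_dim//2 rotation pairs."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

inv = rope_inv_freq(64, base=50000.0)
```

Raising the base from a smaller default slows the rotation of the low-frequency pairs, which is the usual lever for stretching the usable context length.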
### Other: GQA compatibility
Manual KV-head repeat used for GQA compatibility instead of the `enable_gqa` flag.
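A sketch of the manual repeat: each KV head is duplicated along the head axis until the K/V tensors match the query head count, reproducing what SDPA's `enable_gqa` flag does internally. The head counts here (4 KV heads, 16 query heads) are illustrative; in PyTorch this would be `torch.repeat_interleave` on the head dimension.

```python
import numpy as np

def repeat_kv(kv: np.ndarray, n_rep: int) -> np.ndarray:
    """Expand (batch, n_kv_heads, seq, head_dim) to (batch, n_kv_heads * n_rep, seq, head_dim).

    np.repeat duplicates each head consecutively, matching repeat_interleave semantics.
    """
    if n_rep == 1:
        return kv
    return np.repeat(kv, n_rep, axis=1)

k = np.zeros((2, 4, 16, 64), dtype=np.float32)  # 4 KV heads
k_full = repeat_kv(k, 4)                         # match 16 query heads
```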
## Novel Contributions
- Switched from int8 to int6 quantization to free artifact budget
- Used freed space to widen the MLP from 2x to 3x
- Replaced linear LR decay with cosine warmdown
- Applied orthogonal initialization to linear layers
- Used zstd level 22 for artifact compression
- Set RoPE base to 50k
- Implemented a GQA compatibility fix via manual KV head repeat