val_bpb: 1.1575
Architecture: Transformer
Optimizer: Muon
Artifact Size: 14.73 MB
Training Techniques
Quantization: STE QAT (quantization-aware training with a straight-through estimator)
  bits: 6
  scope: all
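As a rough sketch of what int6 STE QAT does in the forward pass, assuming symmetric per-tensor quantization (the exact scheme used here is not specified): weights are rounded onto the 6-bit grid, and the straight-through estimator treats that rounding as identity in the backward pass so gradients flow to the underlying float weights.

```python
import numpy as np

def fake_quant_int6(w: np.ndarray) -> np.ndarray:
    """Forward pass of symmetric per-tensor int6 fake quantization.

    With a straight-through estimator, the backward pass would treat
    this entire function as identity (gradients pass through the
    rounding unchanged). Per-tensor symmetric scaling is an assumption.
    """
    qmax = 2 ** (6 - 1) - 1                    # int6 symmetric range: [-31, 31]
    scale = np.abs(w).max() / qmax             # per-tensor scale
    if scale == 0:
        return w
    q = np.clip(np.round(w / scale), -qmax, qmax)   # quantize to the grid
    return q * scale                                # dequantize back to float

w = np.array([0.31, -0.155, 0.02])
wq = fake_quant_int6(w)
```

The quantization error per element is bounded by half the grid spacing, which is what makes training through the rounding tolerable.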
Architecture: SmearGate
  A per-dimension learned gate that blends each token's representation with its predecessor's.
  parameters: null
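A hypothetical reconstruction of the smear operation (the gate parameterization and the sigmoid squashing are assumptions, as no parameters are given): each channel independently mixes the current token with the previous one.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def smear_gate(x: np.ndarray, g: np.ndarray) -> np.ndarray:
    """Blend each token with its predecessor, gated per dimension.

    x: (seq_len, dim) token activations
    g: (dim,) learned gate logits; sigmoid(g) in (0, 1) controls how
       much of the previous token leaks into the current one.
    The first token has no predecessor and is left unchanged.
    """
    prev = np.concatenate([x[:1], x[:-1]], axis=0)  # shift sequence right by one
    a = sigmoid(g)                                  # per-dimension blend weight
    return (1.0 - a) * x + a * prev

x = np.array([[1.0, 2.0], [3.0, 4.0]])
g = np.array([0.0, 100.0])   # blend weights ~ [0.5, 1.0]
y = smear_gate(x, g)
```

With one scalar gate per channel, the cost is a single extra vector of parameters, which fits the tight artifact-size budget.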
MLP3x
  Feed-forward network width expanded to 3x the model dimension.
  parameters: {"multiplier": 3}
KV head count
  Grouped-query attention (GQA): 8 attention heads share 4 KV heads.
  parameters: {"heads": 8, "kv_heads": 4}
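One standard way to realize 8 query heads over 4 KV heads is to let each KV head serve two query heads; a minimal single-sequence attention sketch (head dimension and sequence length below are illustrative assumptions):

```python
import numpy as np

def gqa_attention(q, k, v, n_heads=8, n_kv_heads=4):
    """Grouped-query attention: each KV head serves n_heads // n_kv_heads
    query heads.

    q: (n_heads, seq, head_dim)
    k, v: (n_kv_heads, seq, head_dim)
    """
    group = n_heads // n_kv_heads            # 2 query heads per KV head
    k = np.repeat(k, group, axis=0)          # broadcast KV to (n_heads, seq, head_dim)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)    # softmax over keys
    return w @ v                             # (n_heads, seq, head_dim)

q = np.random.randn(8, 16, 32)
k = np.random.randn(4, 16, 32)
v = np.random.randn(4, 16, 32)
out = gqa_attention(q, k, v)
```

Halving the KV heads halves the KV projection parameters, which matters under a strict artifact-size budget.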
Tied embeddings
  Input and output embeddings are tied; the shared embedding matrix is kept in FP16.
  parameters: null
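In a minimal sketch, tying means the output head reuses the input embedding matrix as its projection, so only one matrix is stored in the artifact (shapes below are illustrative):

```python
import numpy as np

vocab, dim = 100, 16
rng = np.random.default_rng(0)
E = rng.standard_normal((vocab, dim)).astype(np.float16)  # one shared FP16 matrix

def embed(token_ids):
    """Input side: look up rows of the shared embedding matrix."""
    return E[token_ids]

def logits(hidden):
    """Output side: project with the transpose of the same matrix."""
    return hidden @ E.T
```

Keeping this one matrix in FP16 (rather than the int6 used elsewhere) trades a little size for embedding fidelity.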
Weight Averaging: SWA (stochastic weight averaging)
  parameters: {"every_steps": 50, "start_frac": 0.5, "num_checkpoints": 27}
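The parameters read as: starting halfway through training, take a snapshot every 50 steps and fold it into a running average. A minimal running-mean sketch (the dict-of-lists weight representation is purely illustrative):

```python
def swa_update(avg, weights, n_averaged):
    """Fold one checkpoint into the running SWA average.

    avg: current averaged weights (dict of name -> list of floats)
    weights: the new checkpoint, same structure
    n_averaged: how many checkpoints are already in avg
    """
    if n_averaged == 0:
        return {k: list(v) for k, v in weights.items()}, 1
    new_avg = {
        k: [(a * n_averaged + w) / (n_averaged + 1)
            for a, w in zip(avg[k], weights[k])]
        for k in avg
    }
    return new_avg, n_averaged + 1
```

The running-mean form avoids holding all 27 checkpoints in memory at once.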
Optimizer: Muon
  weight_decay: 0.038
  momentum: 0.99
  other_params: {"momentum_warmup": "0.92 -> 0.99 over 1500 steps"}
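The momentum warmup reads as a ramp from 0.92 to 0.99 over the first 1500 steps; a sketch, assuming the ramp is linear (the interpolation shape is not stated):

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    """Linearly ramp momentum from `start` to `end` over `warmup_steps`,
    then hold it at `end` for the rest of training."""
    frac = min(step / warmup_steps, 1.0)
    return start + frac * (end - start)
```

Starting with lower momentum keeps early updates from being dominated by noisy initial gradients.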
Compression: zstd
  level: 22
Evaluation: sliding window eval
  parameters: {"stride": 64}
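Sliding-window evaluation with a small stride lets most tokens be scored with near-full context: the 2048-token window advances 64 tokens at a time and only the newly exposed tokens are scored. A sketch of the bookkeeping (the exact accounting used here is an assumption):

```python
def sliding_windows(n_tokens, window=2048, stride=64):
    """Yield (start, end, score_from) spans: score tokens
    [score_from, end) using context [start, end), so every token is
    scored exactly once while later tokens see near-full context."""
    spans = []
    pos = 0                               # next token index to score
    while pos < n_tokens:
        start = max(0, pos + stride - window)
        end = min(pos + stride, n_tokens)
        spans.append((start, end, pos))
        pos = end
    return spans

spans = sliding_windows(4096)
```

The cost is roughly window/stride = 32 forward passes over each token's worth of text, which is why the stride is an eval-time-only choice.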
Sequence Length
  train_length: 2048
  eval_length: 2048
LR Schedule: warmdown
  parameters: {"warmdown_steps": 3000}
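"Warmdown" can be sketched as a constant learning rate followed by a decay to zero over the final 3000 steps; the linear-to-zero shape below is an assumption, since only the step count is given:

```python
def lr_at(step, base_lr, total_steps=9156, warmdown_steps=3000):
    """Constant LR, then linear decay to 0 over the final warmdown_steps."""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    return base_lr * max(0.0, (total_steps - step) / warmdown_steps)
```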
Regularization: weight decay
  parameters: {"weight_decay": 0.038}
Other
  Per-dimension SmearGate, plus a 10-layer depth chosen for step throughput: the shallower model maximizes the number of training steps that fit in a 10-minute wall-clock budget.
  parameters: {"layers": 10, "step_time_ms": 65.49, "steps": 9156}
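The reported step count is consistent with the wall-clock budget: 600,000 ms at 65.49 ms/step allows at most 9161 steps, and 9156 were run (the small gap plausibly covers setup and eval overhead). A quick check:

```python
budget_ms = 10 * 60 * 1000          # 10-minute wall-clock budget in ms
step_time_ms = 65.49                # reported per-step time
max_steps = int(budget_ms // step_time_ms)
```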
Novel Contributions
- 10-layer configuration chosen to improve step throughput under the 10-minute wall-clock constraint
- Systematic analysis across 17 experiments comparing architectures, LR schedules, quantization settings, and data scaling
- Int6 QAT with a straight-through estimator, combined with per-dimension SmearGate and SWA
- Demonstration that 10 layers outperform 11: the faster step time yields more training steps within the budget
- Sliding-window evaluation with stride 64 and zstd level-22 artifact compression