val_bpb: 1.2824
Architecture: Transformer
Optimizer: Muon
Artifact size: 15,767,236 bytes
Training Techniques

LR Schedule: Warmup-Stable-Decay cosine schedule
  parameters: {"warmup_fraction": 0.05, "stable_fraction": 0.75, "decay_fraction": 0.2}
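Only the three phase fractions are given above; as a minimal sketch (assuming a linear warmup, a cosine decay ending at zero, and a placeholder `peak_lr`), the schedule could look like:

```python
import math

def wsd_lr(step, total_steps, peak_lr=1.0,
           warmup_fraction=0.05, stable_fraction=0.75):
    """Warmup-Stable-Decay: linear warmup, long flat phase at peak LR,
    then a cosine decay over the remaining steps."""
    warmup_steps = max(int(warmup_fraction * total_steps), 1)
    stable_end = warmup_steps + int(stable_fraction * total_steps)
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps           # linear warmup
    if step < stable_end:
        return peak_lr                                       # stable phase at peak LR
    t = (step - stable_end) / max(total_steps - stable_end, 1)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * min(t, 1.0)))  # cosine decay
```

With warmup_fraction 0.05 and stable_fraction 0.75, the cosine decay covers the remaining 20% of steps, matching the decay_fraction above.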
Architecture: MLP3x (3x expansion MLP in the Transformer block)
Architecture: SmearGate (applied as part of the model architecture)
Architecture: BigramHash (hashed bigram token representation)
  parameters: {"size": 10240}
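The entry above specifies only the table size; as a hypothetical sketch, each (previous, current) token pair can be hashed into a fixed 10240-entry table whose embeddings supplement the unigram token embeddings (the hash constant and BOS placeholder here are illustrative, not the submission's actual BigramHash scheme):

```python
BIGRAM_TABLE_SIZE = 10240  # the "size" parameter above

def bigram_hash_ids(tokens, table_size=BIGRAM_TABLE_SIZE):
    """Map each (previous, current) token pair to a bucket in a
    fixed-size embedding table via a simple multiplicative hash."""
    ids = []
    prev = 0  # assumed BOS placeholder for the first position
    for tok in tokens:
        ids.append((prev * 1000003 + tok) % table_size)
        prev = tok
    return ids
```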
Quantization: mixed int5/int6
  scope: MLP and attention
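How int5 and int6 are assigned across the MLP and attention tensors is not specified; the sketch below shows only the mechanics of symmetric per-tensor quantization at a given bit width (an assumed scheme for illustration, not the submission's actual quantizer):

```python
import numpy as np

def fake_quantize(w, bits):
    """Symmetric per-tensor quantization to signed `bits`-bit integers
    (e.g. int5 range [-16, 15], int6 range [-32, 31])."""
    qmax = 2 ** (bits - 1) - 1
    amax = float(np.abs(w).max())
    scale = amax / qmax if amax > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from integers and scale."""
    return q.astype(np.float32) * scale
```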
Weight Averaging: SWA
  parameters: {"start_frac": 0.4, "every": 50}
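A minimal sketch of the averaging mechanics implied by start_frac=0.4 and every=50: begin averaging 40% of the way through training and fold in a snapshot every 50 steps (the incremental-mean form is an implementation assumption):

```python
import numpy as np

class SWA:
    """Stochastic Weight Averaging: keep a running mean of weight
    snapshots, starting after `start_frac` of training and updating
    every `every` steps."""
    def __init__(self, total_steps, start_frac=0.4, every=50):
        self.start_step = int(start_frac * total_steps)
        self.every = every
        self.avg = None   # list of averaged arrays, same shapes as model
        self.n = 0        # number of snapshots folded in

    def update(self, step, weights):
        if step < self.start_step or (step - self.start_step) % self.every:
            return
        self.n += 1
        if self.avg is None:
            self.avg = [w.copy() for w in weights]
        else:
            for a, w in zip(self.avg, weights):
                a += (w - a) / self.n     # incremental mean update
```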
Initialization: Orthogonal weight initialization
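The standard QR-based construction of an orthogonal initializer, for reference (the `gain` knob and sign fix follow common practice; neither is specified by the entry above):

```python
import numpy as np

def orthogonal_init(rows, cols, gain=1.0, rng=None):
    """Draw a (rows, cols) matrix with orthonormal rows or columns via
    QR decomposition of a Gaussian matrix."""
    if rng is None:
        rng = np.random.default_rng()
    a = rng.standard_normal((max(rows, cols), min(rows, cols)))
    q, r = np.linalg.qr(a)
    q = q * np.sign(np.diagonal(r))  # sign fix for a uniform (Haar) draw
    if rows < cols:
        q = q.T
    return gain * q
```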
Optimizer: Muon
  weight_decay: 0.04
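Muon's defining step is approximate orthogonalization of each matrix update via a quintic Newton-Schulz iteration. The sketch below uses the coefficients from the public Muon implementation; momentum accumulation and the weight_decay=0.04 application happen elsewhere in the optimizer step and are omitted here:

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=5):
    """Approximately orthogonalize a matrix update: push all singular
    values toward 1 with a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315   # coefficients from the public Muon code
    x = g / (np.linalg.norm(g) + 1e-7)  # Frobenius-normalize so singular values <= 1
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T                          # keep the Gram matrix small
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * (s @ s)) @ x
    return x.T if transposed else x
```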
Compression: zstd (level 22)
Evaluation: sliding window eval
  parameters: {"stride": 64}
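A sketch of stride-64 sliding-window evaluation: each window re-scores up to `context` tokens, but only the tokens not covered by the previous window contribute to the total, so most tokens are predicted with near-maximal context. The `nll_fn` interface (per-token negative log-likelihoods in nats) and the 1-byte-per-token bpb conversion are assumptions for illustration:

```python
import math

def sliding_window_bpb(nll_fn, tokens, context=256, stride=64):
    """Score `tokens` in overlapping windows advanced by `stride`,
    counting each token's loss exactly once."""
    total_nll, total_tokens, prev_end = 0.0, 0, 0
    for begin in range(0, len(tokens), stride):
        end = min(begin + context, len(tokens))
        nlls = nll_fn(tokens[begin:end])   # one nll per token in the window
        new = end - prev_end               # tokens not scored by earlier windows
        total_nll += sum(nlls[-new:])
        total_tokens += new
        prev_end = end
        if end == len(tokens):
            break
    return total_nll / total_tokens / math.log(2)  # nats -> bits per byte
```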
Novel Contributions
- Replaces the default linear warmdown learning-rate schedule with a Warmup-Stable-Decay cosine schedule.
- Uses a long stable peak-LR phase to avoid premature decay under step-limited training budgets.
- Builds on a strong base configuration with SmearGate, BigramHash, mixed int5/int6 quantization, Muon, SWA, and zstd-22.