val_bpb: 1.1622
Architecture: Transformer
Optimizer: NorMuon
Artifact Size: 15.5 MB
Training Techniques
Quantization
- STE QAT
  bits: 6
  scope: per-row block weights
- fp16
  bits: 16
  scope: tied embeddings / logit head
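A minimal sketch of the per-row int6 fake-quantization forward pass (quantize then dequantize). The function name and per-row max-magnitude scaling are assumptions; during QAT the straight-through estimator treats this op as the identity in the backward pass so gradients flow through unchanged.

```python
import numpy as np

def fake_quant_int6_per_row(w: np.ndarray) -> np.ndarray:
    """Per-row symmetric int6 fake quantization (quantize-dequantize).

    Illustrative sketch only: the real scale convention may differ.
    In QAT, the backward pass uses a straight-through estimator,
    i.e. this op is treated as the identity when computing gradients.
    """
    qmax = 2 ** (6 - 1) - 1  # 31 representable positive levels for signed int6
    # One scale per row, derived from that row's max magnitude.
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero rows
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)  # int6 grid [-32, 31]
    return q * scale  # dequantize back to float for the forward pass

w = np.array([[0.5, -1.0, 0.25],
              [2.0,  0.1, -2.0]])
wq = fake_quant_int6_per_row(w)
```

Training against `wq` while storing the int6 codes at export time is what closes the post-training quantization gap the card describes.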
Architecture
- MLP3x: wider MLP with 3x hidden size (1536), enabled by int6 compression savings
  parameters: {"hidden_dim": 1536}
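The trade-off behind the wider MLP can be sketched with simple parameter arithmetic. Here `d_model = 512` is an assumption (1536 = 3 × 512); the card only specifies the hidden size.

```python
# Parameter count of a two-matrix MLP block at the wider hidden size.
# d_model = 512 is an assumption; only hidden_dim = 1536 is given above.
d_model = 512
hidden_dim = 1536  # 3x d_model

# Up-projection plus down-projection.
mlp_params = d_model * hidden_dim + hidden_dim * d_model
print(mlp_params)  # 1572864

# int6 storage needs 6/32 of the fp32 footprint; that saving is what
# pays for the extra width inside a fixed-size artifact.
print(6 / 32)  # 0.1875
```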
Optimizer
- NorMuon
  weight_decay: null
  momentum: 0.99
  other_params: {"matrix_lr": 0.02, "scalar_lr": 0.02, "tied_embed_lr": 0.03, "muon_momentum_warmup_steps": 1500, "muon_momentum_warmup_start": 0.92}
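A toy sketch of the optimizer step, under loudly stated assumptions: the function names are invented, the cubic Newton-Schulz iteration below is a simplified stand-in for the tuned quintic polynomial Muon-family optimizers use, and NorMuon's actual normalization uses low-rank second-moment estimates rather than the plain row norms shown here.

```python
import numpy as np

def newton_schulz_orthogonalize(g: np.ndarray, steps: int = 15) -> np.ndarray:
    """Approximately orthogonalize g with a cubic Newton-Schulz iteration.

    Dividing by the Frobenius norm keeps all singular values in (0, 1],
    where the iteration x <- 1.5x - 0.5 x x^T x drives them toward 1.
    """
    x = g / (np.linalg.norm(g) + 1e-12)
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x

def normuon_update(grad, buf, momentum=0.99):
    """One NorMuon-style step: momentum accumulation, orthogonalization,
    then per-row normalization so every row of the update has a
    comparable magnitude. Illustrative only."""
    buf = momentum * buf + grad
    u = newton_schulz_orthogonalize(buf)
    u = u / (np.linalg.norm(u, axis=1, keepdims=True) + 1e-12)
    return u, buf

rng = np.random.default_rng(0)
g = rng.normal(size=(4, 8))
u, buf = normuon_update(g, np.zeros_like(g))
```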
Weight Averaging
- SWA
  parameters: {"checkpoints_averaged": 7, "checkpoint_interval_steps": 200}
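The averaging itself reduces to a uniform mean over saved checkpoints. A minimal sketch, with checkpoint loading stubbed by in-memory dicts standing in for the 7 checkpoints saved at 200-step intervals:

```python
import numpy as np

def average_checkpoints(checkpoints):
    """Uniformly average a list of state dicts (name -> array),
    as in stochastic weight averaging over late-training checkpoints."""
    keys = checkpoints[0].keys()
    return {k: np.mean([c[k] for c in checkpoints], axis=0) for k in keys}

# Hypothetical stand-ins for 7 checkpoints; real ones would be loaded
# from disk at 200-step intervals during the warmdown phase.
ckpts = [{"w": np.full((2, 2), float(i))} for i in range(7)]
avg = average_checkpoints(ckpts)
```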
Evaluation
- sliding window eval
  parameters: {"stride": 64, "context_length": 1024}
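A sketch of the window arithmetic: each token is scored with up to 1024 tokens of left context, the window advances by stride 64, and loss is counted only on the newly covered tokens. The model call itself is omitted; only the index bookkeeping is shown.

```python
def sliding_windows(n_tokens: int, context: int = 1024, stride: int = 64):
    """Yield (start, end, score_from): run the model on tokens[start:end],
    but count loss only on tokens[score_from:end], so most scored tokens
    see close to the full 1024-token context."""
    pos = 0
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        start = max(0, end - context)
        yield start, end, pos
        pos = end

windows = list(sliding_windows(n_tokens=2048))
```

Because every scored token (after warm-up) gets nearly full context instead of the truncated context a disjoint-chunk evaluation gives, the measured val_bpb improves at the cost of more forward passes.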
Sequence Length
- train_length: null
- eval_length: 1024
LR Schedule
- warmdown
  parameters: {"warmdown_iters": 3000}
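A minimal sketch of a warmdown schedule: the learning rate is held constant, then decays linearly to zero over the final 3000 iterations. `total_iters` is a hypothetical value; the card only specifies `warmdown_iters`.

```python
def warmdown_lr(step: int, base_lr: float, total_iters: int,
                warmdown_iters: int = 3000) -> float:
    """Constant LR, then linear decay to 0 over the last warmdown_iters."""
    if step < total_iters - warmdown_iters:
        return base_lr
    remaining = total_iters - step
    return base_lr * remaining / warmdown_iters

total = 10_000  # hypothetical total iteration count
print(warmdown_lr(0, 0.02, total))       # constant phase: 0.02
print(warmdown_lr(8_500, 0.02, total))   # halfway through warmdown: 0.01
print(warmdown_lr(10_000, 0.02, total))  # end of training: 0.0
```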
Compression
- zstd
  level: 22
Novel Contributions
- Per-row int6 fake quantization with straight-through estimator to reduce post-training quantization gap
- Keeping the tied embedding/logit head in fp16 to avoid quantization sensitivity
- Using a wider 3x MLP made possible by int6 compression savings
- Replacing Muon with NorMuon, which row-normalizes the Newton-Schulz orthogonalized updates
- Applying stochastic weight averaging over the final warmdown checkpoints
- Using sliding-window evaluation with stride 64 to improve measured val_bpb