PR #89

Status: open

record: val_bpb=1.1622, NorMuon + int6 STE + SWA + sliding window

  • val_bpb: 1.1622
  • Architecture: Transformer
  • Optimizer: NorMuon
  • Artifact Size: 15.5 MB

Training Techniques

Quantization
  • STE QAT: bits: 6, scope: per-row block weights
  • fp16: bits: 16, scope: tied embeddings / logit head
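The int6 STE QAT entry above can be sketched as per-row symmetric fake quantization with a straight-through estimator: the forward pass sees quantized weights, while gradients flow through as if quantization were the identity. This is a minimal illustration; the PR's exact scaling and rounding details are assumptions here.

```python
import torch

def fake_quant_int6_per_row(w: torch.Tensor) -> torch.Tensor:
    """Per-row symmetric int6 fake quantization with a straight-through
    estimator (STE). Illustrative sketch, not the PR's exact code."""
    levels = 2 ** (6 - 1) - 1  # 31 representable magnitudes for signed int6
    # One scale per row, from that row's max absolute weight.
    scale = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / levels
    q = torch.round(w / scale).clamp(-levels, levels) * scale
    # STE: forward uses q, backward treats quantization as identity.
    return w + (q - w).detach()
```

Keeping the tied embedding/logit head out of this path (in fp16, per the entry above) avoids quantizing the most quantization-sensitive parameters.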
Architecture
  • MLP3x: wider MLP with 3x hidden size (1536), enabled by int6 compression savings; parameters: {"hidden_dim":1536}
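The MLP3x block above only specifies hidden_dim=1536; a standard two-layer transformer MLP with that hidden width might look like the sketch below. The model dimension of 512 (so that 1536 is 3x) and the GELU activation are assumptions, not stated in the PR.

```python
import torch
import torch.nn as nn

class MLP3x(nn.Module):
    """Wider transformer MLP block with hidden_dim=1536 (per the PR).
    dim=512 and GELU are illustrative assumptions."""
    def __init__(self, dim: int = 512, hidden: int = 1536):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden)  # expand to the 3x hidden width
        self.fc2 = nn.Linear(hidden, dim)  # project back to model dim
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(self.act(self.fc1(x)))
```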
Optimizer
  • NorMuon: momentum: 0.99, weight_decay: null
    other_params: {"matrix_lr":0.02,"scalar_lr":0.02,"tied_embed_lr":0.03,"muon_momentum_warmup_steps":1500,"muon_momentum_warmup_start":0.92}
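The core idea behind the NorMuon entry above can be sketched as Muon's Newton-Schulz orthogonalization followed by a per-row second-moment normalization of the update. The iteration coefficients follow the published Muon recipe; the buffer names, beta2, and epsilon below are illustrative assumptions, not the PR's exact implementation.

```python
import torch

def newton_schulz(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Muon-style quintic Newton-Schulz iteration: approximately
    orthogonalizes a gradient/momentum matrix."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

def normuon_step(grad, momentum_buf, row_v, beta=0.99, beta2=0.95, eps=1e-8):
    """NorMuon-style update sketch: orthogonalize the momentum as in Muon,
    then normalize each output row by a running second-moment estimate.
    beta matches the PR's momentum: 0.99; beta2/eps are assumed."""
    momentum_buf.mul_(beta).add_(grad)
    u = newton_schulz(momentum_buf)
    # Per-row (per-output-neuron) second moment of the orthogonalized update.
    row_v.mul_(beta2).add_((1 - beta2) * u.pow(2).mean(dim=1, keepdim=True))
    return u / (row_v.sqrt() + eps)
```

Per the other_params above, the momentum itself is warmed up from 0.92 to 0.99 over the first 1500 steps; that schedule is omitted from this sketch.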
Weight Averaging
  • SWA: parameters: {"checkpoints_averaged":7,"checkpoint_interval_steps":200}
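The SWA entry above amounts to a uniform average of the last 7 checkpoints, taken every 200 steps. A minimal sketch over in-memory state dicts (checkpoint loading from disk omitted):

```python
import torch

def average_state_dicts(state_dicts):
    """SWA-style checkpoint averaging: uniform running mean over a list of
    state_dicts (the PR averages 7 checkpoints taken every 200 steps)."""
    avg = {k: v.float().clone() for k, v in state_dicts[0].items()}
    for i, sd in enumerate(state_dicts[1:], start=2):
        for k in avg:
            # Incremental running mean: avg_i = avg_{i-1} + (x_i - avg_{i-1}) / i
            avg[k] += (sd[k].float() - avg[k]) / i
    return avg
```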
Evaluation
  • sliding window eval: parameters: {"stride":64,"context_length":1024}
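The sliding-window eval above can be sketched in the usual strided-perplexity style: slide a 1024-token window by 64, scoring only the tokens not yet covered, so almost every token is predicted with near-maximal left context. The helper below is an illustration assuming `model(ids)` returns logits of shape (1, T, vocab); it reports bits per token (converting to bits per byte would additionally use the byte count).

```python
import math
import torch

@torch.no_grad()
def sliding_window_bpb(model, tokens, context_length=1024, stride=64):
    """Strided sliding-window eval sketch. `tokens` is a 1-D LongTensor;
    `model(ids)` is assumed to return logits of shape (1, T, vocab)."""
    total_nll, n_scored, prev_end = 0.0, 0, 0
    for begin in range(0, len(tokens) - 1, stride):
        end = min(begin + context_length, len(tokens))
        ids = tokens[begin:end].unsqueeze(0)                      # (1, T)
        logp = torch.log_softmax(model(ids), dim=-1)              # (1, T, V)
        # NLL of predicting token t+1 from positions <= t.
        tok_nll = -logp[0, :-1].gather(-1, ids[0, 1:, None]).squeeze(-1)
        trg = min(end - prev_end, tok_nll.numel())                # new targets only
        total_nll += tok_nll[-trg:].sum().item()
        n_scored += trg
        prev_end = end
        if end == len(tokens):
            break
    return total_nll / n_scored / math.log(2)  # nats -> bits per token
```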
Sequence Length
  • train_length: null, eval_length: 1024
LR Schedule
  • warmdown: parameters: {"warmdown_iters":3000}
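The warmdown entry above is commonly a constant learning rate followed by a linear decay to zero over the final steps. A minimal sketch, assuming that shape (only warmdown_iters=3000 is stated in the PR):

```python
def warmdown_lr(step: int, total_steps: int, base_lr: float,
                warmdown_iters: int = 3000) -> float:
    """Constant LR, then linear decay to zero over the last
    `warmdown_iters` steps. The constant-then-linear shape is an
    assumption; the PR only specifies warmdown_iters=3000."""
    start = total_steps - warmdown_iters
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_iters
```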
Compression
  • zstd: level: 22

Novel Contributions

  • Per-row int6 fake quantization with straight-through estimator to reduce post-training quantization gap
  • Keeping the tied embedding/logit head in fp16 to avoid quantization sensitivity
  • Using a wider 3x MLP made possible by int6 compression savings
  • Replacing Muon with NorMuon, which row-normalizes the Newton-Schulz-orthogonalized updates
  • Applying stochastic weight averaging over the final warmdown checkpoints
  • Using sliding-window evaluation with stride 64 to improve measured val_bpb