PR #89

Status: open

record: val_bpb=1.1622, NorMuon + int6 STE + SWA + sliding window

  • val_bpb: 1.1622
  • Architecture: Transformer
  • Optimizer: NorMuon
  • Artifact Size: 15.5 MB

Training Techniques

Quantization
  • STE QAT: bits: 6, scope: per-row block weights
  • fp16: bits: 16, scope: tied embeddings / logit head
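The int6 STE QAT entry above can be sketched as per-row symmetric fake quantization with a straight-through estimator: the forward pass sees quantized weights, while gradients flow through as if quantization were the identity. This is a minimal illustration; the PR's exact scaling and rounding details are assumptions here.

```python
import torch

def fake_quant_int6_per_row(w: torch.Tensor) -> torch.Tensor:
    """Per-row symmetric int6 fake quantization with a straight-through
    estimator (STE). Illustrative sketch, not the PR's exact code."""
    levels = 2 ** (6 - 1) - 1  # 31 representable magnitudes for signed int6
    # One scale per row, from that row's max absolute weight.
    scale = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / levels
    q = torch.round(w / scale).clamp(-levels, levels) * scale
    # STE: forward uses q, backward treats quantization as identity.
    return w + (q - w).detach()
```

Keeping the tied embedding/logit head out of this path (in fp16, per the entry above) avoids quantizing the most quantization-sensitive parameters.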
Architecture
  • MLP3x: wider MLP with 3x hidden size (1536), enabled by int6 compression savings; parameters: {"hidden_dim":1536}
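The MLP3x block above only specifies hidden_dim=1536; a standard two-layer transformer MLP with that hidden width might look like the sketch below. The model dimension of 512 (so that 1536 is 3x) and the GELU activation are assumptions, not stated in the PR.

```python
import torch
import torch.nn as nn

class MLP3x(nn.Module):
    """Wider transformer MLP block with hidden_dim=1536 (per the PR).
    dim=512 and GELU are illustrative assumptions."""
    def __init__(self, dim: int = 512, hidden: int = 1536):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden)  # expand to the 3x hidden width
        self.fc2 = nn.Linear(hidden, dim)  # project back to model dim
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(self.act(self.fc1(x)))
```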
Optimizer
  • NorMuon: momentum: 0.99, weight_decay: null
    other_params: {"matrix_lr":0.02,"scalar_lr":0.02,"tied_embed_lr":0.03,"muon_momentum_warmup_steps":1500,"muon_momentum_warmup_start":0.92}
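The core idea behind the NorMuon entry above can be sketched as Muon's Newton-Schulz orthogonalization followed by a per-row second-moment normalization of the update. The iteration coefficients follow the published Muon recipe; the buffer names, beta2, and epsilon below are illustrative assumptions, not the PR's exact implementation.

```python
import torch

def newton_schulz(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Muon-style quintic Newton-Schulz iteration: approximately
    orthogonalizes a gradient/momentum matrix."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

def normuon_step(grad, momentum_buf, row_v, beta=0.99, beta2=0.95, eps=1e-8):
    """NorMuon-style update sketch: orthogonalize the momentum as in Muon,
    then normalize each output row by a running second-moment estimate.
    beta matches the PR's momentum: 0.99; beta2/eps are assumed."""
    momentum_buf.mul_(beta).add_(grad)
    u = newton_schulz(momentum_buf)
    # Per-row (per-output-neuron) second moment of the orthogonalized update.
    row_v.mul_(beta2).add_((1 - beta2) * u.pow(2).mean(dim=1, keepdim=True))
    return u / (row_v.sqrt() + eps)
```

Per the other_params above, the momentum itself is warmed up from 0.92 to 0.99 over the first 1500 steps; that schedule is omitted from this sketch.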
Weight Averaging
  • SWA: parameters: {"checkpoints_averaged":7,"checkpoint_interval_steps":200}
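The SWA entry above amounts to a uniform average of the last 7 checkpoints, taken every 200 steps. A minimal sketch over in-memory state dicts (checkpoint loading from disk omitted):

```python
import torch

def average_state_dicts(state_dicts):
    """SWA-style checkpoint averaging: uniform running mean over a list of
    state_dicts (the PR averages 7 checkpoints taken every 200 steps)."""
    avg = {k: v.float().clone() for k, v in state_dicts[0].items()}
    for i, sd in enumerate(state_dicts[1:], start=2):
        for k in avg:
            # Incremental running mean: avg_i = avg_{i-1} + (x_i - avg_{i-1}) / i
            avg[k] += (sd[k].float() - avg[k]) / i
    return avg
```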
Evaluation
  • sliding window eval: parameters: {"stride":64,"context_length":1024}
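The sliding-window eval above can be sketched in the usual strided-perplexity style: slide a 1024-token window by 64, scoring only the tokens not yet covered, so almost every token is predicted with near-maximal left context. The helper below is an illustration assuming `model(ids)` returns logits of shape (1, T, vocab); it reports bits per token (converting to bits per byte would additionally use the byte count).

```python
import math
import torch

@torch.no_grad()
def sliding_window_bpb(model, tokens, context_length=1024, stride=64):
    """Strided sliding-window eval sketch. `tokens` is a 1-D LongTensor;
    `model(ids)` is assumed to return logits of shape (1, T, vocab)."""
    total_nll, n_scored, prev_end = 0.0, 0, 0
    for begin in range(0, len(tokens) - 1, stride):
        end = min(begin + context_length, len(tokens))
        ids = tokens[begin:end].unsqueeze(0)                      # (1, T)
        logp = torch.log_softmax(model(ids), dim=-1)              # (1, T, V)
        # NLL of predicting token t+1 from positions <= t.
        tok_nll = -logp[0, :-1].gather(-1, ids[0, 1:, None]).squeeze(-1)
        trg = min(end - prev_end, tok_nll.numel())                # new targets only
        total_nll += tok_nll[-trg:].sum().item()
        n_scored += trg
        prev_end = end
        if end == len(tokens):
            break
    return total_nll / n_scored / math.log(2)  # nats -> bits per token
```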
Sequence Length
  • train_length: null, eval_length: 1024
LR Schedule
  • warmdown: parameters: {"warmdown_iters":3000}
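The warmdown entry above is commonly a constant learning rate followed by a linear decay to zero over the final steps. A minimal sketch, assuming that shape (only warmdown_iters=3000 is stated in the PR):

```python
def warmdown_lr(step: int, total_steps: int, base_lr: float,
                warmdown_iters: int = 3000) -> float:
    """Constant LR, then linear decay to zero over the last
    `warmdown_iters` steps. The constant-then-linear shape is an
    assumption; the PR only specifies warmdown_iters=3000."""
    start = total_steps - warmdown_iters
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_iters
```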
Compression
  • zstd: level: 22

Novel Contributions

  • Per-row int6 fake quantization with straight-through estimator to reduce post-training quantization gap
  • Keeping the tied embedding/logit head in fp16 to avoid quantization sensitivity
  • Using a wider 3x MLP made possible by int6 compression savings
  • Replacing Muon with NorMuon, which row-normalizes the Newton-Schulz-orthogonalized updates
  • Applying stochastic weight averaging over the final warmdown checkpoints
  • Using sliding-window evaluation with stride 64 to improve measured val_bpb