PR #147

open

Record/smaller batch sota, val_bpb 1.16314679 (post-quant, int6+zlib, sliding eval)

by ankitmaloo
val_bpb: 1.1631
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,934,552 bytes

Training Techniques

Quantization
int6
bits: 6
scope: all
int8
bits: 8
scope: all
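The int6 entry above can be sketched as symmetric quantization to the signed 6-bit range [-31, 31]. This is an illustrative reconstruction, not the PR's actual code; the function name and per-tensor scale are assumptions.

```python
import numpy as np

def quantize_int6(w: np.ndarray):
    """Symmetric per-tensor quantization to the signed 6-bit range [-31, 31].

    Sketch only; the PR applies quantization with scope "all", i.e. to
    every weight tensor. Values are stored in int8 containers.
    """
    scale = np.abs(w).max() / 31.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale) -> np.ndarray:
    """Map quantized integers back to approximate float weights."""
    return q.astype(np.float32) * scale

w = np.array([0.5, -0.2, 0.3, -0.5], dtype=np.float32)
q, s = quantize_int6(w)
w_hat = dequantize(q, s)
```

With rounding to the nearest level, the reconstruction error per weight is bounded by half the scale step.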
Compression
  • zlib (level: null)
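The artifact size comes from zlib-compressing the serialized quantized weights; with level: null above, the library default level is presumably used. A minimal sketch, with hypothetical int6 values stored one per byte:

```python
import zlib
import numpy as np

# Hypothetical quantized weights: int6 values held in int8 containers.
q = np.random.default_rng(0).integers(-31, 32, size=10_000, dtype=np.int8)

raw = q.tobytes()
packed = zlib.compress(raw)   # level omitted -> zlib's default, matching "level: null"

restored = np.frombuffer(zlib.decompress(packed), dtype=np.int8)
```

Because int6 values only span 63 of the 256 possible byte values, even the default entropy coder shrinks the byte stream losslessly.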
Evaluation
  • sliding window eval (parameters: null)
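Sliding window evaluation typically means scoring a long sequence with overlapping fixed-size windows so every token is evaluated with substantial left context, counting each token only once. The PR does not state its parameters, so this sketch, including the hypothetical `nll_fn` model call, is an assumption:

```python
import math

def sliding_window_nlls(tokens, window, stride, nll_fn):
    """Score every token by sliding a fixed-size window with overlap.

    Only tokens not already scored by the previous window contribute.
    `nll_fn(ctx)` is a hypothetical model call returning per-token
    negative log-likelihoods (in nats) for the context `ctx`.
    """
    nlls = []
    prev_end = 0
    for start in range(0, len(tokens), stride):
        end = min(start + window, len(tokens))
        per_tok = nll_fn(tokens[start:end])
        nlls.extend(per_tok[prev_end - start:])  # keep only newly scored tokens
        prev_end = end
        if end == len(tokens):
            break
    return nlls

def bits_per_byte(nlls, n_bytes):
    """Convert total nats over the sequence into bits per byte (bpb)."""
    return sum(nlls) / (math.log(2) * n_bytes)
```

Dividing by the byte count rather than the token count is what makes bpb comparable across tokenizers.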
Architecture
  • tied embeddings: input and output embeddings are tied (parameters: null)
  • KV head count: uses fewer KV heads than attention heads (parameters: {"num_heads": 8, "num_kv_heads": 4})
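With num_heads=8 and num_kv_heads=4, each key/value head serves two query heads (grouped-query attention). One common way to realize this, sketched here as an assumption about the mechanism rather than the PR's code, is to repeat each KV head across its query group:

```python
import numpy as np

num_heads, num_kv_heads = 8, 4           # from the parameters above
group = num_heads // num_kv_heads        # query heads per KV head -> 2

# Hypothetical key tensor of shape (kv_heads, seq_len, head_dim).
k = np.random.default_rng(0).normal(size=(num_kv_heads, 16, 32))

# Each KV head is reused by `group` consecutive query heads.
k_expanded = np.repeat(k, group, axis=0)
```

The KV cache stays half the size of full multi-head attention while the query side keeps all 8 heads.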
Optimizer
  • Muon (weight_decay: null, momentum: 0.99, other_params: {"muon_momentum_warmup_start": 0.92, "muon_momentum_warmup_steps": 1500})
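The momentum warmup parameters above suggest ramping Muon's momentum from 0.92 to its final 0.99 over the first 1,500 steps. Linear interpolation is an assumption; the PR records only the endpoints and step count:

```python
def muon_momentum(step, start=0.92, final=0.99, warmup_steps=1500):
    """Warm momentum from `start` to `final` over `warmup_steps` steps.

    The linear shape is an assumption; the listed parameters give only
    the start value, the final value (0.99), and the warmup length.
    """
    if step >= warmup_steps:
        return final
    frac = step / warmup_steps
    return start + frac * (final - start)
```

Starting with lower momentum keeps early, noisy updates from being amplified before the loss surface statistics stabilize.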
LR Schedule
  • warmdown (parameters: {"warmdown_steps": 3000})
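A warmdown schedule holds the learning rate constant and then decays it over the final steps of training. The linear shape and the base LR here are illustrative assumptions; the PR specifies only warmdown_steps=3000:

```python
def lr_at(step, total_steps, base_lr=0.02, warmdown_steps=3000):
    """Constant LR, then a linear "warmdown" to zero over the final
    `warmdown_steps` steps. `base_lr` and the linear decay shape are
    illustrative; only warmdown_steps comes from the PR.
    """
    steps_left = total_steps - step
    if steps_left >= warmdown_steps:
        return base_lr
    return base_lr * steps_left / warmdown_steps
```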
Sequence Length
  • train_length: 4096, eval_length: 4096
Other
  • Tighter int8 clipping to retain more of the weight-distribution tail (parameters: {"int8_clip_percentile": 99.99995})
  • Higher-precision per-row scales to reduce scale quantization error (parameters: {"int8_per_row_scale_dtype": "float32"})
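The two int8 tweaks above can be combined in one sketch: clip each row at the 99.99995th percentile of |w| instead of the absolute max, and keep the resulting per-row scales in float32 rather than a lower-precision dtype. This is a reconstruction of the described technique, not the PR's exact implementation:

```python
import numpy as np

def quantize_int8_rows(w, clip_percentile=99.99995):
    """Per-row symmetric int8 quantization with the two tweaks above:

    (a) the clipping threshold is a high percentile of |w| rather than the
        row max, so a single outlier weight cannot inflate the scale, while
        99.99995% of the tail is still represented;
    (b) the per-row scales are stored in float32 to reduce scale
        quantization error.
    """
    clip = np.percentile(np.abs(w), clip_percentile, axis=1, keepdims=True)
    scale = (clip / 127.0).astype(np.float32)   # higher-precision per-row scales
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 1024)).astype(np.float32)
q, scale = quantize_int8_rows(w)
```

A looser percentile would clip more of the tail; 99.99995 trades essentially no clipped weights for a slightly finer quantization grid.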

Novel Contributions

  • Tighter int8 clipping percentile to preserve more tail weights during quantization
  • Higher-precision per-row int8 scales using float32
  • Strong Muon optimizer tuning with momentum warmup and extended warmdown
  • Sliding window evaluation
  • Smaller batch training setup on seq4096 trunk