PR #567

open

Non-record: 1.366 BPB Baseline (SmearGate + Muon, int6, zstd)

by nitSubedi
val_bpb
1.3660
Architecture
Transformer
Optimizer
Muon
Artifact Size
12MB

Training Techniques

Architecture
SmearGate
SmearGate for local context
parameters: null
Optimizer
Muon
weight_decay: 0.02
momentum: null
other_params: null
Quantization
int6
bits: 6
scope: null
Compression
zstd
level: 22
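
The quantization and compression settings above (6 bits, zstd level 22) can be illustrated with a minimal sketch. Everything here is an assumption for illustration, not the PR's actual code: the function names are hypothetical, the scheme is symmetric per-tensor quantization with four 6-bit values packed into three bytes, and the summary does not say which scope or scaling the PR uses. The third-party `zstandard` package is used if installed; otherwise the sketch falls back to stdlib `zlib`, which is a stand-in and not what the PR reports.

```python
import zlib


def quantize_int6(weights):
    """Symmetric per-tensor quantization to signed integers in [-31, 31]."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 31 if max_abs > 0 else 1.0
    q = [max(-31, min(31, round(w / scale))) for w in weights]
    return q, scale


def pack_int6(q):
    """Pack signed 6-bit values, four at a time, into three bytes."""
    padded = list(q) + [0] * (-len(q) % 4)  # pad to a multiple of four
    out = bytearray()
    for i in range(0, len(padded), 4):
        word = 0
        for v in padded[i:i + 4]:
            word = (word << 6) | (v & 0x3F)  # keep the low 6 bits (two's complement)
        out += word.to_bytes(3, "big")
    return bytes(out)


def unpack_int6(data, count):
    """Inverse of pack_int6; returns the first `count` signed values."""
    vals = []
    for i in range(0, len(data), 3):
        word = int.from_bytes(data[i:i + 3], "big")
        for shift in (18, 12, 6, 0):
            u = (word >> shift) & 0x3F
            vals.append(u - 64 if u >= 32 else u)  # restore the sign
    return vals[:count]


def compress(data, level=22):
    """Compress with zstd if available, else zlib (stand-in, assumption)."""
    try:
        import zstandard
        return "zstd", zstandard.ZstdCompressor(level=level).compress(data)
    except ImportError:
        return "zlib", zlib.compress(data, 9)


def decompress(codec, blob):
    if codec == "zstd":
        import zstandard
        return zstandard.ZstdDecompressor().decompress(blob)
    return zlib.decompress(blob)


if __name__ == "__main__":
    weights = [1.0, -1.0, 0.0, 0.1]
    q, scale = quantize_int6(weights)
    codec, blob = compress(pack_int6(q))
    restored = unpack_int6(decompress(codec, blob), len(q))
    print(codec, q, [v * scale for v in restored])
```

Packing to exactly 6 bits per weight before compression keeps the artifact small even when the entropy coder finds little extra redundancy; whether the PR packs bits or stores one value per byte and relies on zstd is not stated in the summary.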

Novel Contributions

  • Use of SmearGate for local context
  • Application of Muon optimizer with 0.02 weight decay
  • Int6 quantization
  • zstd compression at level 22
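
The summary does not define SmearGate beyond "local context". One plausible reading, stated here purely as an assumption, is a learned per-channel gate that blends each token's embedding with the previous token's embedding, so every position sees a little of its left neighbor before attention runs. The sketch below shows that reading in plain Python; the function name, argument names, and the handling of the first position are all hypothetical.

```python
import math


def smear_gate(x, gate_logits):
    """Blend each position with its predecessor (assumed SmearGate form).

    x:           list of T token vectors, each a list of D floats
    gate_logits: D learned logits; sigmoid(logit) is the share taken
                 from the previous token in that channel
    """
    gate = [1.0 / (1.0 + math.exp(-z)) for z in gate_logits]
    out = []
    for t, cur in enumerate(x):
        # Assumption: the first position has no predecessor and is left unchanged.
        prev = x[t - 1] if t > 0 else cur
        out.append([(1.0 - g) * c + g * p for g, c, p in zip(gate, cur, prev)])
    return out


if __name__ == "__main__":
    tokens = [[1.0, 2.0], [3.0, 4.0]]
    print(smear_gate(tokens, [0.0, 0.0]))  # gate = 0.5 in both channels
```

The mix is causal because position t only reads positions t and t-1, and strongly negative logits reduce it to the identity, so under this reading the gate can be learned away where it does not help.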