PR #358 (open)

Feature/sota optimizations

by adityagupta26

val_bpb: 1.1400
Training Techniques

Architecture
BigramHash — adds token-pair hashing for cheap local context.
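A minimal sketch of what a bigram-hash embedding might look like: the (previous, current) token pair is hashed into a small table, and the looked-up vector supplements the regular token embedding. All names, sizes, and the hash mixer here are illustrative assumptions, not the PR's actual code.

```python
import numpy as np

class BigramHash:
    """Hypothetical bigram-hash embedding table (illustrative sizes)."""

    def __init__(self, table_size=4096, dim=8, seed=0):
        rng = np.random.default_rng(seed)
        self.table_size = table_size
        self.table = rng.normal(0, 0.02, size=(table_size, dim))

    def bucket(self, prev_tok, tok):
        # Cheap mixing hash of the token pair; any decent mixer works.
        return (prev_tok * 1000003 + tok * 999983) % self.table_size

    def lookup(self, tokens):
        # tokens: 1-D array of token ids; position 0 has no predecessor,
        # so it is paired with a sentinel id of 0.
        prev = np.concatenate(([0], tokens[:-1]))
        idx = [self.bucket(p, t) for p, t in zip(prev, tokens)]
        # Result would be added to the ordinary token embeddings.
        return self.table[idx]

bh = BigramHash()
vecs = bh.lookup(np.array([5, 17, 5, 17]))
```

Because the pair is hashed, the table stays small regardless of vocabulary size; collisions are accepted as noise.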
SmearGate — learns a gate to blend information between adjacent tokens.
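One plausible reading of such a "smear" gate, sketched below: each position learns how much of the previous token's activation to blend in. The gate parameterization (a single learned projection `w` plus bias `b`) is an assumption for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def smear_gate(x, w, b):
    """Blend each position with its predecessor via a learned gate.

    x: (seq_len, dim) activations; w: (dim, 1) gate weights (illustrative).
    """
    prev = np.roll(x, shift=1, axis=0)
    prev[0] = 0.0                  # first token has no predecessor
    g = sigmoid(x @ w + b)         # (seq_len, 1) per-position gate in (0, 1)
    return x + g * prev            # smear adjacent-token information forward

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
w = rng.normal(size=(8, 1)) * 0.1
out = smear_gate(x, w, b=0.0)
```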
Initialization
OrthoInit — linear layers use orthogonal initialization.
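Orthogonal initialization is a standard technique (frameworks expose it directly, e.g. `torch.nn.init.orthogonal_`); a self-contained sketch via QR decomposition:

```python
import numpy as np

def ortho_init(rows, cols, gain=1.0, seed=0):
    """Return a (rows, cols) matrix with orthonormal rows or columns."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(max(rows, cols), min(rows, cols)))
    q, r = np.linalg.qr(a)
    # Sign correction so the result is uniform over orthogonal matrices.
    q *= np.sign(np.diag(r))
    if rows < cols:
        q = q.T
    return gain * q[:rows, :cols]

W = ortho_init(8, 8)
```

Orthogonal weights preserve the norm of activations at initialization, which helps early-training signal propagation.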
Quantization
STE QAT — bits: 8, scope: all
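The core of STE-based quantization-aware training, sketched below: the forward pass rounds weights onto an int8 grid ("fake quantization"), while the backward pass pretends `round()` is the identity so gradients flow through. The per-tensor symmetric scale is an assumption; the PR's actual scaling and clipping may differ.

```python
import numpy as np

def fake_quant_int8(w):
    """Forward pass: quantize to int8 grid points, then dequantize."""
    scale = np.abs(w).max() / 127.0 + 1e-12   # per-tensor symmetric scale
    q = np.clip(np.round(w / scale), -128, 127)
    return q * scale

def ste_grad(upstream):
    """Backward pass (STE): treat d(fake_quant)/dw as 1 in-range."""
    return upstream

w = np.array([-1.0, -0.4, 0.0, 0.3, 1.0])
wq = fake_quant_int8(w)
```

The model thus trains against its own quantization error, so the final int8 weights lose little accuracy.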
Weight Averaging
SWA — phase: warmdown
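Stochastic weight averaging restricted to the warmdown phase amounts to keeping a running equal-weight mean of checkpoints taken while the learning rate decays. A minimal sketch (the snapshot cadence is an assumption):

```python
import numpy as np

class SWAAverager:
    """Running equal-weight mean of checkpoint weights (warmdown only)."""

    def __init__(self):
        self.avg = None
        self.n = 0

    def update(self, weights):
        self.n += 1
        if self.avg is None:
            self.avg = weights.astype(float).copy()
        else:
            self.avg += (weights - self.avg) / self.n  # incremental mean

swa = SWAAverager()
for step_weights in ([1.0, 3.0], [3.0, 5.0], [5.0, 7.0]):
    swa.update(np.array(step_weights))
```

At the end of warmdown, `swa.avg` replaces the final weights; averaging late checkpoints tends to land in a flatter, better-generalizing region.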
Optimizer
Muon — weight-decay support enabled; weight_decay and momentum values not specified.
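Muon's core idea is to orthogonalize the momentum matrix with a Newton–Schulz iteration before applying it, and the PR adds decoupled weight decay on top. The sketch below uses the simple cubic NS iteration for clarity; the actual Muon optimizer uses a tuned quintic iteration, and the hyperparameters here are illustrative.

```python
import numpy as np

def newton_schulz_orth(G, steps=50):
    """Approximately orthogonalize G (cubic Newton-Schulz variant)."""
    X = G / (np.linalg.norm(G) + 1e-7)    # scale singular values into (0, 1]
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X   # pushes all singular values to 1
    return X

def muon_step(W, momentum, lr=0.02, weight_decay=0.01):
    """One Muon-style update with decoupled weight decay (illustrative)."""
    O = newton_schulz_orth(momentum)
    return W - lr * (O + weight_decay * W)

rng = np.random.default_rng(0)
M = rng.normal(size=(6, 6))
O = newton_schulz_orth(M)
W_new = muon_step(M, M)
```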
Regularization
Weight decay — parameters not specified.
Other
Magnitude pruning — zeros out the smallest 3% of weights post-training (prune_fraction: 0.03).
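Post-training magnitude pruning at prune_fraction = 0.03 can be sketched as follows (per-tensor global ranking is an assumption; the PR might rank per layer):

```python
import numpy as np

def magnitude_prune(w, prune_fraction=0.03):
    """Zero out the smallest fraction of weights by absolute value."""
    k = int(w.size * prune_fraction)
    if k == 0:
        return w.copy()
    # k-th smallest absolute value becomes the pruning threshold.
    threshold = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    out = w.copy()
    out[np.abs(out) <= threshold] = 0.0
    return out

w = np.random.default_rng(1).normal(size=200)
pruned = magnitude_prune(w)
```

Pruning only 3% barely moves the loss but improves the artifact's compressibility, which pairs naturally with the zstd step below.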
Compression
zstd — level: 22
Evaluation
Sliding window eval — stride: 64
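In sliding-window evaluation, the model sees a full context window for each chunk but loss is scored only on the final `stride` tokens, so every scored token has near-maximal left context. A sketch of the span bookkeeping, assuming a window of 256 (only the stride of 64 is given by the PR; the first window is often scored in full, omitted here for brevity):

```python
def sliding_eval_spans(n_tokens, window=256, stride=64):
    """Yield (context_span, scored_span) pairs for sliding-window eval."""
    spans = []
    start = 0
    while start + window <= n_tokens:
        ctx = (start, start + window)                      # fed to the model
        score = (start + window - stride, start + window)  # loss computed here
        spans.append((ctx, score))
        start += stride
    return spans

spans = sliding_eval_spans(n_tokens=384)
```

This costs window/stride forward passes per token scored, trading compute for a fairer bits-per-byte estimate.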

Novel Contributions

  • BigramHash embedding for cheap local context
  • SmearGate for blending adjacent token information
  • Orthogonal initialization for linear layers
  • STE-based quantization-aware training
  • Stochastic Weight Averaging during warmdown
  • Muon optimizer with weight decay support
  • Magnitude pruning of the smallest 3% of weights
  • Maximum Zstandard compression for the artifact
  • Sliding window evaluation with stride 64