PR #365 (open)

submission: 10L Int5-MLP + Aggressive Warmdown (WD=20000) — targeting <1.14 bpb

by outsourc-eView on GitHub

val_bpb: 1.1574
Architecture: 10L Transformer
Optimizer: Muon
Artifact Size: (not reported)

Training Techniques

Quantization: int5
  bits: 5
  scope: MLP
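As a rough illustration of the Int5 MLP setting above, here is a minimal symmetric per-tensor int5 round-trip in NumPy. The submission does not specify its exact quantization scheme, so the per-tensor scaling and the function names are assumptions for this sketch:

```python
import numpy as np

def quantize_int5(w: np.ndarray):
    """Symmetric per-tensor int5 quantization: 5 bits -> levels in [-15, 15].
    (One plausible scheme; the submission's exact method is not stated.)"""
    qmax = 2 ** (5 - 1) - 1  # 15
    m = np.abs(w).max()
    scale = m / qmax if m > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

# Quantize a stand-in MLP weight matrix and measure round-trip error.
rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32) * 0.02
q, s = quantize_int5(w)
w_hat = dequantize(q, s)
max_err = np.abs(w - w_hat).max()  # bounded by scale / 2
```

With symmetric rounding the worst-case per-weight error is half the scale, which is the quantity the "post-quantization penalty" bullets below are implicitly about.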
Architecture: BigramHash
  Uses BigramHash as part of the model setup.
  parameters: {"dimensions": 10240}
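The BigramHash entry reports only `dimensions: 10240`. One plausible reading, sketched below, is a hashed bigram-embedding table of 10240 rows added to the token embedding; the hash function, the model width `D_MODEL`, the vocabulary size, and the BOS handling are all hypothetical:

```python
import numpy as np

BIGRAM_DIM = 10240  # table size from the submission's parameters
D_MODEL = 64        # illustrative model width (not from the submission)

def bigram_bucket(prev_tok: int, tok: int, n_buckets: int = BIGRAM_DIM) -> int:
    # Simple multiplicative hash of the (previous, current) token pair.
    return (prev_tok * 1000003 + tok) % n_buckets

rng = np.random.default_rng(0)
bigram_table = rng.standard_normal((BIGRAM_DIM, D_MODEL)).astype(np.float32) * 0.02
tok_emb = rng.standard_normal((50304, D_MODEL)).astype(np.float32) * 0.02

def embed(tokens: list[int]) -> np.ndarray:
    """Token embedding plus a hashed-bigram embedding for each position."""
    out = np.zeros((len(tokens), D_MODEL), dtype=np.float32)
    prev = 0  # assume token id 0 acts as BOS for the first position
    for i, t in enumerate(tokens):
        out[i] = tok_emb[t] + bigram_table[bigram_bucket(prev, t)]
        prev = t
    return out

x = embed([1, 5, 1, 5])  # positions 1 and 3 share the bigram (1, 5)
```

The design intuition is that a large hashed table lets the model memorize frequent bigrams cheaply, freeing transformer capacity for longer-range structure.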
Optimizer: Muon
  weight_decay: 0.04
  momentum: null
  other_params: null
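For context on the Muon entry, here is a minimal sketch of a Muon-style update: momentum, then approximate orthogonalization of the 2D update via the quintic Newton-Schulz iteration, with decoupled weight decay at the reported 0.04. The momentum value (0.95) and learning rate are placeholders, since the submission lists `momentum: null`:

```python
import numpy as np

def newton_schulz_orth(G: np.ndarray, steps: int = 5) -> np.ndarray:
    """Approximately orthogonalize G (push its singular values toward 1)
    using the quintic Newton-Schulz iteration associated with Muon."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)  # normalize so singular values <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    return X.T if transposed else X

def muon_step(w, g, buf, lr=0.02, momentum=0.95, weight_decay=0.04):
    """One Muon-style update with decoupled weight decay (wd=0.04 as reported).
    lr and momentum are illustrative placeholders."""
    buf = momentum * buf + g
    update = newton_schulz_orth(g + momentum * buf)  # Nesterov-style lookahead
    w = w * (1 - lr * weight_decay) - lr * update
    return w, buf
```

Note the iteration only orthogonalizes approximately: a few steps bring singular values near 1 rather than exactly to 1, which is sufficient in practice.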
Weight Averaging: SWA
  parameters: {"start_frac": 0.4, "interval": 50}
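The SWA parameters read as: start averaging once 40% of training has elapsed, and take a snapshot every 50 steps. A minimal sketch of that bookkeeping, assuming a hypothetical run length of 1000 iterations:

```python
import numpy as np

TOTAL_ITERS = 1000               # illustrative; the actual run length is not stated here
START_FRAC, INTERVAL = 0.4, 50   # from the submission's SWA parameters

class SWA:
    """Running average of weight snapshots, taken every INTERVAL steps
    after START_FRAC of training has elapsed."""
    def __init__(self):
        self.avg = None
        self.n = 0

    def maybe_update(self, step: int, w: np.ndarray) -> None:
        if step < START_FRAC * TOTAL_ITERS or step % INTERVAL != 0:
            return
        self.n += 1
        if self.avg is None:
            self.avg = w.astype(np.float64).copy()
        else:
            self.avg += (w - self.avg) / self.n  # incremental mean

swa = SWA()
for step in range(TOTAL_ITERS):
    w = np.full(4, float(step))  # stand-in for the model weights at this step
    swa.maybe_update(step, w)
# With these settings: snapshots at steps 400, 450, ..., 950 (12 total).
```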
Evaluation: sliding window eval
  parameters: {"stride": 64}
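Sliding-window evaluation with `stride: 64` typically means each 64-token chunk is scored inside a longer context window ending at that chunk, so every token is evaluated once with near-maximal left context. A sketch of the window bookkeeping, with an assumed block size of 128 (the submission does not state its context length):

```python
def sliding_eval_spans(n_tokens: int, block: int = 128, stride: int = 64):
    """Tile the token stream into stride-sized scoring chunks; each chunk is
    scored inside a context window of up to `block` tokens ending at the
    chunk, so no token is scored twice."""
    spans = []
    score_start = 0
    while score_start < n_tokens:
        score_end = min(score_start + stride, n_tokens)
        ctx_start = max(0, score_end - block)
        spans.append((ctx_start, score_start, score_end))
        score_start = score_end
    return spans

# (ctx_start, score_start, score_end) triples covering a 300-token stream.
spans = sliding_eval_spans(300)
```

The smaller the stride relative to the block, the more context each scored token sees, at the cost of proportionally more forward passes.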
LR Schedule: warmdown
  parameters: {"warmdown_iters": 20000}
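A warmdown schedule is usually constant learning rate followed by a linear decay to zero over the final `warmdown_iters` steps. The sketch below assumes the run length equals `warmdown_iters` (20000), consistent with the claim that the entire run becomes a decay phase; the run length itself is an assumption:

```python
TOTAL_ITERS = 20000      # assumed equal to warmdown_iters, per the submission's claim
WARMDOWN_ITERS = 20000   # from the submission's parameters

def lr_scale(step: int) -> float:
    """Constant LR, then linear decay to 0 over the final WARMDOWN_ITERS steps.
    With WARMDOWN_ITERS == TOTAL_ITERS, decay starts at step 0."""
    warmdown_start = TOTAL_ITERS - WARMDOWN_ITERS
    if step < warmdown_start:
        return 1.0
    return (TOTAL_ITERS - step) / WARMDOWN_ITERS
```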

Novel Contributions

  • Aggressive warmdown with warmdown_iters set to 20000, making the entire training run a learning-rate decay phase
  • Reported improved post-quantization quality relative to shorter warmdown schedules
  • Observed a lower post-quantization penalty under Int5/Int6 quantization
  • Combined Int5 MLP quantization, BigramHash (dimensions 10240), Muon (weight decay 0.04), and SWA, evaluated with a sliding window (stride 64)