PR #365 (open)
Submission: 10L Int5-MLP + Aggressive Warmdown (WD=20000) — targeting <1.14 bpb
by outsourc-eView on GitHub
val_bpb: 1.1574
Architecture: 10L Transformer
Optimizer: Muon
Artifact Size: —
Training Techniques
Quantization
int5
bits: 5
scope: MLP
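The Int5 quantization scoped to the MLP weights can be sketched as follows. This is a minimal pure-Python sketch assuming a symmetric per-tensor scheme; the PR does not specify its quantizer. Note that a signed 5-bit range is [-16, 15]; the sketch clamps to ±15 to keep the scale symmetric.

```python
def quantize_int5(weights, qmax=15):
    """Symmetric per-tensor int5 quantization: map floats to integers in
    [-qmax, qmax] with a single scale. Illustrative only; the PR's exact
    quantization scheme is not stated."""
    absmax = max(abs(w) for w in weights) or 1.0
    scale = absmax / qmax
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int5(q, scale):
    """Recover approximate float weights from int5 codes and the scale."""
    return [qi * scale for qi in q]

# round-trip example on a toy weight vector
w = [0.3, -1.2, 0.75, 0.0]
q, s = quantize_int5(w)
w_hat = dequantize_int5(q, s)
```

Per-tensor scaling keeps the artifact small (one scale per weight matrix); a per-channel variant would trade a little size for lower quantization error.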
Architecture
BigramHash
Uses BigramHash as part of the model setup.
parameters: {"dimensions":10240}
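The BigramHash component with dimensions=10240 is consistent with a hashed bigram table: consecutive token pairs are hashed into a fixed number of buckets that index an auxiliary embedding. A sketch under that assumption (the hash function and the exact way the buckets feed the model are not given in the PR):

```python
NUM_BUCKETS = 10240  # matches the reported "dimensions" parameter

def bigram_bucket(prev_tok, tok, num_buckets=NUM_BUCKETS):
    """Hash a (previous, current) token-id pair into one of num_buckets.
    The multiplicative-XOR hash here is an assumption for illustration."""
    return ((prev_tok * 1000003) ^ tok) % num_buckets

def bigram_buckets(tokens):
    """Bucket index for every adjacent token pair in a sequence."""
    return [bigram_bucket(a, b) for a, b in zip(tokens, tokens[1:])]
```

Hash collisions are accepted by design: 10240 buckets cannot represent all bigrams distinctly, but the table is cheap and gives the model direct access to local bigram statistics.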
Optimizer
Muon
weight_decay: 0.04
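Muon applies momentum SGD but orthogonalizes each 2D update via a Newton-Schulz iteration before the weight step. The sketch below is a pure-Python reference, not the repo's kernel; the quintic coefficients follow the published Muon recipe, and the learning rate and momentum values are placeholders (only weight_decay=0.04 comes from the PR).

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def newton_schulz(G, steps=5):
    """Approximately orthogonalize G, pushing its singular values toward 1.
    Coefficients are the standard Muon quintic; treat this as a sketch."""
    a, b, c = 3.4445, -4.7750, 2.0315
    fro = sum(x * x for row in G for x in row) ** 0.5 or 1.0
    X = [[x / fro for x in row] for row in G]  # normalize so iteration converges
    for _ in range(steps):
        A = matmul(X, transpose(X))
        A2 = matmul(A, A)
        M = [[b * A[i][j] + c * A2[i][j] for j in range(len(A))] for i in range(len(A))]
        MX = matmul(M, X)
        X = [[a * X[i][j] + MX[i][j] for j in range(len(X[0]))] for i in range(len(X))]
    return X

def muon_step(W, G, M, lr=0.02, momentum=0.95, weight_decay=0.04):
    """One Muon update: momentum buffer M, orthogonalized step, and
    decoupled weight decay (wd=0.04 as reported in this PR)."""
    for i in range(len(M)):
        for j in range(len(M[0])):
            M[i][j] = momentum * M[i][j] + G[i][j]
    O = newton_schulz(M)
    for i in range(len(W)):
        for j in range(len(W[0])):
            W[i][j] -= lr * (O[i][j] + weight_decay * W[i][j])
    return W
```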
Weight Averaging
SWA
parameters: {"start_frac":0.4,"interval":50}
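The SWA settings above (start_frac=0.4, interval=50) amount to: begin snapshotting once 40% of training has elapsed, then fold the weights into a running average every 50 steps. A minimal sketch over flat weight lists; how the averaged weights are swapped back into the model is not specified in the PR:

```python
class SWA:
    """Stochastic Weight Averaging: running mean of weight snapshots taken
    every `interval` steps after `start_frac` of training has elapsed
    (start_frac=0.4, interval=50 per this PR's parameters)."""
    def __init__(self, total_steps, start_frac=0.4, interval=50):
        self.start_step = int(total_steps * start_frac)
        self.interval = interval
        self.avg = None
        self.n = 0

    def maybe_update(self, step, weights):
        if step < self.start_step or step % self.interval != 0:
            return
        self.n += 1
        if self.avg is None:
            self.avg = list(weights)
        else:
            # incremental running mean over the collected snapshots
            self.avg = [a + (w - a) / self.n for a, w in zip(self.avg, weights)]
```

Averaging only the tail of training pairs naturally with the long warmdown: late-run weights sit in a flatter region, where the mean of nearby iterates tends to generalize better than any single endpoint.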
Evaluation
sliding window eval
parameters: {"stride":64}
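Sliding-window evaluation with stride=64 scores each chunk of 64 tokens with (near-)full left context, instead of resetting context at fixed block boundaries. A sketch of the span bookkeeping; the window size of 1024 is an assumption, only the stride comes from the PR:

```python
def sliding_window_spans(seq_len, window=1024, stride=64):
    """Yield (start, end, score_from) spans: each step scores the tokens in
    [score_from, end) while conditioning on context from `start`, then
    advances by `stride`. Every token is scored exactly once."""
    spans = []
    pos = 0
    while pos < seq_len:
        start = max(0, pos + stride - window)
        end = min(pos + stride, seq_len)
        spans.append((start, end, pos))
        pos = end
    return spans
```

This costs roughly window/stride forward passes per token's worth of text, so a small stride like 64 is slower but gives a tighter bpb estimate than non-overlapping blocks.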
LR Schedule
warmdown
parameters: {"warmdown_iters":20000}
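The "aggressive warmdown" is a linear learning-rate decay occupying the final warmdown_iters steps; with warmdown_iters=20000 set equal to the run length, decay starts at step 0. A sketch, assuming linear decay to zero (the decay shape and base LR are not stated in the PR):

```python
def warmdown_lr(step, base_lr, total_iters, warmdown_iters=20000):
    """Constant base_lr, then linear decay to 0 over the final
    `warmdown_iters` steps. When warmdown_iters == total_iters, as in
    this PR, the entire run is the decay phase."""
    decay_start = total_iters - warmdown_iters
    if step < decay_start:
        return base_lr
    return base_lr * (total_iters - step) / warmdown_iters
```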
Novel Contributions
- Aggressive warmdown with warmdown_iters set to 20000, making the entire training run a decay phase
- Reported improved post-quantization quality compared with shorter warmdown schedules
- Observed lower post-quantization penalty under Int5/Int6 quantization
- Combined Int5 MLP quantization, BigramHash (dimensions=10240), Muon (weight_decay=0.04), and SWA, with sliding-window evaluation