PR #365 (open)
Submission: 10L Int5-MLP + Aggressive Warmdown (WD=20000) — targeting <1.14 bpb
by outsourc-eView on GitHub
val_bpb: 1.1574
Architecture: 10L Transformer
Optimizer: Muon
Artifact Size: —
Training Techniques
Quantization
int5
bits: 5
scope: MLP
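The Int5 quantization scoped to the MLP weights can be sketched as follows. This is a minimal pure-Python sketch assuming a symmetric per-tensor scheme; the PR does not specify its quantizer. Note that a signed 5-bit range is [-16, 15]; the sketch clamps to ±15 to keep the scale symmetric.

```python
def quantize_int5(weights, qmax=15):
    """Symmetric per-tensor int5 quantization: map floats to integers in
    [-qmax, qmax] with a single scale. Illustrative only; the PR's exact
    quantization scheme is not stated."""
    absmax = max(abs(w) for w in weights) or 1.0
    scale = absmax / qmax
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int5(q, scale):
    """Recover approximate float weights from int5 codes and the scale."""
    return [qi * scale for qi in q]

# round-trip example on a toy weight vector
w = [0.3, -1.2, 0.75, 0.0]
q, s = quantize_int5(w)
w_hat = dequantize_int5(q, s)
```

Per-tensor scaling keeps the artifact small (one scale per weight matrix); a per-channel variant would trade a little size for lower quantization error.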
Architecture
BigramHash
Uses BigramHash as part of the model setup.
parameters: {"dimensions":10240}
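The BigramHash component with dimensions=10240 is consistent with a hashed bigram table: consecutive token pairs are hashed into a fixed number of buckets that index an auxiliary embedding. A sketch under that assumption (the hash function and the exact way the buckets feed the model are not given in the PR):

```python
NUM_BUCKETS = 10240  # matches the reported "dimensions" parameter

def bigram_bucket(prev_tok, tok, num_buckets=NUM_BUCKETS):
    """Hash a (previous, current) token-id pair into one of num_buckets.
    The multiplicative-XOR hash here is an assumption for illustration."""
    return ((prev_tok * 1000003) ^ tok) % num_buckets

def bigram_buckets(tokens):
    """Bucket index for every adjacent token pair in a sequence."""
    return [bigram_bucket(a, b) for a, b in zip(tokens, tokens[1:])]
```

Hash collisions are accepted by design: 10240 buckets cannot represent all bigrams distinctly, but the table is cheap and gives the model direct access to local bigram statistics.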
Optimizer
Muon
weight_decay: 0.04
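Muon applies momentum SGD but orthogonalizes each 2D update via a Newton-Schulz iteration before the weight step. The sketch below is a pure-Python reference, not the repo's kernel; the quintic coefficients follow the published Muon recipe, and the learning rate and momentum values are placeholders (only weight_decay=0.04 comes from the PR).

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def newton_schulz(G, steps=5):
    """Approximately orthogonalize G, pushing its singular values toward 1.
    Coefficients are the standard Muon quintic; treat this as a sketch."""
    a, b, c = 3.4445, -4.7750, 2.0315
    fro = sum(x * x for row in G for x in row) ** 0.5 or 1.0
    X = [[x / fro for x in row] for row in G]  # normalize so iteration converges
    for _ in range(steps):
        A = matmul(X, transpose(X))
        A2 = matmul(A, A)
        M = [[b * A[i][j] + c * A2[i][j] for j in range(len(A))] for i in range(len(A))]
        MX = matmul(M, X)
        X = [[a * X[i][j] + MX[i][j] for j in range(len(X[0]))] for i in range(len(X))]
    return X

def muon_step(W, G, M, lr=0.02, momentum=0.95, weight_decay=0.04):
    """One Muon update: momentum buffer M, orthogonalized step, and
    decoupled weight decay (wd=0.04 as reported in this PR)."""
    for i in range(len(M)):
        for j in range(len(M[0])):
            M[i][j] = momentum * M[i][j] + G[i][j]
    O = newton_schulz(M)
    for i in range(len(W)):
        for j in range(len(W[0])):
            W[i][j] -= lr * (O[i][j] + weight_decay * W[i][j])
    return W
```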
Weight Averaging
SWA
parameters: {"start_frac":0.4,"interval":50}
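The SWA settings above (start_frac=0.4, interval=50) amount to: begin snapshotting once 40% of training has elapsed, then fold the weights into a running average every 50 steps. A minimal sketch over flat weight lists; how the averaged weights are swapped back into the model is not specified in the PR:

```python
class SWA:
    """Stochastic Weight Averaging: running mean of weight snapshots taken
    every `interval` steps after `start_frac` of training has elapsed
    (start_frac=0.4, interval=50 per this PR's parameters)."""
    def __init__(self, total_steps, start_frac=0.4, interval=50):
        self.start_step = int(total_steps * start_frac)
        self.interval = interval
        self.avg = None
        self.n = 0

    def maybe_update(self, step, weights):
        if step < self.start_step or step % self.interval != 0:
            return
        self.n += 1
        if self.avg is None:
            self.avg = list(weights)
        else:
            # incremental running mean over the collected snapshots
            self.avg = [a + (w - a) / self.n for a, w in zip(self.avg, weights)]
```

Averaging only the tail of training pairs naturally with the long warmdown: late-run weights sit in a flatter region, where the mean of nearby iterates tends to generalize better than any single endpoint.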
Evaluation
sliding window eval
parameters: {"stride":64}
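Sliding-window evaluation with stride=64 scores each chunk of 64 tokens with (near-)full left context, instead of resetting context at fixed block boundaries. A sketch of the span bookkeeping; the window size of 1024 is an assumption, only the stride comes from the PR:

```python
def sliding_window_spans(seq_len, window=1024, stride=64):
    """Yield (start, end, score_from) spans: each step scores the tokens in
    [score_from, end) while conditioning on context from `start`, then
    advances by `stride`. Every token is scored exactly once."""
    spans = []
    pos = 0
    while pos < seq_len:
        start = max(0, pos + stride - window)
        end = min(pos + stride, seq_len)
        spans.append((start, end, pos))
        pos = end
    return spans
```

This costs roughly window/stride forward passes per token's worth of text, so a small stride like 64 is slower but gives a tighter bpb estimate than non-overlapping blocks.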
LR Schedule
warmdown
parameters: {"warmdown_iters":20000}
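The "aggressive warmdown" is a linear learning-rate decay occupying the final warmdown_iters steps; with warmdown_iters=20000 set equal to the run length, decay starts at step 0. A sketch, assuming linear decay to zero (the decay shape and base LR are not stated in the PR):

```python
def warmdown_lr(step, base_lr, total_iters, warmdown_iters=20000):
    """Constant base_lr, then linear decay to 0 over the final
    `warmdown_iters` steps. When warmdown_iters == total_iters, as in
    this PR, the entire run is the decay phase."""
    decay_start = total_iters - warmdown_iters
    if step < decay_start:
        return base_lr
    return base_lr * (total_iters - step) / warmdown_iters
```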
Novel Contributions
- Aggressive warmdown with warmdown_iters set to 20000, making the entire training run a decay phase
- Reported improved post-quantization quality compared with shorter warmdown schedules
- Observed lower post-quantization penalty under Int5/Int6 quantization
- Combined Int5 MLP quantization, BigramHash (dimensions=10240), Muon (weight_decay=0.04), and SWA, with sliding-window evaluation