PR #744

open

WSD Cosine Decay Schedule + 10L Int5-MLP BigramHash SmearGate SWA

by ShihChunHao
val_bpb
1.2824
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,767,236 bytes

Training Techniques

LR Schedule
Warmup-Stable-Decay cosine schedule
parameters: {"warmup_fraction":0.05,"stable_fraction":0.75,"decay_fraction":0.2}
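A minimal sketch of how the listed fractions could define the schedule: 5% linear warmup, 75% hold at peak LR, and a cosine decay over the final 20%. The function name and the linear-warmup/cosine-tail shapes are assumptions; the PR only specifies the three fractions.

```python
import math

def wsd_cosine_lr(step, total_steps, peak_lr,
                  warmup_fraction=0.05, stable_fraction=0.75):
    """Warmup-Stable-Decay schedule with a cosine decay tail (illustrative)."""
    warmup_steps = int(total_steps * warmup_fraction)
    stable_steps = int(total_steps * stable_fraction)
    decay_steps = total_steps - warmup_steps - stable_steps
    if step < warmup_steps:
        # Linear warmup from ~0 up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    if step < warmup_steps + stable_steps:
        # Long stable phase: hold at peak LR.
        return peak_lr
    # Cosine decay from peak_lr down to 0 over the remaining steps.
    t = (step - warmup_steps - stable_steps) / decay_steps
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * t))
```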
Architecture
MLP3x
3x expansion MLP in the base model
parameters: null
SmearGate
SmearGate gating mechanism in the model
parameters: null
BigramHash
BigramHash feature with hash size 10240
parameters: {"dimensions":10240}
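One plausible reading of a BigramHash feature with 10240 dimensions: hash each consecutive token pair into one of 10240 buckets, which can then index an extra learned embedding table. The hash function and the sentinel for position 0 are illustrative assumptions; the PR does not specify them.

```python
NUM_BUCKETS = 10240  # "dimensions": 10240 from the PR

def bigram_bucket(prev_token: int, token: int) -> int:
    # Simple multiplicative mixing hash (illustrative; PR does not specify).
    h = (prev_token * 1000003 + token) & 0xFFFFFFFF
    h ^= h >> 16
    return h % NUM_BUCKETS

def bigram_buckets(tokens):
    # Position 0 has no previous token; pair it with a sentinel id 0.
    return [bigram_bucket(tokens[i - 1] if i > 0 else 0, tokens[i])
            for i in range(len(tokens))]
```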
Quantization
mixed int5/int6
bits: null
scope: MLP int5, attention int6
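A sketch of symmetric n-bit weight quantization matching the stated scope (int5 for MLP weights, int6 for attention weights). Per-tensor absmax scaling is an assumption; the PR does not say whether scaling is per-tensor or per-channel.

```python
def quantize(weights, bits):
    """Symmetric round-to-nearest quantization to a signed n-bit grid."""
    qmax = 2 ** (bits - 1) - 1            # 15 for int5, 31 for int6
    scale = (max(abs(w) for w in weights) / qmax) or 1.0  # avoid div-by-zero
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]
```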
Weight Averaging
SWA
parameters: {"decay":0.4,"start_frac":0.4,"every_steps":50}
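One common reading of these SWA parameters: starting at 40% of training, fold the current weights into a running average every 50 steps, with `decay=0.4` weighting the old average (new average = 0.4 · old + 0.6 · current). The update rule and seeding behavior are assumptions, as the PR only lists the three values.

```python
def maybe_update_swa(swa_weights, weights, step, total_steps,
                     decay=0.4, start_frac=0.4, every_steps=50):
    """Conditionally update an EMA-style stochastic weight average (sketch)."""
    if step < int(total_steps * start_frac) or step % every_steps != 0:
        return swa_weights  # not yet started, or off-cycle: no update
    if swa_weights is None:
        return list(weights)  # first qualifying step seeds the average
    return [decay * a + (1.0 - decay) * w
            for a, w in zip(swa_weights, weights)]
```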
Initialization
Orthogonal init
Orthogonal weight initialization
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: null
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
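A sketch of sliding-window evaluation with stride 64: slide the context window across the sequence and score only the final `stride` tokens of each window, so each token is predicted with close to the full context. The window-indexing scheme below is an assumption consistent with the stated stride.

```python
def sliding_windows(n_tokens, context_len, stride=64):
    """Yield (start, end, first_scored) spans so every token is scored once."""
    spans = []
    pos = 0  # index of the next token to be scored
    while pos < n_tokens:
        start = max(0, pos + stride - context_len)  # left edge of the context
        end = min(pos + stride, n_tokens)           # right edge (exclusive)
        spans.append((start, end, pos))             # score tokens [pos, end)
        pos = end
    return spans
```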

Novel Contributions

  • Replaces linear warmdown with a Warmup-Stable-Decay cosine learning rate schedule
  • Uses a long stable peak-LR phase to avoid premature decay under step-limited training budgets
  • Builds on a 10-layer MLP3x SmearGate BigramHash(10240) base model
  • Applies mixed int5/int6 quantization
  • Uses SWA with start fraction 0.4
  • Uses zstd-22 compression