PR #744
openWSD Cosine Decay Schedule + 10L Int5-MLP BigramHash SmearGate SWA
by ShihChunHao
val_bpb
1.2824
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,767,236 bytes
Training Techniques
LR Schedule
Warmup-Stable-Decay schedule with a cosine decay phase
parameters: {"warmup_fraction":0.05,"stable_fraction":0.75,"decay_fraction":0.2}
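The listed fractions imply a schedule shaped like the sketch below: linear warmup for 5% of steps, a constant peak-LR phase for 75%, then a cosine ramp to zero over the final 20%. This is a minimal illustration; the function name and `peak_lr` argument are assumptions, not the submission's actual code.

```python
import math

def wsd_cosine_lr(step, total_steps, peak_lr,
                  warmup_fraction=0.05, stable_fraction=0.75,
                  decay_fraction=0.2):
    """Warmup-Stable-Decay: linear warmup, constant peak phase,
    then a cosine ramp from peak_lr down to zero."""
    warmup_steps = int(total_steps * warmup_fraction)
    decay_start = warmup_steps + int(total_steps * stable_fraction)
    decay_steps = int(total_steps * decay_fraction)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)   # linear warmup
    if step < decay_start:
        return peak_lr                                 # stable phase
    progress = min(1.0, (step - decay_start) / max(1, decay_steps))
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

Holding the peak LR through the long stable phase is what distinguishes this from a plain cosine schedule, which would start decaying immediately after warmup.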
Architecture
MLP3x
3x expansion MLP in the base model
parameters: null
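A minimal sketch of a 3x-expansion MLP block (`d_model -> 3*d_model -> d_model`); the ReLU activation and absence of biases are assumptions, since the entry only specifies the expansion factor.

```python
import numpy as np

def mlp3x(x, w_in, w_out):
    """MLP block with 3x hidden expansion: d_model -> 3*d_model -> d_model.
    ReLU here is an assumption; the entry only fixes the 3x width."""
    h = np.maximum(x @ w_in, 0.0)   # (..., 3*d_model)
    return h @ w_out                # (..., d_model)

d = 8
rng = np.random.default_rng(0)
w_in = rng.standard_normal((d, 3 * d)) * 0.02
w_out = rng.standard_normal((3 * d, d)) * 0.02
y = mlp3x(rng.standard_normal((4, d)), w_in, w_out)
```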
SmearGate
SmearGate gating mechanism in the model
parameters: null
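The entry does not define SmearGate beyond its name. One plausible reading is that each position's activation is blended with a sigmoid-gated copy of the previous position's ("smearing" information forward); the sketch below is that assumption made concrete, not the submission's actual mechanism.

```python
import numpy as np

def smear_gate(x, gate_logit):
    """Hypothetical smear gate: add a sigmoid-gated copy of the previous
    position's activation to each position (x_t + sigmoid(g) * x_{t-1}).
    The exact formulation is an assumption."""
    g = 1.0 / (1.0 + np.exp(-gate_logit))   # learned scalar gate
    prev = np.roll(x, 1, axis=0)
    prev[0] = 0.0                           # first position has no predecessor
    return x + g * prev

y = smear_gate(np.ones((3, 2)), 0.0)        # gate_logit 0 -> gate = 0.5
```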
BigramHash
BigramHash feature with hash size 10240
parameters: {"dimensions":10240}
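A minimal sketch of how a hashed-bigram feature with 10240 buckets might map token pairs to indices into an auxiliary embedding table. The hash multiplier and BOS handling are illustrative assumptions.

```python
def bigram_hash(prev_tok, cur_tok, n_buckets=10240):
    """Hash a (previous, current) token-id pair into one of n_buckets.
    The multiplier is an arbitrary large prime, not the submission's choice."""
    return (prev_tok * 1000003 + cur_tok) % n_buckets

def bigram_bucket_ids(tokens, n_buckets=10240, bos_id=0):
    """One bucket id per position; the first position pairs with a BOS id.
    The ids would index an auxiliary embedding table of size n_buckets."""
    prev = [bos_id] + tokens[:-1]
    return [bigram_hash(p, c, n_buckets) for p, c in zip(prev, tokens)]
```

Hashing keeps the table at a fixed 10240 rows regardless of vocabulary size, at the cost of occasional bigram collisions.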
Quantization
mixed int5/int6
bits: null
scope: MLP int5, attention int6
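A minimal sketch of round-to-nearest quantization at a given bit width, which under this scheme would run at 5 bits for MLP weights and 6 bits for attention weights. The symmetric per-tensor scaling is an assumption; the submission may use per-channel scales.

```python
def quantize_symmetric(weights, bits):
    """Round-to-nearest symmetric quantization to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1                          # 15 for int5, 31 for int6
    scale = (max(abs(w) for w in weights) / qmax) or 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the integers and scale."""
    return [v * scale for v in q]

q, scale = quantize_symmetric([0.5, -1.0, 0.25], bits=5)
```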
Weight Averaging
SWA
parameters: {"decay":0.4,"start_frac":0.4,"every_steps":50}
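Given the `decay` parameter, the averaging reads as EMA-style rather than a uniform running mean; the sketch below works under that assumption, with weights represented as a plain dict of floats for illustration.

```python
def swa_update(avg, current, step, total_steps,
               decay=0.4, start_frac=0.4, every_steps=50):
    """Update the averaged weights every `every_steps` steps once training
    passes start_frac of total_steps; EMA-style blend with the given decay.
    Returns the (possibly unchanged) averaged weights."""
    if step < int(total_steps * start_frac) or step % every_steps != 0:
        return avg
    if avg is None:
        return dict(current)   # first snapshot seeds the average
    return {k: decay * avg[k] + (1.0 - decay) * current[k] for k in avg}
```

With `start_frac=0.4`, the first 40% of training is excluded from the average, so early high-LR iterates never contaminate the final weights.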
Initialization
Orthogonal init
Orthogonal weight initialization
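A minimal sketch of the standard construction for orthogonal initialization: QR-decompose a Gaussian matrix and keep the orthonormal factor. The sign correction and gain handling are conventional details, not confirmed by the entry.

```python
import numpy as np

def orthogonal_init(rows, cols, gain=1.0, seed=0):
    """Orthogonal initialization: QR-decompose a Gaussian matrix and keep Q,
    sign-corrected so Q is uniformly distributed over orthogonal matrices."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((max(rows, cols), min(rows, cols)))
    q, r = np.linalg.qr(a)
    q = q * np.sign(np.diag(r))   # fix the sign ambiguity of QR
    if rows < cols:
        q = q.T
    return gain * q[:rows, :cols]
```

For non-square shapes the result is semi-orthogonal: its rows (or columns, whichever are fewer) are orthonormal.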
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: null
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
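A minimal sketch of stride-64 sliding-window evaluation: each window scores only its last `stride` tokens, so every scored token sees up to `window - stride` tokens of left context. `score_fn` is a stand-in for a model call returning per-token NLLs; the window size of 512 is an illustrative assumption.

```python
def sliding_window_nll(score_fn, tokens, window=512, stride=64):
    """Total NLL over a long sequence via a sliding window; only the last
    `stride` tokens of each window are scored, with full left context."""
    total = 0.0
    for start in range(0, len(tokens), stride):
        end = min(start + stride, len(tokens))
        ctx_start = max(0, end - window)
        nlls = score_fn(tokens[ctx_start:end])  # one NLL per input token
        total += sum(nlls[-(end - start):])     # score only the new tokens
    return total
```

Compared with chopping the text into disjoint windows, this avoids penalizing tokens that would otherwise sit at a window boundary with no context, at the cost of roughly `window / stride` times more forward-pass compute.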
Novel Contributions
- Replaces linear warmdown with a Warmup-Stable-Decay cosine learning rate schedule
- Uses a long stable peak-LR phase to avoid premature decay under step-limited training budgets
- Builds on a 10-layer MLP3x SmearGate BigramHash(10240) base model
- Applies mixed int5/int6 quantization
- Uses SWA with start fraction 0.4
- Uses zstd-22 compression