PR #744

open

WSD Cosine Decay Schedule + 10L Int5-MLP BigramHash SmearGate SWA

by ShihChunHao
val_bpb
1.2824
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,767,236 bytes

Training Techniques

LR Schedule
Warmup-Stable-Decay cosine schedule
parameters: {"warmup_fraction":0.05,"stable_fraction":0.75,"decay_fraction":0.2}
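A minimal sketch of how the listed fractions could define the schedule: 5% linear warmup, 75% hold at peak LR, and a cosine decay over the final 20%. The function name and the linear-warmup/cosine-tail shapes are assumptions; the PR only specifies the three fractions.

```python
import math

def wsd_cosine_lr(step, total_steps, peak_lr,
                  warmup_fraction=0.05, stable_fraction=0.75):
    """Warmup-Stable-Decay schedule with a cosine decay tail (illustrative)."""
    warmup_steps = int(total_steps * warmup_fraction)
    stable_steps = int(total_steps * stable_fraction)
    decay_steps = total_steps - warmup_steps - stable_steps
    if step < warmup_steps:
        # Linear warmup from ~0 up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    if step < warmup_steps + stable_steps:
        # Long stable phase: hold at peak LR.
        return peak_lr
    # Cosine decay from peak_lr down to 0 over the remaining steps.
    t = (step - warmup_steps - stable_steps) / decay_steps
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * t))
```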
Architecture
MLP3x
3x expansion MLP in the base model
parameters: null
SmearGate
SmearGate gating mechanism in the model
parameters: null
BigramHash
BigramHash feature with hash size 10240
parameters: {"dimensions":10240}
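One plausible reading of a BigramHash feature with 10240 dimensions: hash each consecutive token pair into one of 10240 buckets, which can then index an extra learned embedding table. The hash function and the sentinel for position 0 are illustrative assumptions; the PR does not specify them.

```python
NUM_BUCKETS = 10240  # "dimensions": 10240 from the PR

def bigram_bucket(prev_token: int, token: int) -> int:
    # Simple multiplicative mixing hash (illustrative; PR does not specify).
    h = (prev_token * 1000003 + token) & 0xFFFFFFFF
    h ^= h >> 16
    return h % NUM_BUCKETS

def bigram_buckets(tokens):
    # Position 0 has no previous token; pair it with a sentinel id 0.
    return [bigram_bucket(tokens[i - 1] if i > 0 else 0, tokens[i])
            for i in range(len(tokens))]
```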
Quantization
mixed int5/int6
bits: null
scope: MLP int5, attention int6
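A sketch of symmetric n-bit weight quantization matching the stated scope (int5 for MLP weights, int6 for attention weights). Per-tensor absmax scaling is an assumption; the PR does not say whether scaling is per-tensor or per-channel.

```python
def quantize(weights, bits):
    """Symmetric round-to-nearest quantization to a signed n-bit grid."""
    qmax = 2 ** (bits - 1) - 1            # 15 for int5, 31 for int6
    scale = (max(abs(w) for w in weights) / qmax) or 1.0  # avoid div-by-zero
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]
```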
Weight Averaging
SWA
parameters: {"decay":0.4,"start_frac":0.4,"every_steps":50}
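One common reading of these SWA parameters: starting at 40% of training, fold the current weights into a running average every 50 steps, with `decay=0.4` weighting the old average (new average = 0.4 · old + 0.6 · current). The update rule and seeding behavior are assumptions, as the PR only lists the three values.

```python
def maybe_update_swa(swa_weights, weights, step, total_steps,
                     decay=0.4, start_frac=0.4, every_steps=50):
    """Conditionally update an EMA-style stochastic weight average (sketch)."""
    if step < int(total_steps * start_frac) or step % every_steps != 0:
        return swa_weights  # not yet started, or off-cycle: no update
    if swa_weights is None:
        return list(weights)  # first qualifying step seeds the average
    return [decay * a + (1.0 - decay) * w
            for a, w in zip(swa_weights, weights)]
```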
Initialization
Orthogonal init
Orthogonal weight initialization
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: null
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
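A sketch of sliding-window evaluation with stride 64: slide the context window across the sequence and score only the final `stride` tokens of each window, so each token is predicted with close to the full context. The window-indexing scheme below is an assumption consistent with the stated stride.

```python
def sliding_windows(n_tokens, context_len, stride=64):
    """Yield (start, end, first_scored) spans so every token is scored once."""
    spans = []
    pos = 0  # index of the next token to be scored
    while pos < n_tokens:
        start = max(0, pos + stride - context_len)  # left edge of the context
        end = min(pos + stride, n_tokens)           # right edge (exclusive)
        spans.append((start, end, pos))             # score tokens [pos, end)
        pos = end
    return spans
```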

Novel Contributions

  • Replaces linear warmdown with a Warmup-Stable-Decay cosine learning rate schedule
  • Uses a long stable peak-LR phase to avoid premature decay under step-limited training budgets
  • Builds on a 10-layer MLP3x SmearGate BigramHash(10240) base model
  • Applies mixed int5/int6 quantization
  • Uses SWA with start fraction 0.4
  • Uses zstd-22 compression