val_bpb: 1.2824
Architecture: Transformer
Optimizer: Muon
Artifact size: 15,767,236 bytes
Training Techniques

LR Schedule: Warmup-Stable-Decay cosine schedule
  parameters: {"warmup_fraction": 0.05, "stable_fraction": 0.75, "decay_fraction": 0.2}
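Only the three phase fractions are given above; as a minimal sketch (assuming a linear warmup, a cosine decay ending at zero, and a placeholder `peak_lr`), the schedule could look like:

```python
import math

def wsd_lr(step, total_steps, peak_lr=1.0,
           warmup_fraction=0.05, stable_fraction=0.75):
    """Warmup-Stable-Decay: linear warmup, long flat phase at peak LR,
    then a cosine decay over the remaining steps."""
    warmup_steps = max(int(warmup_fraction * total_steps), 1)
    stable_end = warmup_steps + int(stable_fraction * total_steps)
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps           # linear warmup
    if step < stable_end:
        return peak_lr                                       # stable phase at peak LR
    t = (step - stable_end) / max(total_steps - stable_end, 1)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * min(t, 1.0)))  # cosine decay
```

With warmup_fraction 0.05 and stable_fraction 0.75, the cosine decay covers the remaining 20% of steps, matching the decay_fraction above.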
Architecture: MLP3x (3x expansion MLP in the Transformer block)
Architecture: SmearGate (applied as part of the model architecture)
Architecture: BigramHash (hashed bigram token representation)
  parameters: {"size": 10240}
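The entry above specifies only the table size; as a hypothetical sketch, each (previous, current) token pair can be hashed into a fixed 10240-entry table whose embeddings supplement the unigram token embeddings (the hash constant and BOS placeholder here are illustrative, not the submission's actual BigramHash scheme):

```python
BIGRAM_TABLE_SIZE = 10240  # the "size" parameter above

def bigram_hash_ids(tokens, table_size=BIGRAM_TABLE_SIZE):
    """Map each (previous, current) token pair to a bucket in a
    fixed-size embedding table via a simple multiplicative hash."""
    ids = []
    prev = 0  # assumed BOS placeholder for the first position
    for tok in tokens:
        ids.append((prev * 1000003 + tok) % table_size)
        prev = tok
    return ids
```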
Quantization: mixed int5/int6
  scope: MLP and attention
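How int5 and int6 are assigned across the MLP and attention tensors is not specified; the sketch below shows only the mechanics of symmetric per-tensor quantization at a given bit width (an assumed scheme for illustration, not the submission's actual quantizer):

```python
import numpy as np

def fake_quantize(w, bits):
    """Symmetric per-tensor quantization to signed `bits`-bit integers
    (e.g. int5 range [-16, 15], int6 range [-32, 31])."""
    qmax = 2 ** (bits - 1) - 1
    amax = float(np.abs(w).max())
    scale = amax / qmax if amax > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from integers and scale."""
    return q.astype(np.float32) * scale
```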
Weight Averaging: SWA
  parameters: {"start_frac": 0.4, "every": 50}
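A minimal sketch of the averaging mechanics implied by start_frac=0.4 and every=50: begin averaging 40% of the way through training and fold in a snapshot every 50 steps (the incremental-mean form is an implementation assumption):

```python
import numpy as np

class SWA:
    """Stochastic Weight Averaging: keep a running mean of weight
    snapshots, starting after `start_frac` of training and updating
    every `every` steps."""
    def __init__(self, total_steps, start_frac=0.4, every=50):
        self.start_step = int(start_frac * total_steps)
        self.every = every
        self.avg = None   # list of averaged arrays, same shapes as model
        self.n = 0        # number of snapshots folded in

    def update(self, step, weights):
        if step < self.start_step or (step - self.start_step) % self.every:
            return
        self.n += 1
        if self.avg is None:
            self.avg = [w.copy() for w in weights]
        else:
            for a, w in zip(self.avg, weights):
                a += (w - a) / self.n     # incremental mean update
```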
Initialization: Orthogonal weight initialization
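The standard QR-based construction of an orthogonal initializer, for reference (the `gain` knob and sign fix follow common practice; neither is specified by the entry above):

```python
import numpy as np

def orthogonal_init(rows, cols, gain=1.0, rng=None):
    """Draw a (rows, cols) matrix with orthonormal rows or columns via
    QR decomposition of a Gaussian matrix."""
    if rng is None:
        rng = np.random.default_rng()
    a = rng.standard_normal((max(rows, cols), min(rows, cols)))
    q, r = np.linalg.qr(a)
    q = q * np.sign(np.diagonal(r))  # sign fix for a uniform (Haar) draw
    if rows < cols:
        q = q.T
    return gain * q
```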
Optimizer: Muon
  weight_decay: 0.04
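Muon's defining step is approximate orthogonalization of each matrix update via a quintic Newton-Schulz iteration. The sketch below uses the coefficients from the public Muon implementation; momentum accumulation and the weight_decay=0.04 application happen elsewhere in the optimizer step and are omitted here:

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=5):
    """Approximately orthogonalize a matrix update: push all singular
    values toward 1 with a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315   # coefficients from the public Muon code
    x = g / (np.linalg.norm(g) + 1e-7)  # Frobenius-normalize so singular values <= 1
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T                          # keep the Gram matrix small
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * (s @ s)) @ x
    return x.T if transposed else x
```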
Compression: zstd (level 22)
Evaluation: sliding window eval
  parameters: {"stride": 64}
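A sketch of stride-64 sliding-window evaluation: each window re-scores up to `context` tokens, but only the tokens not covered by the previous window contribute to the total, so most tokens are predicted with near-maximal context. The `nll_fn` interface (per-token negative log-likelihoods in nats) and the 1-byte-per-token bpb conversion are assumptions for illustration:

```python
import math

def sliding_window_bpb(nll_fn, tokens, context=256, stride=64):
    """Score `tokens` in overlapping windows advanced by `stride`,
    counting each token's loss exactly once."""
    total_nll, total_tokens, prev_end = 0.0, 0, 0
    for begin in range(0, len(tokens), stride):
        end = min(begin + context, len(tokens))
        nlls = nll_fn(tokens[begin:end])   # one nll per token in the window
        new = end - prev_end               # tokens not scored by earlier windows
        total_nll += sum(nlls[-new:])
        total_tokens += new
        prev_end = end
        if end == len(tokens):
            break
    return total_nll / total_tokens / math.log(2)  # nats -> bits per byte
```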
Novel Contributions
- Replaces the default linear warmdown learning-rate schedule with a Warmup-Stable-Decay cosine schedule.
- Uses a long stable peak-LR phase to avoid premature decay under step-limited training budgets.
- Builds on a strong base configuration with SmearGate, BigramHash, mixed int5/int6 quantization, Muon, SWA, and zstd-22.