PR #117 (open)
Submission: Int6 MLP3x + QAT + SlidingWindow (val_bpb: 1.1702)
by trovatochris
val_bpb: 1.1702
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,306,777 bytes
Training Techniques
Quantization
int6
bits: 6
scope: per-row weights
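A minimal sketch of symmetric per-row int6 quantization, assuming a max-abs scale per row and a symmetric [-31, 31] grid (the exact scheme in the PR may differ):

```python
import numpy as np

def quantize_per_row_int6(w: np.ndarray):
    """Symmetric per-row int6 quantization sketch.

    Each row gets its own scale, so a single outlier row does not
    inflate the quantization error of every other row.
    """
    qmax = 31  # 6-bit signed range, used symmetrically: [-31, 31]
    scales = np.abs(w).max(axis=1, keepdims=True) / qmax
    scales = np.where(scales == 0, 1.0, scales)  # guard all-zero rows
    q = np.clip(np.round(w / scales), -qmax, qmax).astype(np.int8)
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scales

w = np.random.randn(4, 8).astype(np.float32)
q, s = quantize_per_row_int6(w)
w_hat = dequantize(q, s)
```

With a max-abs scale, rounding keeps the per-element error within half a quantization step of that row's scale.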
QAT
bits: null
scope: weights
Architecture
MLP3x
Expanded MLP width by 3x.
parameters: {"multiplier":3}
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"momentum_warmup_steps":1500,"momentum_warmup_start":0.92}
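The PR gives the warmup endpoints (0.92 to 0.99) and the step count (1500); a linear ramp is an assumption about the schedule shape:

```python
def muon_momentum(step: int,
                  warmup_steps: int = 1500,
                  start: float = 0.92,
                  end: float = 0.99) -> float:
    """Linear momentum warmup sketch: ramp from `start` to `end`
    over `warmup_steps`, then hold at `end`."""
    if step >= warmup_steps:
        return end
    return start + (step / warmup_steps) * (end - start)

# momentum at the start, midpoint, and end of the warmup
values = [muon_momentum(s) for s in (0, 750, 1500)]
```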
Compression
zstd
level: 22
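The submission compresses the quantized artifact with zstd at level 22. Since zstd needs a third-party binding in Python, this sketch uses stdlib `zlib` as a stand-in to show the same pack-then-compress pipeline; int6 rows stored as int8 leave ~2 bits of slack per value for the entropy coder to reclaim:

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
# stand-in for quantized weight rows: int6 values stored one per byte
q = rng.integers(-31, 32, size=(256, 256), dtype=np.int8)

raw = q.tobytes()
compressed = zlib.compress(raw, level=9)  # zstd --ultra -22 plays this role in the PR

# round trip: the artifact must decompress to exactly the same integers
restored = np.frombuffer(zlib.decompress(compressed), dtype=np.int8).reshape(q.shape)
```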
Evaluation
sliding window eval
parameters: {"stride":64}
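Sliding-window evaluation slides a full-length context over the sequence in steps of `stride`, scoring only the final `stride` tokens of each window so every token is predicted with near-maximal left context. Only stride=64 comes from the PR; the window length and span bookkeeping here are illustrative:

```python
def sliding_window_spans(seq_len: int, window: int, stride: int):
    """Return (context_start, score_start, score_end) triples covering
    the sequence, scoring `stride` tokens per window."""
    spans = []
    for start in range(0, seq_len, stride):
        ctx_start = max(0, start + stride - window)
        spans.append((ctx_start, start, min(start + stride, seq_len)))
        if start + stride >= seq_len:
            break
    return spans

spans = sliding_window_spans(seq_len=200, window=128, stride=64)
```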
LR Schedule
warmdown
parameters: {"warmdown_iters":3000}
Other
QAT weight-snapping started at 70% of training.
parameters: {"qat_start_frac":0.7}
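A sketch of the QAT start logic: for the last 30% of training, the forward pass runs on weights snapped to the int6 per-row grid, so the model adapts to the grid it will ship with. In a real QAT setup gradients pass straight through to the full-precision copy; this sketch only shows the snapping decision:

```python
import numpy as np

def fake_quant_row(w: np.ndarray, qmax: int = 31) -> np.ndarray:
    """Quantize-dequantize (snap) weights to a symmetric per-row int6 grid."""
    scales = np.abs(w).max(axis=1, keepdims=True) / qmax
    scales = np.where(scales == 0, 1.0, scales)
    return np.clip(np.round(w / scales), -qmax, qmax) * scales

def forward_weights(w: np.ndarray, step: int, total_steps: int,
                    qat_start_frac: float = 0.7) -> np.ndarray:
    """Use snapped weights once training passes qat_start_frac."""
    if step >= qat_start_frac * total_steps:
        return fake_quant_row(w)
    return w

w = np.random.randn(4, 8)
early = forward_weights(w, step=100, total_steps=1000)  # before 70%: untouched
late = forward_weights(w, step=800, total_steps=1000)   # after 70%: snapped
```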
Novel Contributions
- Int6 per-row quantization stacked with zstd level-22 compression
- 3x MLP expansion
- QAT weight-snapping starting at 70% of training
- Muon optimizer tuning with momentum warmup
- Extended warmdown schedule
- Stride-64 sliding window evaluation