val_bpb: 1.1388
Architecture: Transformer
Optimizer: Muon
Artifact size: 15.85 MB
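For reference, bits per byte (val_bpb) is the validation loss converted from nats to bits and normalized by the number of bytes evaluated. A minimal sketch of the standard conversion (not the record's actual evaluation code):

```python
import math

def bits_per_byte(total_nll_nats, total_bytes):
    """Convert a summed negative log-likelihood (in nats) over a
    validation set into bits per byte: divide by ln(2) to get bits,
    then normalize by the byte count."""
    return total_nll_nats / (math.log(2) * total_bytes)
```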
Training Techniques
Quantization
- int6 QAT (bits: 6, scope: all)
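The core of int6 QAT is a fake-quantization op in the forward pass: weights are rounded to one of 64 levels, while the straight-through estimator (STE) treats the op as the identity in the backward pass so gradients still reach the float master weights. A minimal numpy sketch of the symmetric per-tensor forward step (illustrative; not the record's implementation):

```python
import numpy as np

def fake_quant_int6(w):
    """Symmetric per-tensor int6 fake quantization (forward pass only).
    int6 levels span [-32, 31]; weights are scaled, rounded, and
    dequantized so downstream layers see quantized values.  In a
    framework, the STE makes the backward pass treat this as identity."""
    qmax = 2 ** (6 - 1) - 1                      # 31
    scale = np.abs(w).max() / qmax
    if scale == 0:
        return w
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale                             # dequantized weights
```

In an autograd framework the STE is typically expressed as `w + (fake_quant(w) - w).detach()`, so the rounding affects the forward value but not the gradient.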
Architecture
- MLP3x: expanded MLP capacity to 3x size using the space saved by int6 quantization.
- SmearGate: adds a complementary bigram-context signal at the embedding layer.
- BigramHash: adds a bigram-context hashing signal at the embedding layer.
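The BigramHash idea can be sketched as hashing each (previous, current) token pair into a small learned table and adding the looked-up vector to the token embedding. The table size, hash constant, and start-of-sequence convention below are illustrative assumptions, not the record's implementation (SmearGate's gating is likewise not shown):

```python
import numpy as np

def bigram_hash_signal(tokens, table, mult=0x9E3779B1):
    """Return one embedding-sized vector per position by hashing the
    (previous, current) token pair into `table`.  The result would be
    added to the ordinary token embedding as an extra context signal."""
    n_bins, dim = table.shape
    out = np.empty((len(tokens), dim), dtype=table.dtype)
    prev = 0  # assumed placeholder id for "no previous token"
    for i, cur in enumerate(tokens):
        h = (prev * mult + cur) % n_bins
        out[i] = table[h]
        prev = cur
    return out
```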
Initialization
- Orthogonal init: orthogonal weight initialization to accelerate early convergence.
Regularization
- weight decay (weight_decay: 0.04)
Optimizer
- Muon (weight_decay: 0.04, momentum: null, decoupled_weight_decay: true)
Weight Averaging
- SWA (interval_steps: 50, start_fraction: 0.5)
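SWA with these parameters means snapshotting every 50 steps once the run is past 50% of training, and keeping a running mean of the snapshots. A minimal sketch of the schedule check and the incremental averaging rule (surrounding training-loop plumbing assumed):

```python
import numpy as np

def swa_should_update(step, total_steps, interval_steps=50, start_fraction=0.5):
    """Schedule from the record: average every `interval_steps` steps
    once training is past `start_fraction` of `total_steps`."""
    return step >= total_steps * start_fraction and step % interval_steps == 0

def swa_update(avg, w, n_averaged):
    """Incremental running mean over the `n_averaged` snapshots
    collected so far, updated with the new weights `w`."""
    return avg + (w - avg) / (n_averaged + 1)
```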
Evaluation
- sliding window eval (stride: 64)
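Sliding-window evaluation slides a fixed context window over the validation stream in steps of the stride, scoring only the final stride's worth of tokens in each window so every token is predicted once with substantial left context. A sketch of the span bookkeeping (the window size is an illustrative assumption; the record only fixes stride = 64):

```python
def sliding_window_spans(n_tokens, window=512, stride=64):
    """Spans for sliding-window evaluation.  Each (ctx_start,
    score_start, end) tuple means: feed tokens[ctx_start:end] to the
    model, but count loss only for positions [score_start, end).  The
    scored spans tile [0, n_tokens) exactly once."""
    spans = []
    score_start = 0
    while score_start < n_tokens:
        end = min(score_start + stride, n_tokens)
        ctx_start = max(0, end - window)
        spans.append((ctx_start, score_start, end))
        score_start = end
    return spans
```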
Novel Contributions
- Int6 QAT with STE enabled from 30% of training onward to reduce post-training quantization penalty
- 3x MLP expansion funded by the byte savings from int6 quantization
- SmearGate and BigramHash as complementary bigram-context signals at the embedding layer
- Orthogonal initialization and output-projection scaling for faster early convergence
- Muon optimizer with decoupled weight decay of 0.04 to improve quantization quality
- SWA applied at 50-step intervals over the last 50% of training
- Sliding-window evaluation with stride 64