PR #1788

open

Non-record: QAT cooldown + INT4 MLP + NuMuon-lite - 1.12 BPB

by marinabar
val_bpb: 1.1200
Architecture: Transformer
Training Techniques

Quantization
  • QAT (bits: 6, scope: all)
  • mixed int4/int6 (bits: null, scope: MLP and attention)
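The mixed-precision quantization listed above can be illustrated with a minimal fake-quantization sketch in the QAT style (quantize to a signed integer grid, then dequantize back to float). The function name and per-tensor symmetric scaling are assumptions for illustration, not taken from the PR; per the listing, MLP weights get INT4 and attention weights get INT6.

```python
import numpy as np

def fake_quantize(w: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric per-tensor fake quantization: round weights to a
    signed integer grid, then dequantize back to float (QAT-style)."""
    qmax = 2 ** (bits - 1) - 1          # 7 for int4, 31 for int6
    scale = np.abs(w).max() / qmax
    if scale == 0:
        return w
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

# Mixed precision per the listing: INT4 for MLP weights, INT6 for attention.
rng = np.random.default_rng(0)
mlp_w  = rng.standard_normal((8, 8)).astype(np.float32)
attn_w = rng.standard_normal((8, 8)).astype(np.float32)
mlp_q  = fake_quantize(mlp_w, bits=4)   # at most 16 distinct levels
attn_q = fake_quantize(attn_w, bits=6)  # at most 64 distinct levels
```

During QAT the forward pass uses the fake-quantized weights while gradients flow to the full-precision copies (straight-through estimator), so the network learns to tolerate the coarser INT4 grid in the MLPs.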
Regularization
  • weight decay (parameters: null)
Other
  • Frobenius-norm penalty applied every 50 steps to encourage low-rank structure for better downstream compression (parameters: {"interval_steps": 50, "type": "Frobenius-norm penalty"})
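The interval-applied Frobenius-norm penalty can be sketched as an extra gradient term added once every 50 steps. The penalty strength, learning rate, and function names below are hypothetical; only the 50-step interval comes from the listing.

```python
import numpy as np

FROB_INTERVAL = 50    # interval_steps from the listing
FROB_LAMBDA = 1e-4    # hypothetical penalty strength, not given in the PR

def frobenius_penalty_grad(w: np.ndarray, lam: float = FROB_LAMBDA) -> np.ndarray:
    # Gradient of lam * ||W||_F^2 with respect to W is 2 * lam * W.
    return 2.0 * lam * w

def training_step(step: int, w: np.ndarray, grad: np.ndarray,
                  lr: float = 0.01) -> np.ndarray:
    """One SGD step; the Frobenius penalty is added only every
    FROB_INTERVAL steps, matching the PR's interval_steps=50."""
    g = grad
    if step % FROB_INTERVAL == 0:
        g = g + frobenius_penalty_grad(w)
    return w - lr * g
```

Applied this way, the penalty acts as an occasional pull of all weights toward zero, which the PR claims nudges the weight matrices toward lower effective rank and better GPTQ+Brotli compressibility.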
Optimizer
  • Muon (weight_decay: null, momentum: null, other_params: {"variant": "NuMuon-lite"})
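Muon-style optimizers orthogonalize the momentum update via a Newton-Schulz iteration before applying it to the weights. The details of the "NuMuon-lite" variant are not given in the PR, so the sketch below shows only the generic cubic Newton-Schulz orthogonalization that the Muon family is built on; the function name and step count are illustrative.

```python
import numpy as np

def newton_schulz_orthogonalize(g: np.ndarray, steps: int = 30) -> np.ndarray:
    """Approximate the orthogonal (polar) factor of g with the cubic
    Newton-Schulz iteration X <- 1.5*X - 0.5*X @ X.T @ X.
    Normalizing by the Frobenius norm first keeps the spectral norm
    <= 1, which is inside the iteration's convergence region."""
    x = g / np.linalg.norm(g)
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x
```

The effect is that every singular value of the update is pushed toward 1, so the optimizer applies a "whitened" update whose direction, not magnitude, carries the gradient information.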

Novel Contributions

  • QAT fused into the cooldown phase instead of applying GPTQ only after training
  • Mixed precision with INT4 MLP weights and INT6 attention weights
  • NuMuon-lite Frobenius-norm regularization to encourage low-rank structure and improve GPTQ+Brotli compression
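The headline idea, fusing QAT into the cooldown phase rather than running GPTQ only after training, amounts to switching fake quantization on exactly when the learning-rate cooldown begins. A minimal schedule sketch, with the cooldown fraction, base learning rate, and function name all hypothetical (the PR does not specify them):

```python
def lr_and_qat(step: int, total_steps: int, base_lr: float = 3e-4,
               cooldown_frac: float = 0.2) -> tuple[float, bool]:
    """Linear LR cooldown over the last `cooldown_frac` of training;
    fake quantization (QAT) turns on at the start of the cooldown."""
    cooldown_start = int(total_steps * (1 - cooldown_frac))
    if step < cooldown_start:
        return base_lr, False                 # full-precision phase
    frac = (step - cooldown_start) / (total_steps - cooldown_start)
    return base_lr * (1 - frac), True         # QAT active during cooldown
```

The training loop would then quantize weights on the forward pass whenever the returned flag is true, so the model adapts to the INT4/INT6 grids while the learning rate decays, leaving nothing for a separate post-training GPTQ pass to undo.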