PR #107

open

Int6+zstd MLP1488 + Sliding Window + QAT + Tuned LR (val_bpb=1.1648)

val_bpb: 1.1648
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.93 MB

Training Techniques

Quantization
  • mixed int6 quantization (bits: 6; scope: MLP/Q/V/proj weights)
  • fp16 (bits: 16; scope: tied embeddings)
  • STE QAT (bits: n/a; scope: post-training quantization-aware training)
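A minimal numpy sketch of the quantization scheme above: symmetric fake quantization to int6, which is also the forward pass used during the STE QAT phase (the straight-through estimator simply copies gradients past the rounding step). Per-tensor scale granularity is an assumption; the PR does not state whether scales are per-tensor or per-channel.

```python
import numpy as np

def fake_quant_int6(w, bits=6):
    """Symmetric per-tensor fake quantization: round to int6 codes, dequantize back.

    During QAT the forward pass uses these quantized values, while the
    straight-through estimator (STE) treats round() as the identity in the
    backward pass, so gradients flow to the underlying fp weights unchanged.
    """
    qmax = 2 ** (bits - 1) - 1           # 31 for int6
    scale = np.abs(w).max() / qmax       # one scale per tensor (assumption)
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale, scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4)).astype(np.float32)
w_q, scale = fake_quant_int6(w)
```

The tied embeddings are deliberately excluded from this path and stored in fp16, per the scope list above.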
Architecture
  • MLP3x (hidden_size: 1488): widened MLP hidden size to improve capacity under the artifact budget.
  • tied embeddings: kept the tied embedding/output head in fp16 instead of quantizing it.
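Back-of-envelope arithmetic for why int6 packing buys headroom for the wider MLP. Only hidden_size=1488 comes from the PR; d_model, depth, and the byte-level vocab below are hypothetical stand-ins (the bpb metric suggests a small byte-level vocabulary), and the estimate ignores scales, norms, and the zstd pass.

```python
def artifact_mb(n_params_quant, n_params_fp16, bits=6):
    """Rough artifact size: int6-packed weights plus fp16 tied embeddings.

    Ignores per-tensor scales and zstd compression, so this is a pre-compression
    estimate, not the final 15.93 MB figure.
    """
    return (n_params_quant * bits / 8 + n_params_fp16 * 2) / 2**20

# Hypothetical shapes -- the PR reports hidden_size=1488 but not d_model/depth/vocab.
d_model, d_hidden, n_layer, vocab = 512, 1488, 8, 256
mlp = n_layer * 2 * d_model * d_hidden    # up + down projections
attn = n_layer * 4 * d_model * d_model    # Q/K/V/out projections
tied_embed = vocab * d_model              # kept in fp16, not quantized

size_mb = round(artifact_mb(mlp + attn, tied_embed), 2)
```

At int8 the same weights would need a third more space, which is the headroom the wider MLP spends.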
Evaluation
  • sliding window eval (stride: 64, seq_len: 2048)
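The sliding-window evaluation above can be sketched as a span generator: the first window scores all of its positions, and each later window advances by the stride and scores only its newest tokens, so every scored token conditions on up to seq_len - stride of context. This is a standard formulation of strided perplexity evaluation; the exact bookkeeping in the PR may differ.

```python
def sliding_windows(n_tokens, seq_len=2048, stride=64):
    """Yield (ctx_start, end, n_scored): feed tokens[ctx_start:end] to the
    model and score only the last n_scored positions (the ones not yet scored)."""
    end = min(seq_len, n_tokens)
    prev_end = 0
    while True:
        ctx_start = max(0, end - seq_len)
        yield ctx_start, end, end - prev_end
        if end == n_tokens:
            break
        prev_end = end
        end = min(end + stride, n_tokens)

spans = list(sliding_windows(4096))
```

With stride 64 this costs roughly seq_len / stride = 32 forward passes per scored chunk, the price paid for the extra context per position.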
Sequence Length
  • sequence_length (train: 2048, eval: 2048)
LR Schedule
  • warmdown (warmdown_iters: 3000)
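A sketch of the warmdown schedule, assuming the common trapezoidal shape (constant LR, then linear decay to zero over the final warmdown_iters steps); the PR only specifies warmdown_iters=3000, so the flat-then-linear shape is an assumption.

```python
def lr_scale(step, total_iters, warmdown_iters=3000):
    """Trapezoidal LR multiplier (assumed shape): 1.0 until the final
    warmdown_iters steps, then linear decay to 0 at total_iters."""
    if step < total_iters - warmdown_iters:
        return 1.0
    return (total_iters - step) / warmdown_iters
```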
Optimizer
  • Muon (momentum: 0.99; weight_decay: none)
  • other params: momentum_warmup_start: 0.92, momentum_warmup_steps: 1500, matrix_lr: 0.02, scalar_lr: 0.02, tied_embed_lr: 0.03, grad_clip_norm: 0.3, qk_gain_init: 1.7
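The momentum warmup implied by the config values above (0.92 → 0.99 over 1500 steps) can be sketched as a simple linear ramp; linear interpolation is an assumption, as the PR only lists the endpoints and step count.

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    """Linearly warm Muon's momentum from `start` to `end` over warmup_steps
    (endpoint values taken from this PR's config), constant afterwards."""
    frac = min(step / warmup_steps, 1.0)
    return start + frac * (end - start)
```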
Compression
  • zstd (level: 22)
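The artifact is compressed with zstd at level 22 (in Python typically via the third-party `zstandard` package). The sketch below uses stdlib zlib as a stand-in so it has no external dependency; the point it illustrates, that low-entropy packed int6 codes compress well, carries over to zstd.

```python
import zlib

def compress_artifact(packed_bytes, level=9):
    """Compress serialized weights. The PR uses zstd level 22; stdlib zlib
    (max level 9) stands in here purely for illustration."""
    return zlib.compress(packed_bytes, level)

# Int6 codes occupy at most 64 distinct values, so a quantized payload
# has low byte entropy and compresses well.
payload = bytes([i % 64 for i in range(1 << 16)])
blob = compress_artifact(payload)
```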
Regularization
  • gradient clipping (norm: 0.3)
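Global-norm gradient clipping at 0.3, sketched in plain Python over nested lists for clarity (a real run would use the framework's built-in, e.g. torch.nn.utils.clip_grad_norm_):

```python
import math

def clip_grad_norm(grads, max_norm=0.3):
    """Global-norm gradient clipping (norm: 0.3 per this PR): if the joint L2
    norm of all gradients exceeds max_norm, scale every gradient down uniformly."""
    total = math.sqrt(sum(g * g for vec in grads for g in vec))
    if total <= max_norm:
        return grads, total
    scale = max_norm / total
    return [[g * scale for g in vec] for vec in grads], total
```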
Other
  • Fallback from FA3 to SDPA when FA3 is unavailable.
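The fallback pattern above, sketched with a try/except around the FA3 import and a plain scaled-dot-product attention as the backstop. The `flash_attn_interface` import path is an assumption about the FA3 package layout, and the SDPA here is a single-head, list-of-lists toy; the PR's real fallback would go through the framework's attention op.

```python
import math

def sdpa(q, k, v):
    """Plain scaled-dot-product attention fallback (single head, list-of-rows)."""
    d = len(q[0])
    scores = [[sum(qi * ki for qi, ki in zip(qr, kr)) / math.sqrt(d) for kr in k]
              for qr in q]
    out = []
    for row in scores:
        m = max(row)                       # stabilize softmax
        e = [math.exp(s - m) for s in row]
        z = sum(e)
        p = [x / z for x in e]
        out.append([sum(pi * vr[j] for pi, vr in zip(p, v))
                    for j in range(len(v[0]))])
    return out

def attention(q, k, v):
    """Use the FA3 kernel when importable; otherwise fall back to SDPA,
    mirroring this PR's robustness fix."""
    try:
        from flash_attn_interface import flash_attn_func  # FA3 (assumed path)
    except ImportError:
        return sdpa(q, k, v)
    return flash_attn_func(q, k, v)
```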

Novel Contributions

  • Mixed int6 quantization of MLP/Q/V/proj weights to fit a larger model under the artifact budget
  • Wider MLP hidden size (1488) enabled by quantization savings
  • Sliding-window evaluation with stride 64 to use more context per scored position
  • Post-training QAT with STE to reduce quantization penalty
  • Tuned learning rates for matrix, scalar, and tied embedding parameters
  • Kept tied embedding in fp16 to avoid quantizing the most sensitive tensor
  • Longer warmdown schedule to better match the short training budget
  • FA3 fallback to SDPA for robustness