val_bpb: 1.1605
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.28 MB
Training Techniques

Quantization: int6
- bits: 6
- scope: all large 2D weight matrices
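A minimal sketch of the quantizer, assuming symmetric per-row scaling (the report specifies 6 bits over all large 2D weight matrices and per-row quantization; the exact rounding and clipping details below are illustrative):

```python
def quantize_row_int6(row):
    """Symmetric per-row quantization to 6-bit integers in [-31, 31].

    Assumed scheme: one float scale per row, values rounded to the
    nearest representable level. Only the bit width and per-row scope
    come from the report; everything else is illustrative.
    """
    qmax = 2 ** (6 - 1) - 1                      # 31 levels on each side of zero
    scale = max(abs(v) for v in row) / qmax or 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in row]
    return q, scale

def dequantize_row(q, scale):
    return [v * scale for v in q]

# Quantize a toy 2D weight matrix row by row.
weights = [[0.12, -0.5, 0.31], [1.0, -0.02, 0.77]]
quantized = [quantize_row_int6(row) for row in weights]
restored = [dequantize_row(q, s) for q, s in quantized]
```

Per-row scales keep the quantization error of each row proportional to that row's own magnitude, which matters when weight magnitudes vary widely across rows.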
Architecture

MLP3x
- Expanded the MLP hidden size from the baseline 1024 to 1536 (a 1.5x increase), enabled by the int6 artifact savings.
- parameters: {"MLP_HIDDEN": 1536}
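The artifact-budget arithmetic behind the wider MLP can be checked directly. This sketch uses a placeholder model width of 512 (not stated in the report) and compares an fp16 baseline against the int6 export:

```python
def mlp_artifact_bits(model_dim, hidden, bits_per_weight):
    """Artifact cost of one MLP block, assuming the usual two weight
    matrices (model_dim -> hidden and hidden -> model_dim).
    model_dim=512 below is an illustrative placeholder."""
    return 2 * model_dim * hidden * bits_per_weight

baseline = mlp_artifact_bits(512, 1024, 16)   # fp16, hidden size 1024
widened  = mlp_artifact_bits(512, 1536, 6)    # int6, hidden size 1536
# int6 at hidden 1536 is still smaller than fp16 at hidden 1024,
# so the wider MLP fits inside the quantization savings.
```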
MTP auxiliary head
- Added a training-only multi-token prediction head that predicts token i+2 from hidden state i; it is excluded from the exported artifact.
- parameters: {"num_heads": 1, "loss_weight": 0.01}
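The target alignment and loss mixing for the auxiliary head can be sketched in plain Python. This is a toy illustration of the bookkeeping only; the actual head is a learned projection over the transformer's hidden states:

```python
def mtp_targets(tokens, offset=2):
    """Pair each position i with the token `offset` steps ahead.

    For the auxiliary head, position i is trained to predict token
    i+offset, so the final `offset` positions have no target.
    """
    return list(zip(tokens[:-offset], tokens[offset:]))

def combined_loss(main_loss, mtp_loss, loss_weight=0.01):
    # The auxiliary loss is down-weighted (0.01 per the report) and only
    # shapes training; the head itself never reaches the exported artifact.
    return main_loss + loss_weight * mtp_loss

pairs = mtp_targets([5, 9, 2, 7, 4])
# each (input_token, target_token) pair skips one position:
# (5, 2), (9, 7), (2, 4)
```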
Tied embeddings
- Kept the tied embedding matrix in fp16 during export instead of quantizing it.
- parameters: {"fp16_export": 1}
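One way to picture the export rule is a per-tensor dtype dispatch. The tensor names and the "large" threshold below are assumptions; only the int6-for-large-2D-matrices and fp16-for-tied-embeddings decisions come from the report:

```python
def export_dtype(name, shape):
    """Choose an export format per tensor (illustrative rule).

    Large 2D weight matrices go to int6; the tied embedding matrix is
    kept in fp16 to avoid quantization error on the shared
    input/output embedding; everything else stays fp16.
    """
    if name == "tied_embedding":              # assumed tensor name
        return "fp16"
    if len(shape) == 2 and min(shape) >= 256: # "large" threshold assumed
        return "int6"
    return "fp16"                             # norms, biases, small tensors
```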
Optimizer: Muon
- weight_decay: none
- momentum: 0.99
- other_params: {"matrix_lr": 0.02, "scalar_lr": 0.02, "tied_embed_lr": 0.03, "muon_momentum_warmup_steps": 1500, "muon_momentum_warmup_start": 0.92}
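The momentum warmup implied by `muon_momentum_warmup_steps` and `muon_momentum_warmup_start` can be sketched as a simple schedule. A linear ramp is assumed; the report gives only the endpoints (0.92 to 0.99) and the step count (1500):

```python
def muon_momentum(step, warmup_steps=1500, start=0.92, end=0.99):
    """Ramp Muon's momentum from `start` to `end` over `warmup_steps`,
    then hold at `end`. The linear shape is an assumption."""
    if step >= warmup_steps:
        return end
    return start + (end - start) * step / warmup_steps
```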
Compression: zstd
- level: 22
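The pack-then-compress step looks roughly like the following. The actual artifact uses zstd at level 22 (e.g. via the `zstandard` package); zlib stands in here so the sketch stays stdlib-only, and the byte layout (scales first, then values) is illustrative:

```python
import struct
import zlib

def pack_and_compress(q_rows, scales):
    """Serialize per-row scales and int6 values, then compress.

    Values are stored one per byte for simplicity; a real exporter
    would bit-pack four 6-bit values into three bytes. zlib is a
    stand-in for zstd level 22 used in the actual pipeline.
    """
    payload = b"".join(struct.pack("f", s) for s in scales)
    payload += bytes((v + 32) & 0x3F for row in q_rows for v in row)
    return zlib.compress(payload, 9)
```

Entropy coding works well after quantization because the int6 values are drawn from a narrow, peaked distribution, which is why compression is paired with quantization in the artifact-size budget.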
Evaluation: sliding window eval
- parameters: {"stride": 512}
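The window bookkeeping for strided evaluation can be sketched as follows: each window sees up to a full 4096-token context, but only the tokens not already scored by the previous window contribute to the loss, so every token is scored exactly once with near-full left context. The report gives only the stride (512); the exact bookkeeping is an assumption:

```python
def sliding_windows(n_tokens, context=4096, stride=512):
    """Return (window_start, first_scored, window_end) spans.

    Windows advance by `stride`; tokens in [first_scored, window_end)
    are scored, so scored spans tile [0, n_tokens) with no overlap.
    """
    spans = []
    start, prev_end = 0, 0
    while prev_end < n_tokens:
        end = min(start + context, n_tokens)
        spans.append((start, prev_end, end))
        prev_end = end
        start += stride
    return spans
```

With stride 512 and context 4096, every scored token after the first window has at least 3584 tokens of left context, which is what makes the evaluation "near-full-context".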
Sequence Length
- train_length: 4096
- eval_length: 4096
LR Schedule: warmdown
- parameters: {"warmdown_steps": 3000}
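A warmdown schedule of this kind is typically flat until the final stretch, then decays linearly to zero. That trapezoidal shape is assumed here; the report specifies only the 3000-step warmdown length:

```python
def lr_scale(step, total_steps, warmdown_steps=3000):
    """Multiplier on the base learning rate: 1.0 until the last
    `warmdown_steps`, then linear decay to 0.0 (shape assumed)."""
    remaining = total_steps - step
    if remaining >= warmdown_steps:
        return 1.0
    return max(0.0, remaining / warmdown_steps)
```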
Other
- Co-optimized training dynamics (lower learning rate, higher momentum, longer warmdown) to improve int6 quantization behavior.
- parameters: {"matrix_lr": 0.02, "muon_momentum": 0.99, "warmdown_iters": 3000}
Novel Contributions
- Int6 per-row quantization with zstd-22 compression to reduce artifact size
- 3x wider MLP enabled by quantization savings
- Training-only MTP auxiliary head excluded from the artifact
- FP16 tied embedding passthrough to avoid quantization error on shared embeddings
- Sliding window evaluation with stride 512 for near-full-context scoring
- Long-context training at sequence length 4096
- Training dynamics tuned for better int6 quantization behavior