PR #88

open

Record: Int6 MLP3x + MTP + Sliding Window Eval (val_bpb=1.1605)

by seanward
val_bpb: 1.1605
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.28 MB

Training Techniques

Quantization
int6
bits: 6
scope: all large 2D weight matrices
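The per-row int6 scheme can be sketched as follows: one fp16 scale per row maps that row's largest magnitude onto the 6-bit signed range. Rounding mode and zero-point handling are assumptions; the record only states bits=6 with per-row scales.

```python
import numpy as np

def quantize_int6_per_row(w: np.ndarray):
    """Symmetric per-row int6 quantization (a sketch; the exact
    rounding and zero-point choices in the PR are assumed)."""
    # One scale per row: map each row's max magnitude onto [-31, 31].
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    scale[scale == 0] = 1.0                      # avoid divide-by-zero on all-zero rows
    q = np.clip(np.rint(w / scale), -32, 31).astype(np.int8)
    return q, scale.astype(np.float16)           # fp16 scales keep the overhead small

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale.astype(np.float32)

np.random.seed(0)
w = np.random.randn(4, 8).astype(np.float32)
q, s = quantize_int6_per_row(w)
err = np.abs(dequantize(q, s) - w).max()         # bounded by ~half a quantization step
```

Per-row scaling keeps the worst-case error proportional to each row's own magnitude, which matters when weight rows have very different norms.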
Architecture
MLP3x
Expanded the MLP hidden size from the baseline 1024 to 1536 (3x the model dimension), enabled by int6 artifact savings.
parameters: {"MLP_HIDDEN":1536}
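As a back-of-envelope check that the int6 savings pay for the wider MLP (d_model = 512 is an assumption here; the record does not state it):

```python
# Artifact cost of one block's MLP: up-projection (D x hidden)
# plus down-projection (hidden x D), at a given bits-per-parameter.
# D = 512 is an assumed model dimension, not stated in the record.
D = 512

def mlp_bytes(hidden: int, bits: float) -> float:
    return 2 * D * hidden * bits / 8

fp16_baseline = mlp_bytes(1024, 16)   # baseline hidden size stored in fp16
int6_expanded = mlp_bytes(1536, 6)    # 3x-wide hidden size stored as int6
assert int6_expanded < fp16_baseline  # the quantization pays for the expansion
```

Under these assumptions the expanded int6 MLP is roughly 44% smaller than the fp16 baseline, so the extra capacity is free in artifact terms.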
MTP auxiliary head
Added a training-only multi-token prediction head that predicts token i+2 from the hidden state at position i; it is excluded from the exported artifact.
parameters: {"num_heads":1,"loss_weight":0.01}
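The loss wiring can be sketched with plain numpy: the main head scores token i+1 at position i, the auxiliary head scores token i+2, and the auxiliary loss is added with weight 0.01. The head architecture itself is not specified in the record; the shapes below are illustrative.

```python
import numpy as np

def cross_entropy(logits: np.ndarray, targets: np.ndarray) -> float:
    # Mean token-level cross-entropy (softmax over the vocab axis).
    z = logits - logits.max(axis=-1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return float(-logp[np.arange(len(targets)), targets].mean())

T, V = 8, 16                                    # illustrative sizes
rng = np.random.default_rng(0)
logits_main = rng.normal(size=(T, V))           # main head: predicts token i+1
logits_mtp = rng.normal(size=(T, V))            # MTP head: predicts token i+2
tokens = rng.integers(0, V, size=T + 2)

main_loss = cross_entropy(logits_main, tokens[1:T + 1])   # shift by 1
mtp_loss = cross_entropy(logits_mtp, tokens[2:T + 2])     # shift by 2
total_loss = main_loss + 0.01 * mtp_loss                  # loss_weight from the record
```

Because the head only shapes the training gradient, dropping it at export time costs nothing at inference.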
tied embeddings
Kept the tied embedding matrix in fp16 during export instead of quantizing it.
parameters: {"fp16_export":1}
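The export rule this implies might look like the sketch below. The tensor names and the size threshold for "large" are assumptions, not taken from the PR.

```python
import numpy as np

def export_dtype(name: str, w: np.ndarray) -> str:
    """Decide the export format for a tensor (a sketch; names and the
    'large' threshold are illustrative assumptions)."""
    if name == "tied_embedding":
        return "fp16"            # passthrough: avoid quantization error on shared weights
    if w.ndim == 2 and w.size >= 1 << 16:
        return "int6"            # 'all large 2D weight matrices' per the record
    return "fp16"                # small tensors: norms, scales, biases
```

Since the embedding is tied between input and output, quantization error there would hit both the first and last layers at once, which is presumably why it is exempted.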
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"matrix_lr":0.02,"scalar_lr":0.02,"tied_embed_lr":0.03,"muon_momentum_warmup_steps":1500,"muon_momentum_warmup_start":0.92}
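The momentum warmup in `other_params` ramps Muon's momentum from 0.92 to 0.99 over the first 1500 steps. A linear ramp is assumed below; the record gives only the endpoints and the step count.

```python
def muon_momentum(step: int,
                  warmup_steps: int = 1500,
                  start: float = 0.92,
                  end: float = 0.99) -> float:
    """Warm Muon's momentum from `start` to `end` over `warmup_steps`,
    then hold it constant (linear shape is an assumption)."""
    if step >= warmup_steps:
        return end
    return start + (step / warmup_steps) * (end - start)
```

Starting with lower momentum keeps early updates responsive while gradients are still noisy, then settles at the higher value for the long run.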
Compression
zstd
level: 22
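Before zstd sees the artifact, the int6 codes have to be bit-packed, since storing one code per byte would waste 2 bits per weight. A sketch of packing four 6-bit codes into three bytes (the PR's actual on-disk layout is assumed):

```python
import numpy as np

def pack_int6(codes: np.ndarray) -> bytes:
    """Pack signed int6 codes (range [-32, 31]) at 4 codes per 3 bytes.
    A sketch; the PR's actual bit layout is assumed."""
    assert codes.size % 4 == 0
    u = (codes.astype(np.int16) + 32).astype(np.uint32)  # shift to unsigned [0, 63]
    a, b, c, d = u.reshape(-1, 4).T
    word = (a << 18) | (b << 12) | (c << 6) | d          # 4 x 6 bits = 24 bits
    out = np.stack([(word >> 16) & 0xFF, (word >> 8) & 0xFF, word & 0xFF], axis=1)
    return out.astype(np.uint8).tobytes()

def unpack_int6(buf: bytes) -> np.ndarray:
    raw = np.frombuffer(buf, dtype=np.uint8).reshape(-1, 3).astype(np.uint32)
    word = (raw[:, 0] << 16) | (raw[:, 1] << 8) | raw[:, 2]
    u = np.stack([(word >> 18) & 63, (word >> 12) & 63,
                  (word >> 6) & 63, word & 63], axis=1)
    return (u.astype(np.int16) - 32).ravel()

codes = np.arange(-32, 32, dtype=np.int16)   # all 64 int6 values
packed = pack_int6(codes)                    # 64 codes -> 48 bytes
```

The packed buffer would then be compressed with zstd at level 22, e.g. via the `zstandard` package's `ZstdCompressor(level=22)`.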
Evaluation
sliding window eval
parameters: {"stride":512}
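The sliding-window eval follows the standard strided-perplexity recipe: advance the 4096-token window by 512 tokens at a time and score only the tokens not yet covered, so every token (after the first window) is predicted with at least 3584 tokens of context. The span-planning logic can be sketched as:

```python
def sliding_windows(n_tokens: int, window: int = 4096, stride: int = 512):
    """Plan (begin, end, score_from) spans for strided eval: each token is
    scored exactly once, in the window giving it the most left context
    (a sketch of the standard recipe; exact details in the PR are assumed)."""
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, prev_end))  # compute loss over [prev_end, end)
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

Compared with scoring disjoint 4096-token chunks, this removes the context cliff at chunk boundaries at the cost of roughly `window / stride` times more forward passes.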
Sequence Length
train_length: 4096
eval_length: 4096
LR Schedule
warmdown
parameters: {"warmdown_steps":3000}
Other
Co-optimized training dynamics with a lower learning rate, higher momentum, and a longer warmdown to improve int6 quantization behavior.
parameters: {"matrix_lr":0.02,"muon_momentum":0.99,"warmdown_iters":3000}

Novel Contributions

  • Int6 per-row quantization with zstd-22 compression to reduce artifact size
  • 3x wider MLP enabled by quantization savings
  • Training-only MTP auxiliary head excluded from the artifact
  • FP16 tied embedding passthrough to avoid quantization error on shared embeddings
  • Sliding window evaluation with stride 512 for near-full-context scoring
  • Long-context training at sequence length 4096
  • Training dynamics tuned for better int6 quantization behavior