PR #107

open

Int6+zstd MLP1488 + Sliding Window + QAT + Tuned LR (val_bpb=1.1648)

val_bpb: 1.1648
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.93 MB

Training Techniques

Quantization
  • mixed int6 quantization (bits: 6; scope: MLP/Q/V/proj weights)
  • fp16 (bits: 16; scope: tied embeddings)
  • STE QAT (bits: n/a; scope: post-training quantization-aware training)
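A minimal numpy sketch of the quantization scheme above: symmetric fake quantization to int6, which is also the forward pass used during the STE QAT phase (the straight-through estimator simply copies gradients past the rounding step). Per-tensor scale granularity is an assumption; the PR does not state whether scales are per-tensor or per-channel.

```python
import numpy as np

def fake_quant_int6(w, bits=6):
    """Symmetric per-tensor fake quantization: round to int6 codes, dequantize back.

    During QAT the forward pass uses these quantized values, while the
    straight-through estimator (STE) treats round() as the identity in the
    backward pass, so gradients flow to the underlying fp weights unchanged.
    """
    qmax = 2 ** (bits - 1) - 1           # 31 for int6
    scale = np.abs(w).max() / qmax       # one scale per tensor (assumption)
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale, scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4)).astype(np.float32)
w_q, scale = fake_quant_int6(w)
```

The tied embeddings are deliberately excluded from this path and stored in fp16, per the scope list above.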
Architecture
  • MLP3x (hidden_size: 1488): widened MLP hidden size to improve capacity under the artifact budget.
  • tied embeddings: kept the tied embedding/output head in fp16 instead of quantizing it.
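Back-of-envelope arithmetic for why int6 packing buys headroom for the wider MLP. Only hidden_size=1488 comes from the PR; d_model, depth, and the byte-level vocab below are hypothetical stand-ins (the bpb metric suggests a small byte-level vocabulary), and the estimate ignores scales, norms, and the zstd pass.

```python
def artifact_mb(n_params_quant, n_params_fp16, bits=6):
    """Rough artifact size: int6-packed weights plus fp16 tied embeddings.

    Ignores per-tensor scales and zstd compression, so this is a pre-compression
    estimate, not the final 15.93 MB figure.
    """
    return (n_params_quant * bits / 8 + n_params_fp16 * 2) / 2**20

# Hypothetical shapes -- the PR reports hidden_size=1488 but not d_model/depth/vocab.
d_model, d_hidden, n_layer, vocab = 512, 1488, 8, 256
mlp = n_layer * 2 * d_model * d_hidden    # up + down projections
attn = n_layer * 4 * d_model * d_model    # Q/K/V/out projections
tied_embed = vocab * d_model              # kept in fp16, not quantized

size_mb = round(artifact_mb(mlp + attn, tied_embed), 2)
```

At int8 the same weights would need a third more space, which is the headroom the wider MLP spends.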
Evaluation
  • sliding window eval (stride: 64, seq_len: 2048)
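The sliding-window evaluation above can be sketched as a span generator: the first window scores all of its positions, and each later window advances by the stride and scores only its newest tokens, so every scored token conditions on up to seq_len - stride of context. This is a standard formulation of strided perplexity evaluation; the exact bookkeeping in the PR may differ.

```python
def sliding_windows(n_tokens, seq_len=2048, stride=64):
    """Yield (ctx_start, end, n_scored): feed tokens[ctx_start:end] to the
    model and score only the last n_scored positions (the ones not yet scored)."""
    end = min(seq_len, n_tokens)
    prev_end = 0
    while True:
        ctx_start = max(0, end - seq_len)
        yield ctx_start, end, end - prev_end
        if end == n_tokens:
            break
        prev_end = end
        end = min(end + stride, n_tokens)

spans = list(sliding_windows(4096))
```

With stride 64 this costs roughly seq_len / stride = 32 forward passes per scored chunk, the price paid for the extra context per position.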
Sequence Length
  • sequence_length (train: 2048, eval: 2048)
LR Schedule
  • warmdown (warmdown_iters: 3000)
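A sketch of the warmdown schedule, assuming the common trapezoidal shape (constant LR, then linear decay to zero over the final warmdown_iters steps); the PR only specifies warmdown_iters=3000, so the flat-then-linear shape is an assumption.

```python
def lr_scale(step, total_iters, warmdown_iters=3000):
    """Trapezoidal LR multiplier (assumed shape): 1.0 until the final
    warmdown_iters steps, then linear decay to 0 at total_iters."""
    if step < total_iters - warmdown_iters:
        return 1.0
    return (total_iters - step) / warmdown_iters
```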
Optimizer
  • Muon (momentum: 0.99; weight_decay: none)
  • other params: momentum_warmup_start: 0.92, momentum_warmup_steps: 1500, matrix_lr: 0.02, scalar_lr: 0.02, tied_embed_lr: 0.03, grad_clip_norm: 0.3, qk_gain_init: 1.7
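The momentum warmup implied by the config values above (0.92 → 0.99 over 1500 steps) can be sketched as a simple linear ramp; linear interpolation is an assumption, as the PR only lists the endpoints and step count.

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    """Linearly warm Muon's momentum from `start` to `end` over warmup_steps
    (endpoint values taken from this PR's config), constant afterwards."""
    frac = min(step / warmup_steps, 1.0)
    return start + frac * (end - start)
```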
Compression
  • zstd (level: 22)
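The artifact is compressed with zstd at level 22 (in Python typically via the third-party `zstandard` package). The sketch below uses stdlib zlib as a stand-in so it has no external dependency; the point it illustrates, that low-entropy packed int6 codes compress well, carries over to zstd.

```python
import zlib

def compress_artifact(packed_bytes, level=9):
    """Compress serialized weights. The PR uses zstd level 22; stdlib zlib
    (max level 9) stands in here purely for illustration."""
    return zlib.compress(packed_bytes, level)

# Int6 codes occupy at most 64 distinct values, so a quantized payload
# has low byte entropy and compresses well.
payload = bytes([i % 64 for i in range(1 << 16)])
blob = compress_artifact(payload)
```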
Regularization
  • gradient clipping (norm: 0.3)
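Global-norm gradient clipping at 0.3, sketched in plain Python over nested lists for clarity (a real run would use the framework's built-in, e.g. torch.nn.utils.clip_grad_norm_):

```python
import math

def clip_grad_norm(grads, max_norm=0.3):
    """Global-norm gradient clipping (norm: 0.3 per this PR): if the joint L2
    norm of all gradients exceeds max_norm, scale every gradient down uniformly."""
    total = math.sqrt(sum(g * g for vec in grads for g in vec))
    if total <= max_norm:
        return grads, total
    scale = max_norm / total
    return [[g * scale for g in vec] for vec in grads], total
```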
Other
  • Fallback from FA3 to SDPA when FA3 is unavailable.
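The fallback pattern above, sketched with a try/except around the FA3 import and a plain scaled-dot-product attention as the backstop. The `flash_attn_interface` import path is an assumption about the FA3 package layout, and the SDPA here is a single-head, list-of-lists toy; the PR's real fallback would go through the framework's attention op.

```python
import math

def sdpa(q, k, v):
    """Plain scaled-dot-product attention fallback (single head, list-of-rows)."""
    d = len(q[0])
    scores = [[sum(qi * ki for qi, ki in zip(qr, kr)) / math.sqrt(d) for kr in k]
              for qr in q]
    out = []
    for row in scores:
        m = max(row)                       # stabilize softmax
        e = [math.exp(s - m) for s in row]
        z = sum(e)
        p = [x / z for x in e]
        out.append([sum(pi * vr[j] for pi, vr in zip(p, v))
                    for j in range(len(v[0]))])
    return out

def attention(q, k, v):
    """Use the FA3 kernel when importable; otherwise fall back to SDPA,
    mirroring this PR's robustness fix."""
    try:
        from flash_attn_interface import flash_attn_func  # FA3 (assumed path)
    except ImportError:
        return sdpa(q, k, v)
    return flash_attn_func(q, k, v)
```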

Novel Contributions

  • Mixed int6 quantization of MLP/Q/V/proj weights to fit a larger model under the artifact budget
  • Wider MLP hidden size (1488) enabled by quantization savings
  • Sliding-window evaluation with stride 64 to use more context per scored position
  • Post-training QAT with STE to reduce quantization penalty
  • Tuned learning rates for matrix, scalar, and tied embedding parameters
  • Kept tied embedding in fp16 to avoid quantizing the most sensitive tensor
  • Longer warmdown schedule to better match the short training budget
  • FA3 fallback to SDPA for robustness