PR #128

open

Record: Int6 MLP3x + STE QAT + Sliding Window (val_bpb=1.1594)

by rsavitt
val_bpb: 1.1594
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,162,777 bytes

Training Techniques

Quantization
int6
bits: 6
scope: MLP and attention weights; tied embeddings kept fp16
STE QAT
bits: 6
scope: weights
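The PR does not include the QAT code itself; below is a minimal numpy sketch of per-row symmetric int6 fake quantization, i.e. the forward pass of STE QAT. During training, the straight-through estimator treats the round/clip as identity in the backward pass so gradients flow to the fp weights. The function name and the symmetric [-31, 31] grid are assumptions.

```python
import numpy as np

def fake_quant_int6(w, eps=1e-8):
    """Per-row symmetric fake int6 quantization (STE QAT forward pass).

    Uses the symmetric int6 range [-31, 31] (assumption; -32 is unused to
    keep the grid symmetric). In training, the backward pass would pass
    gradients straight through the round/clip (straight-through estimator).
    """
    qmax = 31  # 2**(6-1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax + eps  # per-row scale
    q = np.clip(np.round(w / scale), -qmax, qmax)              # snap to int6 grid
    return q * scale                                           # dequantize

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)
w_q = fake_quant_int6(w)  # same shape, values on a 63-level per-row grid
```

Per-row scaling (rather than per-tensor) keeps outlier rows from crushing the resolution of the rest, which matters at 6 bits.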
Architecture
MLP3x
Expanded MLP hidden size to 3x baseline using int6 savings
parameters: {"mlp_mult":3,"hidden":1536}
tied embeddings
Kept the tied token embedding/output head as an fp16 passthrough (excluded from quantization) to avoid a quantization penalty on the output head
parameters: {"tie_embeddings":1}
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"matrix_lr":0.02,"scalar_lr":0.02,"tied_embed_lr":0.03,"muon_momentum_warmup_start":0.92,"muon_momentum_warmup_steps":1500}
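The hyperparameters above imply a Muon momentum warmup from 0.92 to 0.99 over 1,500 steps; a small sketch of that schedule follows. Linear interpolation is an assumption — the PR only lists the endpoints and the warmup length.

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    """Warm Muon's momentum from `start` to `end` over `warmup_steps`.

    Linear interpolation is an assumption; the PR specifies only
    muon_momentum_warmup_start=0.92 and muon_momentum_warmup_steps=1500,
    with momentum: 0.99 as the final value.
    """
    frac = min(step / warmup_steps, 1.0)  # clamp after warmup completes
    return start + frac * (end - start)
```

Starting with lower momentum while the loss surface is changing fastest, then raising it, is a common stabilization trick for momentum-heavy optimizers.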
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64,"context_length":4096}
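A sketch of how stride-64 sliding-window scoring could partition a token stream: each window spans up to 4,096 tokens, but only the tokens after the previous window's end are scored, so every scored token (after the first window) sees nearly full context. This is the standard stride-based evaluation pattern; the exact implementation in the PR may differ.

```python
def sliding_windows(n_tokens, context=4096, stride=64):
    """Return (begin, end, score_from) spans for sliding-window eval.

    Tokens in [score_from, end) are scored; tokens in [begin, score_from)
    are context only. With stride << context, each scored token sees
    roughly `context - stride` tokens of history.
    """
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + context, n_tokens)
        spans.append((begin, end, prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

The cost is one forward pass per 64 scored tokens instead of per 4,096, trading roughly 64x more compute for near-full-context predictions.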
Sequence Length
sequence_length
train_length: 4096
eval_length: 4096
LR Schedule
warmdown
parameters: {"warmdown_iters":3000}
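A sketch of the warmdown schedule implied by warmdown_iters=3000: hold the base LR, then decay over the final 3,000 iterations. The linear shape and decay-to-zero endpoint are assumptions, as is the total iteration count in the example.

```python
def warmdown_lr(step, total_iters, base_lr, warmdown_iters=3000):
    """Constant LR, then a linear 'warmdown' to zero over the final iters.

    The linear shape and zero endpoint are assumptions; the PR only
    specifies warmdown_iters=3000.
    """
    decay_start = total_iters - warmdown_iters
    if step < decay_start:
        return base_lr
    return base_lr * (total_iters - step) / warmdown_iters
```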
Other
other
Tuned training dynamics under a wallclock-limited budget with a large per-batch token count
parameters: {"train_batch_tokens":393216,"max_wallclock_seconds":600}

Novel Contributions

  • Int6 per-row quantization plus zstd-22 compression to fit a wider model within the 16MB budget
  • 3x MLP expansion enabled by quantization savings
  • STE fake int6 quantization-aware training to improve post-quantization robustness
  • fp16 tied embedding passthrough to preserve output head quality
  • Sliding window evaluation with stride 64 for near-full-context scoring
  • Co-optimized training dynamics including Muon momentum tuning and warmdown schedule
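To make the 16MB budget concrete: int6 values pack 4-to-3 into bytes before the zstd-22 pass. The on-disk layout below is hypothetical — the PR does not specify its packing format — but it shows the 0.75 bytes/weight arithmetic that funds the 3x MLP expansion.

```python
def pack_int6(q):
    """Pack int6 values (each in [-32, 31]) into bytes, 4 values per 3 bytes.

    Hypothetical layout (the PR does not document its on-disk format):
    values are biased to unsigned [0, 63] and bit-packed big-endian.
    """
    assert len(q) % 4 == 0
    out = bytearray()
    for i in range(0, len(q), 4):
        a, b, c, d = ((v + 32) & 0x3F for v in q[i:i + 4])  # bias to unsigned
        out.append((a << 2) | (b >> 4))
        out.append(((b & 0xF) << 4) | (c >> 2))
        out.append(((c & 0x3) << 6) | d)
    return bytes(out)

def unpack_int6(buf):
    """Inverse of pack_int6: recover signed int6 values from packed bytes."""
    vals = []
    for i in range(0, len(buf), 3):
        x, y, z = buf[i], buf[i + 1], buf[i + 2]
        vals += [x >> 2, ((x & 0x3) << 4) | (y >> 4),
                 ((y & 0xF) << 2) | (z >> 6), z & 0x3F]
    return [v - 32 for v in vals]
```

At 6 bits per weight plus per-row fp scales, the pre-compression payload is already ~2.7x smaller than fp16, and zstd-22 squeezes out remaining redundancy in the packed stream.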