PR #191 (open)

Record: Compression-Funded MLP3x (val_bpb=1.1598)

by chris-buckley
val_bpb: 1.1598
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.9 MB

Training Techniques

Quantization
  • int6 (bits: 6; scope: all large weight matrices)
  • fp16 (bits: 16; scope: tied embeddings and last two c_k weights)
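The PR doesn't spell out the quantization scheme beyond the bit width and scope; a minimal symmetric per-tensor int6 sketch (helper names are hypothetical, pure Python for clarity) might look like:

```python
def quantize_int6(weights):
    # Symmetric per-tensor quantization: map floats onto the 6-bit
    # signed range [-31, 31] with a single fp scale per tensor.
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 31.0 if max_abs > 0 else 1.0
    q = [max(-31, min(31, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int6(q, scale):
    # Recover approximate fp weights; per-weight error is at most scale / 2.
    return [qi * scale for qi in q]

weights = [0.5, -1.2, 0.03, 0.9]
q, scale = quantize_int6(weights)
recovered = dequantize_int6(q, scale)
```

At 6 bits per weight versus 16, the large matrices shrink to roughly 3/8 of their fp16 size, which is the artifact budget the MLP widening spends.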
Architecture
  • MLP3x: widened the MLP from 2x to 3x using the saved artifact budget (parameters: {"mlp_mult": 3})
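Back-of-envelope for what the widening costs in parameters (d_model = 768 is illustrative; the card doesn't state the model dimensions):

```python
def mlp_params(d_model, mlp_mult):
    # Two bias-free projections: d_model -> mlp_mult*d_model -> d_model.
    d_hidden = mlp_mult * d_model
    return d_model * d_hidden + d_hidden * d_model

base = mlp_params(768, 2)  # the 2x baseline
wide = mlp_params(768, 3)  # this PR's MLP3x
```

Going from 2x to 3x grows the MLP by 50%, which is what the int6 savings on the large matrices have to cover.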
  • tied embeddings: input and output embeddings are tied
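Weight tying aliases the output head to the input embedding table, so only one copy of that matrix lands in the artifact. A toy sketch:

```python
class TinyLM:
    def __init__(self, vocab_size, d_model):
        # Input embedding table: one row of d_model floats per token.
        self.wte = [[0.0] * d_model for _ in range(vocab_size)]
        # Output head reuses the same object: no second matrix to store.
        self.lm_head = self.wte

model = TinyLM(vocab_size=4, d_model=2)
model.wte[0][0] = 1.5  # an update to the embedding is visible in the head
```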
  • KV head count: uses fewer KV heads than attention heads (parameters: {"num_heads": 8, "num_kv_heads": 4})
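With num_kv_heads=4 serving num_heads=8, each KV head is shared by a group of two query heads (grouped-query attention), halving the K/V projection weights and cache. A shape-only sketch of the grouping:

```python
def expand_kv_heads(kv_heads, num_heads):
    # Each stored KV head serves num_heads / num_kv_heads query heads.
    group_size = num_heads // len(kv_heads)
    return [h for h in kv_heads for _ in range(group_size)]

kv = ["kv0", "kv1", "kv2", "kv3"]    # 4 stored KV heads
per_query = expand_kv_heads(kv, 8)   # broadcast to 8 query heads
```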
Compression
  • zlib (level: null)
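level: null presumably falls back to zlib's default. Compression of the serialized weight bytes is lossless, so it stacks on top of quantization; a sketch (the buffer contents are illustrative stand-ins for packed quantized weights):

```python
import struct
import zlib

# A regular integer buffer stands in for packed quantized weights.
raw = struct.pack("256i", *range(256))
compressed = zlib.compress(raw)         # default compression level
restored = zlib.decompress(compressed)  # lossless round trip
```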
Evaluation
  • sliding window eval (parameters: {"stride": 256})
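With stride 256 and a 2048-token window, each window after the first re-reads up to 1792 tokens of context but scores only its new tokens, so every validation token is scored exactly once. A sketch of the span bookkeeping (helper name is illustrative):

```python
def sliding_eval_spans(n_tokens, window=2048, stride=256):
    # Returns (start, end, score_from): feed tokens[start:end] to the model,
    # count loss only over tokens[score_from:end].
    spans = []
    start = 0
    while True:
        end = min(start + window, n_tokens)
        # Never re-score tokens covered by the previous window.
        score_from = 0 if start == 0 else max(spans[-1][1], end - stride)
        spans.append((start, end, score_from))
        if end == n_tokens:
            break
        start += stride
    return spans

spans = sliding_eval_spans(2560)
```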
Sequence Length
  • sequence_length (train_length: 2048; eval_length: 2048)
LR Schedule
  • warmup/warmdown (parameters: {"warmup_steps": 1500, "warmdown_iters": 3000})
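A trapezoidal schedule matching these parameters: linear warmup over 1500 steps, a flat top, then a linear warmdown over the final 3000 iterations (total_steps is illustrative; the base LR of 0.02 comes from the Other section):

```python
def lr_at(step, total_steps, base_lr, warmup_steps=1500, warmdown_iters=3000):
    # Linear warmup, flat top, linear warmdown to zero.
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    if step > total_steps - warmdown_iters:
        return base_lr * (total_steps - step) / warmdown_iters
    return base_lr

peak = lr_at(5000, 10000, 0.02)  # mid-run: full learning rate
```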
Regularization
  • gradient clipping (parameters: {"grad_clip_norm": 0.3})
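Global-norm clipping at 0.3: if the combined gradient norm exceeds the threshold, every gradient is scaled down proportionally. A pure-Python sketch over a flat gradient list:

```python
import math

def clip_grad_norm(grads, max_norm=0.3):
    # Rescale all gradients so their global L2 norm is at most max_norm.
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm

clipped, norm = clip_grad_norm([3.0, 4.0])  # global norm 5.0, well over 0.3
```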
Optimizer
  • Muon (momentum: 0.99; weight_decay: null; other_params: {"momentum_warmup_start": 0.92, "momentum_warmup_steps": 1500})
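The momentum warmup ramps Muon's momentum linearly from 0.92 to its final 0.99 over the first 1500 steps; a sketch of that schedule (function name is illustrative):

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    # Linear ramp during warmup, then hold at the final momentum.
    frac = min(step / warmup_steps, 1.0)
    return start + frac * (end - start)

early = muon_momentum(0)     # starts at 0.92
late = muon_momentum(1500)   # holds at 0.99 thereafter
```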
Other
  • Used the seq2048 long-context training recipe with tuned learning rates and scoring on the full validation split (parameters: {"train_batch_tokens": 786432, "matrix_lr": 0.02, "scalar_lr": 0.02, "tied_embed_lr": 0.03})

Novel Contributions

  • Int6 block-weight compression to free artifact budget
  • Widening the MLP from 2x to 3x using the saved bytes
  • Keeping tied embeddings and selected attention weights in fp16 while compressing large matrices
  • Seq2048 long-context training recipe
  • Stride-256 sliding-window evaluation
  • Muon momentum warmup and tuned learning rates