PR #70

open

Submission: Wider MLP 3x + int6 quant + sliding window eval, val_bpb=1.1659

by jfprincz
val_bpb
1.1659
Architecture
Transformer
Optimizer
Muon
Artifact Size
14,855,508 bytes

Training Techniques

Architecture
MLP3x
Widened the MLP expansion from 2x to 3x (hidden size 1536) to improve performance.
parameters: {"mlp_mult":3,"hidden_size":1536}
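A minimal NumPy sketch of the widened feed-forward block, assuming a 512-dim model (implied by hidden_size = 1536 = 3 × 512) and a ReLU activation; the submission does not state which activation is actually used:

```python
import numpy as np

def mlp_3x(x, w_in, w_out):
    """Feed-forward block with a 3x expansion: d_model -> 3*d_model -> d_model.
    ReLU is an assumption here; the submission only specifies the widths."""
    h = np.maximum(x @ w_in, 0.0)   # (batch, 3*d_model)
    return h @ w_out                # (batch, d_model)

d_model, hidden = 512, 1536        # hidden_size = mlp_mult * d_model = 3 * 512
rng = np.random.default_rng(0)
w_in = rng.standard_normal((d_model, hidden)) * 0.02
w_out = rng.standard_normal((hidden, d_model)) * 0.02
x = rng.standard_normal((4, d_model))
y = mlp_3x(x, w_in, w_out)         # (4, 512)
```

Going from 2x to 3x raises MLP parameter count by 50%, which is what makes the aggressive quantization and compression below necessary to stay under the artifact limit.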
tied embeddings
Uses tied input/output embeddings.
parameters: null
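Weight tying means one matrix serves as both the input embedding table and the output projection, roughly as in this sketch (vocabulary and width here are illustrative, not the submission's values):

```python
import numpy as np

# Tied embeddings: a single matrix maps token ids to vectors on the way in
# and hidden states to vocabulary logits on the way out, so the parameter
# (and artifact) cost is paid once.
vocab, d_model = 1000, 64          # illustrative sizes only
rng = np.random.default_rng(1)
embed = rng.standard_normal((vocab, d_model)) * 0.02

tokens = np.array([3, 17, 42])
x = embed[tokens]                  # input: embedding lookup, (3, d_model)
logits = x @ embed.T               # output: same weights reused, (3, vocab)
```

This is why the optimizer settings list a separate `tied_embed_lr`: the shared matrix gets its own learning rate.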
KV head count
Uses fewer KV heads than attention heads.
parameters: {"num_heads":8,"num_kv_heads":4}
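With 8 query heads over 4 KV heads, each KV head is shared by a group of 2 query heads (grouped-query attention), halving KV storage. A NumPy sketch under assumed head dimensions:

```python
import numpy as np

# Grouped-query attention: num_heads // num_kv_heads = 2 query heads
# share each KV head. head_dim and sequence length are illustrative.
num_heads, num_kv_heads, head_dim, T = 8, 4, 16, 5
group = num_heads // num_kv_heads
rng = np.random.default_rng(2)
q = rng.standard_normal((num_heads, T, head_dim))
k = rng.standard_normal((num_kv_heads, T, head_dim))
v = rng.standard_normal((num_kv_heads, T, head_dim))

# Expand each KV head across its query-head group before attention.
k_full = np.repeat(k, group, axis=0)   # (8, T, head_dim)
v_full = np.repeat(v, group, axis=0)

scores = q @ k_full.transpose(0, 2, 1) / np.sqrt(head_dim)   # (8, T, T)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)                    # softmax
out = weights @ v_full                                        # (8, T, head_dim)
```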
Quantization
mixed int6/int8
bits: 6
scope: int6 per-row on MLP and attention projection weights; int8 per-row on embeddings and other tensors
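A sketch of symmetric per-row quantization at both bit widths, assuming a symmetric scheme with one float scale per row (the submission states bits and granularity but not the exact codec):

```python
import numpy as np

def quantize_per_row(w, bits):
    """Symmetric per-row quantization: one fp32 scale per row, integer codes.
    int6 codes span [-31, 31], int8 codes span [-127, 127] (assumed)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0                      # avoid divide-by-zero rows
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(3)
w = (rng.standard_normal((1536, 512)) * 0.02).astype(np.float32)
q6, s6 = quantize_per_row(w, bits=6)   # MLP / attention projections
q8, s8 = quantize_per_row(w, bits=8)   # embeddings and other tensors
err6 = np.abs(dequantize(q6, s6) - w).max()
err8 = np.abs(dequantize(q8, s8) - w).max()
```

int6 on the large projection matrices buys most of the size savings; the more error-sensitive embeddings keep the finer int8 grid.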
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":256}
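One plausible reading of stride-256 sliding-window evaluation, sketched below: each window sees up to 1024 tokens of context (the eval length), but only the final 256 tokens are scored (all of the first window is scored), so every token is scored exactly once with near-maximal left context.

```python
def sliding_windows(n_tokens, window=1024, stride=256):
    """Yield (start, end, n_scored) spans for sliding-window evaluation.
    Only the last `stride` tokens of each window are scored, except the
    first window, where every token is scored. Assumed layout; the
    submission specifies only the stride."""
    spans = []
    pos = 0                         # first not-yet-scored token
    while pos < n_tokens:
        if pos == 0:
            start, end = 0, min(window, n_tokens)
        else:
            end = min(pos + stride, n_tokens)
            start = max(0, end - window)
        spans.append((start, end, end - pos))
        pos = end
    return spans

spans = sliding_windows(2500)       # e.g. a 2500-token validation shard
```

The cost is roughly `window / stride` (here 4x) more forward passes than non-overlapping evaluation, traded for a better val_bpb.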
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"matrix_lr":0.02,"scalar_lr":0.02,"tied_embed_lr":0.03,"momentum_warmup_steps":1500,"momentum_warmup_start":0.92,"warmdown_iters":3000}
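The momentum warmup listed above (0.92 → 0.99 over 1500 steps) could be scheduled as in this sketch; a linear ramp is an assumption, since the submission lists only the endpoints and step count:

```python
def muon_momentum(step, warmup_steps=1500, start=0.92, final=0.99):
    """Momentum warmup for Muon: ramp from `start` to `final` over
    `warmup_steps`, then hold. Linear interpolation is assumed."""
    if step >= warmup_steps:
        return final
    return start + (step / warmup_steps) * (final - start)
```

Starting with lower momentum keeps early, noisy updates from being amplified before the loss landscape statistics settle.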
LR Schedule
warmdown
parameters: {"warmdown_iters":3000,"momentum_warmup_steps":1500,"momentum_warmup_start":0.92}
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024

Novel Contributions

  • Wider 3x MLP expansion to increase model capacity while staying under the artifact limit
  • Mixed-precision quantization: int6 per-row for MLP and attention projection weights, int8 per-row for embeddings and remaining tensors
  • Sliding window evaluation with stride 256 to improve validation score using more context per scored token
  • Use of zstd level 22 compression to fit the larger model within the 16MB submission limit
  • Optimizer tuning for Muon with per-group learning rates, momentum warmup, and an LR warmdown schedule