PR #120

closed

[Val Only]: MLP 3x + STE int6 QAT + sliding window, val_bpb=0.9588

by andrewgcodes
val_bpb: 0.9588
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,381,981 bytes

Training Techniques

Architecture
MLP3x
Expanded MLP hidden dimension to 1536, a 3x feedforward expansion.
parameters: {"hidden_dim":1536,"layers":9,"model_dim":512,"num_heads":8,"num_kv_heads":4}
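A minimal sketch of the expanded feedforward block (PyTorch assumed; the ReLU² activation is an assumption common in nanochat-style MLPs, not stated in this card):

```python
import torch
import torch.nn as nn

class MLP3x(nn.Module):
    """Feedforward block with a 3x expansion: model_dim 512 -> 1536 -> 512."""

    def __init__(self, model_dim: int = 512, hidden_dim: int = 1536):
        super().__init__()
        self.up = nn.Linear(model_dim, hidden_dim, bias=False)
        self.down = nn.Linear(hidden_dim, model_dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # ReLU^2 activation (assumed); swap in GELU etc. if the repo differs.
        return self.down(torch.relu(self.up(x)).square())
```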
RoPE
Extended RoPE base frequency for improved long-range position encoding.
parameters: {"base":200000}
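The base change only alters the inverse-frequency table used by rotary embeddings. A sketch (head_dim of 64 is illustrative, consistent with model_dim 512 over 8 heads):

```python
import torch

def rope_inv_freq(head_dim: int = 64, base: float = 200_000.0) -> torch.Tensor:
    # Standard RoPE inverse frequencies. Raising the base from the usual
    # 10,000 to 200,000 slows rotation in the higher dimension pairs, which
    # helps long-range positions remain distinguishable.
    return 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
```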
Quantization
STE QAT
bits: 6
scope: transformer blocks
int8 per-row
bits: 8
scope: embeddings
mixed int6/int8
bits: 6 and 8 (mixed)
scope: transformer blocks and embeddings
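A hedged sketch of the two quantizers in PyTorch. The scaling granularity for the int6 path is not spelled out in this card, so symmetric per-tensor scaling is an assumption; the int8 path follows the per-row label above:

```python
import torch

def ste_fake_quant_int6(w: torch.Tensor, bits: int = 6) -> torch.Tensor:
    # Quantization-aware training with a straight-through estimator (STE):
    # the forward pass sees int6-rounded weights, while the backward pass
    # treats rounding as identity so gradients reach the full-precision w.
    qmax = 2 ** (bits - 1) - 1                       # 31 for int6
    scale = w.abs().amax().clamp(min=1e-8) / qmax    # symmetric, per-tensor (assumed)
    w_q = (w / scale).round().clamp(-qmax, qmax) * scale
    return w + (w_q - w).detach()

def quant_embeddings_int8_per_row(w: torch.Tensor):
    # Post-training int8 with one scale per embedding row.
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    q = (w / scale).round().clamp(-127, 127).to(torch.int8)
    return q, scale                                  # dequantize: q.float() * scale
```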
Evaluation
sliding window eval
parameters: {"stride":64}
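One way to realize stride-64 sliding-window evaluation is to score only the trailing tokens of each overlapping window, so every token is predicted with near-maximal left context. A pure-Python sketch of the window bookkeeping (the window size of 512 is an assumption; only the stride of 64 comes from this card):

```python
def sliding_windows(n_tokens: int, window: int = 512, stride: int = 64):
    """Yield (start, end, first_scored) spans: the model sees tokens
    [start, end) but only tokens [first_scored, end) contribute to the
    loss, so each scored token gets extra left context from the overlap."""
    spans, scored_upto, start = [], 0, 0
    while scored_upto < n_tokens:
        end = min(start + window, n_tokens)
        spans.append((start, end, scored_upto))
        scored_upto = end
        start += stride
    return spans
```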
LR Schedule
warmdown
parameters: {"warmdown_steps":14000,"schedule":"cosine decay"}
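Assuming the schedule holds the learning rate constant and then cosine-decays it to zero over the final 14,000 steps, the warmdown can be sketched as follows (the total step count is illustrative, not from this card):

```python
import math

def warmdown_lr(step: int, base_lr: float = 0.025,
                total_steps: int = 20_000, warmdown_steps: int = 14_000) -> float:
    # Constant LR, then "warmdown": cosine decay to 0 over the last
    # warmdown_steps of training.
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    progress = (step - start) / warmdown_steps
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```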
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"lr":0.025,"momentum_warmup_start":0.92,"momentum_warmup_steps":1500}
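The momentum warmup implied by other_params can be sketched as a linear ramp (linearity is an assumption; the card only gives the start value 0.92, the target 0.99, and the 1,500-step duration):

```python
def muon_momentum(step: int, start: float = 0.92,
                  target: float = 0.99, warmup_steps: int = 1_500) -> float:
    # Ramp momentum from 0.92 up to the steady-state 0.99 over the first
    # 1,500 steps; lower early momentum keeps updates from overshooting
    # while gradient statistics are still noisy.
    frac = min(step / warmup_steps, 1.0)
    return start + frac * (target - start)
```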
Other
Val-only training: the model was trained directly on the validation shard, so the reported val_bpb reflects memorization rather than generalization.
parameters: null

Novel Contributions

  • MLP 3x expansion with hidden dimension 1536
  • STE fake-int6 quantization-aware training
  • Mixed post-training quantization with int6 transformer blocks and int8 embeddings
  • Sliding window evaluation with stride 64
  • Extended RoPE base frequency of 200,000
  • Extended cosine warmdown of the learning rate over 14,000 steps
  • Tuned Muon optimizer settings
  • Val-only training on the validation shard