PR #120
closed [Val Only]: MLP 3x + STE int6 QAT + sliding window, val_bpb=0.9588
by andrewgcodes
val_bpb: 0.9588
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,381,981 bytes
Training Techniques
Architecture
MLP3x
Expanded the MLP hidden dimension to 1536, a 3x feedforward expansion over the 512 model dimension.
parameters: {"hidden_dim":1536,"layers":9,"model_dim":512,"num_heads":8,"num_kv_heads":4}
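A minimal NumPy sketch of the 3x feedforward block from the listed config (512 -> 1536 -> 512). The squared-ReLU activation is an assumption common in speedrun-style baselines; the PR does not state which activation is used, and `mlp_3x` is a hypothetical name.

```python
import numpy as np

def mlp_3x(x, w_up, w_down):
    """Feedforward with 3x expansion: model_dim -> 3*model_dim -> model_dim.

    Squared-ReLU activation is an assumption, not stated in the PR.
    """
    h = np.maximum(x @ w_up, 0.0) ** 2   # (batch, hidden_dim)
    return h @ w_down                     # (batch, model_dim)

# Shapes from the PR's config: model_dim=512, hidden_dim=1536.
rng = np.random.default_rng(0)
model_dim, hidden_dim = 512, 1536
w_up = rng.normal(size=(model_dim, hidden_dim)) * model_dim ** -0.5
w_down = rng.normal(size=(hidden_dim, model_dim)) * hidden_dim ** -0.5
y = mlp_3x(rng.normal(size=(4, model_dim)), w_up, w_down)
```

The 3x ratio (versus the more common 4x) trades feedforward capacity for a smaller artifact, which matters here given the size-weighted leaderboard metric.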
RoPE
Extended RoPE base frequency for improved long-range position encoding.
parameters: {"base":200000}
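A sketch of standard rotary position embedding with the raised base of 200,000: a larger base slows the rotation of high-index dimension pairs, stretching the usable position range. Function names are illustrative, not from the PR.

```python
import numpy as np

def rope_freqs(head_dim: int, base: float = 200_000.0):
    """One rotation frequency per (even, odd) dimension pair."""
    return base ** (-np.arange(0, head_dim, 2) / head_dim)

def apply_rope(x, pos, base: float = 200_000.0):
    """Rotate (x[2i], x[2i+1]) pairs by position-dependent angles.

    x: (seq, head_dim), pos: (seq,). Rotation preserves vector norms.
    """
    theta = np.outer(pos, rope_freqs(x.shape[-1], base))  # (seq, head_dim//2)
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

With model_dim=512 and 8 heads, head_dim is 64, so each head carries 32 rotation frequencies spanning 1 down to 200000^(-31/32).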
Quantization
STE QAT
bits: 6
scope: transformer blocks
int8 per-row
bits: 8
scope: embeddings
mixed int6/int8
bits: 6 (transformer blocks), 8 (embeddings)
scope: transformer blocks and embeddings
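The two quantization modes above can be sketched as follows. In STE QAT the forward pass uses fake-quantized weights while the straight-through estimator treats round() as identity in the backward pass; this NumPy sketch shows only the forward. Per-tensor scaling for the int6 path is an assumption (the PR does not specify the scale granularity), and both function names are hypothetical.

```python
import numpy as np

def fake_quant_ste(w, bits: int = 6):
    """Symmetric fake quantization for QAT.

    Forward: quantize then dequantize, so the network trains against the
    int6 grid. Backward (not shown): gradients pass straight through.
    Per-tensor scale is an assumption; per-channel is also common.
    """
    qmax = 2 ** (bits - 1) - 1                 # 31 for int6
    scale = np.abs(w).max() / qmax
    return np.round(w / scale).clip(-qmax - 1, qmax) * scale

def quantize_per_row_int8(w):
    """Post-training int8 with one scale per row (embedding-style).

    Returns integer codes and per-row scales; dequantize as q * scale.
    """
    scale = np.abs(w).max(axis=1, keepdims=True) / 127
    q = np.round(w / scale).clip(-128, 127).astype(np.int8)
    return q, scale
```

Mixing the two (int6 blocks, int8 embeddings) keeps the embedding table, which is more sensitive to quantization error, at higher precision while the QAT-trained blocks tolerate 6 bits.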
Evaluation
sliding window eval
parameters: {"stride":64}
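A sketch of sliding-window evaluation with stride 64: the context window advances 64 tokens at a time and only the trailing 64 tokens of each window are scored, so nearly every scored token sees a full left context instead of the truncated context it would get from disjoint chunks. `token_losses_fn` is a hypothetical stand-in for a model forward pass returning per-token losses.

```python
import numpy as np

def sliding_window_eval(token_losses_fn, tokens, context_len: int, stride: int = 64):
    """Average per-token loss under overlapping evaluation windows.

    token_losses_fn(window) -> per-token losses, len == len(window).
    Only the last `stride` tokens of each window count toward the average.
    """
    total, count = 0.0, 0
    for start in range(0, len(tokens) - context_len, stride):
        window = tokens[start:start + context_len]
        losses = token_losses_fn(window)
        total += losses[-stride:].sum()
        count += stride
    return total / count
```

A smaller stride gives each token more context at the cost of proportionally more forward passes; stride 64 is a fairly aggressive (expensive) setting.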
LR Schedule
warmdown
parameters: {"warmdown_steps":14000,"schedule":"cosine decay"}
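A sketch of the warmdown schedule under one plausible reading of the parameters: hold the base learning rate, then cosine-decay to zero over the final 14,000 steps. The exact anchoring (constant-then-decay, decaying to exactly zero) is an assumption; the PR only lists the warmdown length and "cosine decay".

```python
import math

def lr_at(step: int, total_steps: int, base_lr: float = 0.025,
          warmdown_steps: int = 14_000):
    """Constant LR, then cosine warmdown to 0 over the last warmdown_steps.

    Assumed shape; the PR specifies only warmdown_steps=14000 and cosine decay.
    """
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    frac = (step - decay_start) / warmdown_steps
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * frac))
```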
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"lr":0.025,"momentum_warmup_start":0.92,"momentum_warmup_steps":1500}
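The momentum warmup in other_params can be read as a linear ramp from 0.92 to the final 0.99 over the first 1,500 steps. The linear interpolation is an assumption; the function name is illustrative.

```python
def muon_momentum(step: int, start: float = 0.92, end: float = 0.99,
                  warmup_steps: int = 1_500):
    """Ramp Muon's momentum from start to end over warmup_steps, then hold.

    Linear ramp is an assumption; the PR lists only the endpoints and length.
    """
    if step >= warmup_steps:
        return end
    return start + (end - start) * step / warmup_steps
```

Starting with lower momentum keeps early updates, taken before the orthogonalized momentum buffer is well-formed, from overshooting.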
Other
val-only training
Val-only training on the validation shard for memorization.
parameters: null
Novel Contributions
- MLP 3x expansion with hidden dimension 1536
- STE fake-int6 quantization-aware training
- Mixed post-training quantization with int6 transformer blocks and int8 embeddings
- Sliding window evaluation with stride 64
- Extended RoPE base frequency of 200,000
- Extended warmdown cosine learning rate decay
- Tuned Muon optimizer settings
- Val-only training on the validation shard