PR #444
[Non-Record] MLP3x + WD0.04 + OrthoInit + Sliding Eval — 1.4536 BPB
Status: open · by AymanMahfuz27
val_bpb: 1.4536
Architecture: Transformer
Optimizer: Muon
Artifact Size: 14,660,530 bytes
Training Techniques
Architecture
MLP3x
Widens the feedforward hidden dimension from 2x model_dim to 3x model_dim.
parameters: {"mlp_mult":3}
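The multiplier only changes the two projection sizes, so the cost is easy to count. A quick sketch in pure Python (d_model of 512 is illustrative, and an unbiased, ungated two-matrix MLP is assumed): going from 2x to 3x adds 50% more MLP parameters.

```python
def mlp_params(d_model: int, mlp_mult: int) -> int:
    # Two-matrix feedforward block: up-projection (d_model -> hidden)
    # followed by down-projection (hidden -> d_model), no biases.
    hidden = mlp_mult * d_model
    return d_model * hidden + hidden * d_model

d_model = 512                   # illustrative width, not from the PR
base = mlp_params(d_model, 2)   # default 2x hidden dim
wide = mlp_params(d_model, 3)   # this PR's mlp_mult = 3
print(base, wide, wide / base)  # 1048576 1572864 1.5
```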
tied embeddings
Uses tied input/output embeddings.
parameters: null
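Tying shares one vocab_size x d_model matrix between the input embedding and the output head, which directly shrinks the artifact. A rough count with illustrative sizes (neither number is from the PR):

```python
vocab_size, d_model = 50_304, 512   # illustrative sizes, not from the PR
untied = 2 * vocab_size * d_model   # separate embedding and LM-head matrices
tied = vocab_size * d_model         # one shared matrix serves both roles
saved = untied - tied               # tying halves the embedding parameter count
```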
KV head count
Uses fewer KV heads than attention heads.
parameters: {"num_heads":8,"num_kv_heads":4}
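With grouped-query attention, each KV head is shared by num_heads / num_kv_heads query heads, and the KV cache (and K/V projection weights) shrink proportionally. A sketch of the cache arithmetic, using the PR's 8 query / 4 KV heads (layer count, sequence length, and head_dim are illustrative):

```python
def kv_cache_elems(n_layers: int, seq_len: int, num_kv_heads: int, head_dim: int) -> int:
    # One K tensor and one V tensor per layer, each (seq_len, num_kv_heads, head_dim).
    return 2 * n_layers * seq_len * num_kv_heads * head_dim

full = kv_cache_elems(n_layers=8, seq_len=1024, num_kv_heads=8, head_dim=64)  # MHA baseline
gqa  = kv_cache_elems(n_layers=8, seq_len=1024, num_kv_heads=4, head_dim=64)  # this PR
# Halving the KV heads halves the cache.
```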
Regularization
weight decay
parameters: {"weight_decay":0.04}
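Decoupled (AdamW-style) weight decay shrinks each weight toward zero independently of the optimizer's gradient-based step. A minimal scalar sketch; Muon's actual update, an orthogonalized momentum step, is abstracted here as `update`:

```python
def decoupled_decay_step(w: float, update: float, lr: float, weight_decay: float = 0.04) -> float:
    # Apply the optimizer's own step, then an independent multiplicative shrink.
    w = w - lr * update                 # gradient-based part (Muon's step, abstracted)
    w = w * (1.0 - lr * weight_decay)   # decoupled decay, as in AdamW
    return w

# With a zero update, weights decay geometrically toward zero.
w = 1.0
for _ in range(10):
    w = decoupled_decay_step(w, update=0.0, lr=0.5)
```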
Initialization
OrthoInit
Uses orthogonal initialization for 2D CastedLinear weights.
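A standard way to build an orthogonal initializer is QR decomposition of a Gaussian matrix, with a sign correction so the result is uniformly distributed over orthogonal matrices. A NumPy sketch (it is assumed here that CastedLinear holds a plain 2D weight):

```python
import numpy as np

def orthogonal_init(rows: int, cols: int, seed: int = 0) -> np.ndarray:
    # QR of a Gaussian matrix gives orthonormal columns; multiplying each
    # column by the sign of R's diagonal removes the sign ambiguity.
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((rows, cols)))
    return q * np.sign(np.diag(r))

w = orthogonal_init(8, 4)
# Columns are orthonormal: w.T @ w is the 4x4 identity.
```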
Evaluation
sliding window eval
parameters: {"stride":64}
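Sliding evaluation re-scores the sequence in overlapping windows: each step advances by `stride` tokens but takes loss only on tokens not yet covered, so every token is scored exactly once while getting far more left context than disjoint chunking. A sketch of the index bookkeeping (only stride=64 comes from the PR; the window length is illustrative):

```python
def sliding_eval_spans(n_tokens: int, window: int, stride: int):
    # Yield (ctx_begin, end, target_begin): the model conditions on
    # tokens[ctx_begin:end] but loss is taken only on tokens[target_begin:end].
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans

spans = sliding_eval_spans(n_tokens=300, window=256, stride=64)
# The target ranges tile [0, 300) with no overlap and no gaps.
```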
Compression
zlib
level: null
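zlib is applied to the serialized weight bytes; quantized weights compress well because their byte distribution is far from uniform. A minimal round-trip sketch (the payload is synthetic, not the PR's artifact, and since the card lists level as null, the explicit level here is only illustrative):

```python
import zlib

payload = bytes([i % 16 for i in range(10_000)])  # synthetic, highly regular "weights"
compressed = zlib.compress(payload, level=9)      # level is illustrative; null means the default
restored = zlib.decompress(compressed)
# Lossless: restored == payload, and compressed is much smaller.
```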
Quantization
int8
bits: 8
scope: all
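With scope `all`, every weight tensor is stored as int8 plus a scale. One common scheme, symmetric per-tensor quantization, sketched in pure Python (the PR's exact scheme is not specified in this card):

```python
def quantize_int8(values):
    # Symmetric per-tensor quantization: map the max-abs value to 127.
    scale = max(abs(v) for v in values) / 127.0
    scale = scale or 1.0  # guard against an all-zeros tensor
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize_int8(q, scale):
    return [qi * scale for qi in q]

weights = [0.5, -1.27, 0.003, 1.27]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
# Round-trip error per weight is at most scale / 2.
```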
Weight Averaging
SWA
parameters: null
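SWA averages weights from several late-training checkpoints (the contributions list below notes it proved quantization-hostile in this setting). The averaging itself is just a uniform per-parameter mean, sketched in pure Python over flat parameter lists:

```python
def swa_average(checkpoints):
    # Uniform average of flat parameter lists from several checkpoints.
    n = len(checkpoints)
    return [sum(params) / n for params in zip(*checkpoints)]

avg = swa_average([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # -> [3.0, 4.0]
```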
Novel Contributions
- MLP width multiplier of 3x as the main architecture improvement
- Decoupled weight decay in Muon to improve post-quantization BPB and reduce the quantization gap
- Orthogonal initialization for linear weights
- Sliding-window evaluation with stride 64 for better validation BPB
- Implementation of additional optional features: int6 quantization, quantization-aware training (QAT), bigram hash embeddings, and zstd compression
- Empirical finding that SWA is quantization-hostile in this setting