val_bpb: 1.4530
Architecture: Standard GPT / Transformer
Optimizer: Muon
Artifact Size: 15.35 MB
Training Techniques
Architecture
MLP3x
Increased the MLP hidden width to 3x the model dimension, giving a hidden dimension of 1536 instead of the default 1024 (a 2x multiplier).
parameters: {"layers":9,"model_dim":512,"heads":8,"kv_heads":4,"ffn_hidden_dim":1536}
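The listed parameters can be captured in a minimal config sketch. This is illustrative only: the `GPTConfig` class and `mlp_mult` field are assumed names, not the project's actual code.

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    # values taken from the reported parameters
    layers: int = 9
    model_dim: int = 512
    heads: int = 8
    kv_heads: int = 4
    mlp_mult: int = 3  # 3x MLP multiplier instead of the default 2x

    @property
    def ffn_hidden_dim(self) -> int:
        # hidden width is derived from the model dimension
        return self.mlp_mult * self.model_dim

cfg = GPTConfig()  # cfg.ffn_hidden_dim == 1536
```

Deriving `ffn_hidden_dim` from `model_dim` keeps the sweep to a single knob (`mlp_mult`) rather than two coupled ones.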
KV head count
Used grouped-query attention with fewer KV heads (4) than query heads (8), so each KV head is shared by two query heads.
parameters: {"heads":8,"kv_heads":4}
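With 8 query heads and 4 KV heads, each KV head serves a group of two query heads. A minimal numpy sketch of that sharing, assuming per-head tensors of shape `(heads, seq, head_dim)`; the function name and layout are assumptions, not the project's API:

```python
import numpy as np

def grouped_query_attention(q, k, v):
    # q: (heads, seq, d); k, v: (kv_heads, seq, d) with kv_heads dividing heads
    heads, kv_heads = q.shape[0], k.shape[0]
    group = heads // kv_heads          # query heads per KV head (2 here)
    k = np.repeat(k, group, axis=0)    # broadcast KV heads to match query heads
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))  # stable softmax
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v
```

The KV cache shrinks in proportion to `kv_heads / heads` (here 2x), which is the usual motivation for GQA at small memory budgets.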
Quantization
int8
bits: 8
scope: all
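The exact quantizer is not specified; a per-tensor symmetric scheme is one common way to realize `bits: 8, scope: all`, and the sketch below assumes that scheme:

```python
import numpy as np

def quantize_int8(w):
    # symmetric per-tensor int8: a single float scale maps [-max|w|, max|w|] to [-127, 127]
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # recover an approximation of the original weights
    return q.astype(np.float32) * scale
```

With `scope: all`, every tensor in the artifact would pass through this, which is what brings the serialized size down before compression.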
Compression
zlib
level: null
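`level: null` presumably means zlib's library default. A hedged sketch of compressing serialized artifact bytes that way (the helper name is an assumption):

```python
import zlib

def compress_artifact(raw: bytes, level=None) -> bytes:
    # level=None falls back to Z_DEFAULT_COMPRESSION (zlib's built-in default, -1)
    return zlib.compress(raw, zlib.Z_DEFAULT_COMPRESSION if level is None else level)
```

Applied after int8 quantization, zlib exploits the remaining redundancy in the weight bytes; the two stages together account for the 15.35 MB artifact size.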
Optimizer
Muon
weight_decay: null
momentum: null
other_params: null
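All Muon hyperparameters are reported as null, so the sketch below only illustrates the general shape of a Muon update: momentum SGD whose 2D update matrix is approximately orthogonalized by a Newton-Schulz iteration. The quintic coefficients follow the commonly published values; the learning rate, momentum, and function names are assumptions.

```python
import numpy as np

def newton_schulz(G, steps=5):
    # approximately orthogonalize G: push all singular values toward 1
    a, b, c = 3.4445, -4.7750, 2.0315   # published quintic coefficients
    X = G / (np.linalg.norm(G) + 1e-7)  # normalize so singular values are < 1
    transposed = G.shape[0] > G.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

def muon_step(param, grad, buf, lr=0.02, momentum=0.95):
    # momentum SGD with an orthogonalized update (Nesterov variant omitted for brevity)
    buf = momentum * buf + grad
    return param - lr * newton_schulz(buf), buf
```

Orthogonalizing the update equalizes its singular values, which is the core idea Muon applies to the 2D weight matrices (embeddings and scalars are typically handled by a separate optimizer).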
Novel Contributions
- Automated sweep across 11 configurations on a single RTX 4090.
- Found that wider MLPs (3x) outperformed deeper stacking (12 layers) at this parameter budget.
- Demonstrated competitive, though not record-setting, performance on consumer hardware.
- Used a 3x MLP multiplier with a 9-layer, 512-dimensional GPT-style model.