PR #845
Status: open
12 layers GPT | MLP_MULT reduction | VE and BIGRAM modifications
by rubenbalbastre
val_bpb
1.1407
Architecture
GPT
Optimizer
Parallel Muon
Artifact Size
16MB
Training Techniques
Architecture
GPT depth increase
Increased the model from 11 to 12 layers while reducing parameters elsewhere to stay near the 16MB limit.
parameters: {"layers":12}
MLP_MULT reduction
Reduced MLP width multiplier to free parameters for the extra layer.
parameters: {"mlp_mult":2.6}
Bigram embedding modification
Adjusted bigram vocabulary size and bigram embedding dimension to trade off capacity and parameter count.
parameters: {"bigram_vocab_size":2048,"bigram_dim":256}
Token embedding / VE dimension reduction
Reduced VE dimension to save parameters for the deeper model.
parameters: {"ve_dim":64}
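The four architecture changes above can be collected into a single configuration. The dict below only mirrors the `parameters` entries listed in this PR; any surrounding model code would need the actual field names from the repository, so treat this as an illustrative sketch rather than the PR's real config object:

```python
# Hyperparameters as listed in the PR; key names follow its "parameters" entries.
config = {
    "layers": 12,                # up from 11 in the baseline
    "mlp_mult": 2.6,             # reduced MLP width multiplier to fund the extra layer
    "bigram_vocab_size": 2048,   # adjusted bigram vocabulary
    "bigram_dim": 256,           # adjusted bigram embedding dimension
    "ve_dim": 64,                # reduced VE dimension
}
```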
Weight Averaging
EMA
parameters: null
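The PR lists EMA weight averaging with no parameters, so the decay value below is illustrative rather than taken from the PR. A minimal sketch of the technique: keep a shadow copy of the weights and blend the current weights into it after each optimizer step.

```python
def ema_update(ema_params, model_params, decay=0.9):
    """Blend current weights into the running average, in place.

    decay close to 1.0 means the average changes slowly; 0.9 here is
    an illustrative value, not the PR's setting.
    """
    for name, value in model_params.items():
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * value
    return ema_params

# Usage: update the shadow copy after each training step, then
# evaluate with ema_params instead of the raw weights.
ema = {"w": 0.0}
for step_weight in (1.0, 1.0):
    ema_update(ema, {"w": step_weight})
```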
Test-Time Training
LegalTTT
parameters: null
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: null
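Parallel Muon shards the Muon update across devices; the PR records no optimizer hyperparameters, so the following is only a minimal single-device sketch of the core Muon step (momentum accumulation followed by Newton-Schulz orthogonalization of the update), using NumPy. The quintic coefficients are the commonly used constants; function names, learning rate, and momentum value are illustrative assumptions, not taken from this PR.

```python
import numpy as np

def newton_schulz(G, steps=5, eps=1e-7):
    """Approximately orthogonalize a 2-D matrix via a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315   # standard quintic coefficients
    X = G / (np.linalg.norm(G) + eps)   # scale so the spectral norm is <= 1
    transpose = X.shape[0] > X.shape[1]
    if transpose:                        # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    return X.T if transpose else X

def muon_step(param, grad, momentum_buf, lr=0.02, beta=0.95):
    """One (non-parallel) Muon update for a 2-D weight matrix; values are illustrative."""
    momentum_buf[...] = beta * momentum_buf + grad
    param -= lr * newton_schulz(momentum_buf)
    return param
```

The orthogonalization is what distinguishes Muon from plain momentum SGD: it equalizes the singular values of the update before it is applied.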
Novel Contributions
- Extended the baseline GPT from 11 to 12 layers.
- Reduced MLP_MULT to reallocate parameters to depth.
- Modified bigram vocabulary size and bigram embedding dimension.
- Reduced token embedding dimension (VE_DIM) to fit within the 16MB budget.
- Reported multiple parameter trade-off experiments and their validation bpb results.