| val_bpb | Architecture | Optimizer | Artifact Size |
| --- | --- | --- | --- |
| 1.5096 | Transformer | Muon | — |
## Training Techniques
### Architecture

- **depth recurrence**: Reuses selected physical layers as virtual layers via a virtual-to-physical mapping.
  - parameters: `{"layers": [3, 4, 5], "start_step": 1500}`
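The mapping can be sketched as a schedule of physical layer indices to execute: before `start_step` the model runs each physical layer once, and afterwards the selected layers are run again as virtual layers. This is a minimal illustration; the helper name and the placement of the recurred layers (appended at the end of the stack) are assumptions, not the submission's actual implementation.

```python
from typing import List

def build_layer_schedule(n_physical: int, recur_layers: List[int],
                         step: int, start_step: int) -> List[int]:
    """Return the sequence of physical layer indices to execute.

    Hypothetical sketch: before start_step, the schedule is the identity
    mapping; afterwards, the chosen physical layers are reused as extra
    virtual layers, so depth grows without adding parameters.
    """
    schedule = list(range(n_physical))
    if step >= start_step:
        schedule.extend(recur_layers)  # e.g. reuse layers 3, 4, 5
    return schedule
```

With the listed parameters, an 8-layer model would run 11 layer applications per forward pass once step 1500 is reached, while its parameter count stays that of 8 layers.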
- **parallel residuals**: GPT-J style parallel attention and MLP branches computed from the same pre-residual input.
  - parameters: `{"start_layer": 7}`
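A GPT-J style parallel block normalizes the input once and feeds the same tensor to both the attention and MLP branches, summing their outputs into a single residual update (instead of the sequential attention-then-MLP pattern). The sketch below uses stock PyTorch modules and assumed shapes; it is not the submission's block implementation.

```python
import torch
import torch.nn as nn

class ParallelBlock(nn.Module):
    """GPT-J style block: attention and MLP read the same normalized
    pre-residual input, and their outputs are summed into one residual
    update. Illustrative sketch, not the submission's code."""

    def __init__(self, d_model: int, n_head: int):
        super().__init__()
        self.ln = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.ln(x)  # single pre-norm shared by both branches
        a, _ = self.attn(h, h, h, need_weights=False)
        return x + a + self.mlp(h)  # parallel, not sequential, residual
```

Per the `start_layer` parameter, blocks before index 7 would keep the standard sequential layout and later blocks would use this parallel form.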
### Optimizer

- **Muon**
  - weight_decay: null
  - momentum: 0.95
  - other_params: `{"matrix_lr": 0.02}`
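Muon maintains a momentum buffer per 2-D weight matrix and approximately orthogonalizes it with a quintic Newton-Schulz iteration before applying it, so the update has near-uniform singular values. The sketch below uses the coefficients from the public Muon reference implementation and the hyperparameters listed above (momentum 0.95, matrix_lr 0.02, no weight decay); it is a plain-PyTorch illustration, not the submission's tuned CUDA path.

```python
import torch

def zeropower_via_newtonschulz(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize G via a quintic Newton-Schulz iteration.
    Coefficients follow the public Muon reference implementation."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)  # scale so the spectral norm is at most ~1
    transposed = G.size(0) > G.size(1)
    if transposed:
        X = X.T  # iterate on the wide orientation for a smaller Gram matrix
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

def muon_step(p: torch.Tensor, grad: torch.Tensor, buf: torch.Tensor,
              lr: float = 0.02, momentum: float = 0.95) -> None:
    """One Muon update for a 2-D weight (sketch). No weight decay, per
    the configuration above."""
    buf.mul_(momentum).add_(grad)             # momentum accumulation
    update = zeropower_via_newtonschulz(buf)  # orthogonalize the buffer
    p.add_(update, alpha=-lr)                 # descent step at matrix_lr
```

In Muon-based speedrun recipes, this update is typically applied only to hidden matrix parameters, with embeddings and scalar/vector parameters handled by a separate optimizer.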
### Regularization

- **weight decay**
  - parameters: null
## Novel Contributions
- CUDA port of the PR #1612 recipe
- Depth recurrence implemented with env-var-controlled virtual-to-physical layer mapping
- Parallel residuals implemented as an opt-in GPT-J style block modification
- Tuned hyperparameters transferred from the MLX companion submission
- Backwards-compatible design: default behavior matches upstream `train_gpt.py`
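The env-var-controlled, backwards-compatible opt-in could look like the sketch below: with the variables unset, no recurrence config is produced and training follows the upstream path. The variable names and the comma-separated format are illustrative assumptions, not the submission's actual interface.

```python
import os
from typing import Optional

def parse_recurrence_env() -> Optional[dict]:
    """Read an opt-in depth-recurrence config from the environment.

    Hypothetical sketch: variable names and format are assumptions.
    Returns None when unset, so the default run matches upstream
    train_gpt.py exactly.
    """
    raw = os.environ.get("DEPTH_RECURRENCE_LAYERS")  # e.g. "3,4,5"
    if raw is None:
        return None  # default: identical to upstream behavior
    return {
        "layers": [int(s) for s in raw.split(",")],
        "start_step": int(os.environ.get("DEPTH_RECURRENCE_START", "1500")),
    }
```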