PR #1614

open

Non-record: CUDA port of PR #1612 recipe (H100 pending)

by seekerPriceView on GitHub
val_bpb
1.5096
Architecture
Transformer
Optimizer
Muon
Training Techniques

Architecture
depth recurrence
Reuses selected physical layers as virtual layers via a virtual-to-physical mapping.
parameters: {"layers":[3,4,5],"start_step":1500}
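A minimal sketch of how a virtual-to-physical mapping with parameters like `{"layers":[3,4,5],"start_step":1500}` could drive the forward pass. Function and variable names here are illustrative, not taken from the PR:

```python
def layer_schedule(n_physical, reuse_layers, start_step, step):
    """Return the sequence of physical layer indices to execute.

    Before start_step the schedule is the identity mapping (upstream
    train_gpt.py behavior); afterwards the selected physical layers
    run a second time as extra virtual layers at the top of the stack,
    adding depth without adding parameters.
    """
    schedule = list(range(n_physical))
    if step >= start_step:
        schedule.extend(reuse_layers)
    return schedule
```

With the listed parameters, a 12-layer model would execute 12 layers until step 1500 and 15 virtual layers afterwards, reusing physical layers 3, 4, and 5.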
parallel residuals
GPT-J style parallel attention and MLP branches from the same pre-residual input.
parameters: {"start_layer":7}
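As a sketch, the GPT-J parallel formulation replaces the sequential two-sublayer update with a single residual sum; `attn`, `mlp`, and `norm` below are stand-ins for the real modules, not the PR's code:

```python
import numpy as np

def sequential_block(x, attn, mlp, norm1, norm2):
    # Upstream GPT-2 style: the MLP sees the post-attention residual stream.
    x = x + attn(norm1(x))
    return x + mlp(norm2(x))

def parallel_block(x, attn, mlp, norm):
    # GPT-J style: both branches read the same normalized pre-residual
    # input, and their outputs are added back in one residual update.
    h = norm(x)
    return x + attn(h) + mlp(h)
```

Per the `start_layer` parameter above, the parallel form would apply only from layer 7 onward, with earlier layers keeping the sequential structure.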
Optimizer
Muon
weight_decay: null
momentum: 0.95
other_params: {"matrix_lr":0.02}
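Muon's matrix update combines heavy-ball momentum with an approximate orthogonalization of the momentum buffer. The sketch below uses the classical cubic Newton–Schulz iteration in NumPy rather than the tuned quintic GPU kernel real Muon implementations use, with the `matrix_lr=0.02` and `momentum=0.95` values listed above:

```python
import numpy as np

def orthogonalize(g, steps=8, eps=1e-7):
    # Cubic Newton-Schulz iteration: drives the singular values of g
    # toward 1, approximating the nearest semi-orthogonal matrix.
    x = g / (np.linalg.norm(g, 2) + eps)  # scale spectral norm to <= 1
    for _ in range(steps):
        x = 1.5 * x - 0.5 * (x @ x.T @ x)
    return x

def muon_step(weight, grad, buf, lr=0.02, momentum=0.95):
    # Accumulate momentum, then apply the orthogonalized direction.
    buf = momentum * buf + grad
    weight = weight - lr * orthogonalize(buf)
    return weight, buf
```

Only 2D weight matrices would be updated this way; embeddings and scalars typically go through a separate optimizer, which is consistent with `weight_decay: null` here.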
Regularization
weight decay
parameters: null

Novel Contributions

  • CUDA port of the PR #1612 recipe
  • Depth recurrence implemented with env-var-controlled virtual-to-physical layer mapping
  • Parallel residuals implemented as an opt-in GPT-J style block modification
  • Tuned hyperparameters transferred from the MLX companion submission
  • Backwards-compatible design: with no options set, behavior matches upstream train_gpt.py
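The env-var-controlled, backwards-compatible opt-in described above might be wired up as in this sketch; the variable names are placeholders, not the PR's actual flags:

```python
import os

def recurrence_config(env=os.environ):
    """Parse hypothetical depth-recurrence settings from the environment.

    Returns None when nothing is set, so the training loop falls back to
    upstream train_gpt.py behavior (the backwards-compatible default).
    """
    layers = env.get("DEPTH_RECURRENCE_LAYERS")  # e.g. "3,4,5" (placeholder name)
    if layers is None:
        return None
    return {
        "layers": [int(i) for i in layers.split(",")],
        "start_step": int(env.get("DEPTH_RECURRENCE_START_STEP", "1500")),
    }
```

Keeping the feature off unless explicitly requested is what lets the diff stay a no-op for existing runs.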