| val_bpb | Architecture | Optimizer | Artifact Size |
| --- | --- | --- | --- |
| 1.5096 | Transformer | Muon | — |
## Training Techniques
### Architecture

- **depth recurrence**: Reuses selected physical layers as virtual layers via a virtual-to-physical mapping.
  - parameters: `{"layers": [3, 4, 5], "start_step": 1500}`
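The mapping can be sketched as a schedule of physical layer indices to execute: before `start_step` the model runs each physical layer once, and afterwards the selected layers are run again as virtual layers. This is a minimal illustration; the helper name and the placement of the recurred layers (appended at the end of the stack) are assumptions, not the submission's actual implementation.

```python
from typing import List

def build_layer_schedule(n_physical: int, recur_layers: List[int],
                         step: int, start_step: int) -> List[int]:
    """Return the sequence of physical layer indices to execute.

    Hypothetical sketch: before start_step, the schedule is the identity
    mapping; afterwards, the chosen physical layers are reused as extra
    virtual layers, so depth grows without adding parameters.
    """
    schedule = list(range(n_physical))
    if step >= start_step:
        schedule.extend(recur_layers)  # e.g. reuse layers 3, 4, 5
    return schedule
```

With the listed parameters, an 8-layer model would run 11 layer applications per forward pass once step 1500 is reached, while its parameter count stays that of 8 layers.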
- **parallel residuals**: GPT-J style parallel attention and MLP branches computed from the same pre-residual input.
  - parameters: `{"start_layer": 7}`
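A GPT-J style parallel block normalizes the input once and feeds the same tensor to both the attention and MLP branches, summing their outputs into a single residual update (instead of the sequential attention-then-MLP pattern). The sketch below uses stock PyTorch modules and assumed shapes; it is not the submission's block implementation.

```python
import torch
import torch.nn as nn

class ParallelBlock(nn.Module):
    """GPT-J style block: attention and MLP read the same normalized
    pre-residual input, and their outputs are summed into one residual
    update. Illustrative sketch, not the submission's code."""

    def __init__(self, d_model: int, n_head: int):
        super().__init__()
        self.ln = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.ln(x)  # single pre-norm shared by both branches
        a, _ = self.attn(h, h, h, need_weights=False)
        return x + a + self.mlp(h)  # parallel, not sequential, residual
```

Per the `start_layer` parameter, blocks before index 7 would keep the standard sequential layout and later blocks would use this parallel form.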
### Optimizer

- **Muon**
  - weight_decay: null
  - momentum: 0.95
  - other_params: `{"matrix_lr": 0.02}`
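Muon maintains a momentum buffer per 2-D weight matrix and approximately orthogonalizes it with a quintic Newton-Schulz iteration before applying it, so the update has near-uniform singular values. The sketch below uses the coefficients from the public Muon reference implementation and the hyperparameters listed above (momentum 0.95, matrix_lr 0.02, no weight decay); it is a plain-PyTorch illustration, not the submission's tuned CUDA path.

```python
import torch

def zeropower_via_newtonschulz(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize G via a quintic Newton-Schulz iteration.
    Coefficients follow the public Muon reference implementation."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)  # scale so the spectral norm is at most ~1
    transposed = G.size(0) > G.size(1)
    if transposed:
        X = X.T  # iterate on the wide orientation for a smaller Gram matrix
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

def muon_step(p: torch.Tensor, grad: torch.Tensor, buf: torch.Tensor,
              lr: float = 0.02, momentum: float = 0.95) -> None:
    """One Muon update for a 2-D weight (sketch). No weight decay, per
    the configuration above."""
    buf.mul_(momentum).add_(grad)             # momentum accumulation
    update = zeropower_via_newtonschulz(buf)  # orthogonalize the buffer
    p.add_(update, alpha=-lr)                 # descent step at matrix_lr
```

In Muon-based speedrun recipes, this update is typically applied only to hidden matrix parameters, with embeddings and scalar/vector parameters handled by a separate optimizer.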
### Regularization

- **weight decay**
  - parameters: null
## Novel Contributions
- CUDA port of the PR #1612 recipe
- Depth recurrence implemented with env-var-controlled virtual-to-physical layer mapping
- Parallel residuals implemented as an opt-in GPT-J style block modification
- Tuned hyperparameters transferred from the MLX companion submission
- Backwards-compatible design: default behavior matches upstream `train_gpt.py`
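The env-var-controlled, backwards-compatible opt-in could look like the sketch below: with the variables unset, no recurrence config is produced and training follows the upstream path. The variable names and the comma-separated format are illustrative assumptions, not the submission's actual interface.

```python
import os
from typing import Optional

def parse_recurrence_env() -> Optional[dict]:
    """Read an opt-in depth-recurrence config from the environment.

    Hypothetical sketch: variable names and format are assumptions.
    Returns None when unset, so the default run matches upstream
    train_gpt.py exactly.
    """
    raw = os.environ.get("DEPTH_RECURRENCE_LAYERS")  # e.g. "3,4,5"
    if raw is None:
        return None  # default: identical to upstream behavior
    return {
        "layers": [int(s) for s in raw.split(",")],
        "start_step": int(os.environ.get("DEPTH_RECURRENCE_START", "1500")),
    }
```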