PR #1612

open

Non-record: MLX tuned hyperparameters — 1.5096 BPB local (H100 pending)

by seekerPriceView on GitHub
val_bpb: 1.5096
Architecture: Transformer
Optimizer: Muon
Artifact Size: 12.65 MB

Training Techniques

Optimizer: Muon
  weight_decay: null
  momentum: 0.95
  other_params: {"matrix_lr": 0.02, "muon_momentum_warmup_start": 0.9, "muon_momentum_warmup_steps": 60, "tied_embed_lr": 0.03, "scalar_lr": 0.02, "grad_clip_norm": 0.3}
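The `muon_momentum_warmup_*` entries imply a momentum schedule that ramps from 0.9 to the final momentum of 0.95 over 60 steps. A minimal sketch, assuming linear interpolation (the exact warmup shape is not recorded in the config):

```python
def muon_momentum(step, start=0.9, final=0.95, warmup_steps=60):
    """Warm up the Muon momentum coefficient over the first warmup_steps.

    Assumed schedule: linear interpolation from `start` to `final`,
    then held constant. Values taken from other_params above.
    """
    if step >= warmup_steps:
        return final
    return start + (final - start) * (step / warmup_steps)
```

The warmup start (0.9) and step count (60) come from `other_params`; only the interpolation shape is an assumption here.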
Architecture:
  Weight tying: tied embeddings are used in the model stack (parameters: null)
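Weight tying means one matrix serves as both the token embedding and the output (unembedding) projection. A toy sketch of the idea, with a hypothetical `TiedLM` class (everything beyond the tying itself is an assumption):

```python
import numpy as np

class TiedLM:
    """Toy LM head with tied embeddings: the same matrix embeds tokens
    and produces the output logits. Illustrative only; not the
    submission's actual model code."""

    def __init__(self, vocab, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.embed = rng.normal(size=(vocab, dim))  # single shared parameter

    def tokens_to_vectors(self, ids):
        # Input side: look up token embeddings.
        return self.embed[ids]

    def logits(self, h):
        # Output side: reuse the embedding matrix as the unembedding.
        return h @ self.embed.T
```

The tying halves the embedding parameter count, which also matters for the compressed artifact size reported above.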
  Depth recurrence: the model applies a block of layers recurrently (parameters: {"layers": [3, 4, 5]})
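The config records only that layers 3, 4, and 5 are involved in the recurrence, not the loop structure. One plausible reading is a looped contiguous block, sketched below (the loop count `n_loops=2` and the block-as-a-unit structure are assumptions):

```python
def run_stack(x, blocks, recur_start=3, recur_len=3, n_loops=2):
    """Apply a stack of layer functions with depth recurrence.

    Assumed scheme: blocks before the recurrent segment run once,
    the segment [recur_start, recur_start + recur_len) runs n_loops
    times as a unit, and the remaining blocks run once.
    """
    pre = blocks[:recur_start]
    mid = blocks[recur_start:recur_start + recur_len]
    post = blocks[recur_start + recur_len:]
    for b in pre:
        x = b(x)
    for _ in range(n_loops):          # reuse the same weights each pass
        for b in mid:
            x = b(x)
    for b in post:
        x = b(x)
    return x
```

Reusing the block's weights adds effective depth without adding parameters, consistent with the small artifact size.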
  Partial RoPE: only part of the head dimensions are rotated (parameters: {"dimensions": 16})
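With partial RoPE, the rotary position encoding is applied to only the first 16 head dimensions; the rest pass through unrotated. A NumPy sketch, assuming the usual paired-half rotation layout and base 10000 (neither is recorded in the config):

```python
import numpy as np

def partial_rope(x, positions, rot_dims=16, base=10000.0):
    """Rotate only the first rot_dims of the head dimension.

    x: (seq, head_dim) array; positions: (seq,) integer positions.
    Assumed convention: split the rotated slice into two halves and
    apply a per-pair 2D rotation, as in standard RoPE.
    """
    d = rot_dims // 2
    inv_freq = base ** (-np.arange(d) / d)            # (d,) frequencies
    ang = positions[:, None] * inv_freq[None, :]      # (seq, d) angles
    cos, sin = np.cos(ang), np.sin(ang)
    x_rot, x_pass = x[..., :rot_dims], x[..., rot_dims:]
    x1, x2 = x_rot[..., :d], x_rot[..., d:]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x_pass], axis=-1)  # unrotated dims pass through
```

Position 0 is left unchanged (all angles are zero), and the rotation preserves the norm of the rotated slice.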
Compression:
  Brotli (level: null)
Quantization:
  int6 (bits: 6, scope: model artifact)
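The artifact pipeline quantizes weights to 6 bits and then Brotli-compresses the result. A minimal sketch of one plausible int6 scheme, assuming symmetric per-tensor quantization to the range [-31, 31] (the actual scale granularity and range are not recorded; the Brotli step is omitted):

```python
import numpy as np

def quantize_int6(w):
    """Symmetric per-tensor 6-bit quantization (assumed scheme).

    Maps floats into the signed range [-31, 31] with a single scale;
    the int values are stored in int8 containers here for simplicity.
    """
    scale = float(np.abs(w).max()) / 31.0   # assumes w is not all zeros
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize_int6(q, scale):
    """Recover approximate float weights from the 6-bit codes."""
    return q.astype(np.float32) * scale
```

The worst-case rounding error of such a scheme is half the scale; in practice the codes would be bit-packed before the Brotli pass rather than kept one per byte.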

Novel Contributions

  • Pure hyperparameter tuning reduced local validation BPB by 0.0500, with the architecture and training config unchanged.
  • Tuned the matrix LR, Muon momentum, Muon momentum warmup start, and QK-Gain for small-batch MLX training.
  • Empirical A/B validation on MLX, comparing EXP-042 and EXP-048 over 5000-step runs.
  • Theoretical recommendations from three AI assistants (Claude, Gemini, Codex), each followed by local experimental confirmation.