PR #1612

open

Non-record: MLX tuned hyperparameters — 1.5096 BPB local (H100 pending)

by seekerPriceView on GitHub
val_bpb: 1.5096
Architecture: Transformer
Optimizer: Muon
Artifact Size: 12.65 MB

Training Techniques

Optimizer: Muon
  weight_decay: null
  momentum: 0.95
  other_params: {"matrix_lr": 0.02, "muon_momentum_warmup_start": 0.9, "muon_momentum_warmup_steps": 60, "tied_embed_lr": 0.03, "scalar_lr": 0.02, "grad_clip_norm": 0.3}
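The `muon_momentum_warmup_*` entries imply a momentum schedule that ramps from 0.9 to the final momentum of 0.95 over 60 steps. A minimal sketch, assuming linear interpolation (the exact warmup shape is not recorded in the config):

```python
def muon_momentum(step, start=0.9, final=0.95, warmup_steps=60):
    """Warm up the Muon momentum coefficient over the first warmup_steps.

    Assumed schedule: linear interpolation from `start` to `final`,
    then held constant. Values taken from other_params above.
    """
    if step >= warmup_steps:
        return final
    return start + (final - start) * (step / warmup_steps)
```

The warmup start (0.9) and step count (60) come from `other_params`; only the interpolation shape is an assumption here.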
Architecture:
  Weight tying: tied embeddings are used in the model stack (parameters: null)
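Weight tying means one matrix serves as both the token embedding and the output (unembedding) projection. A toy sketch of the idea, with a hypothetical `TiedLM` class (everything beyond the tying itself is an assumption):

```python
import numpy as np

class TiedLM:
    """Toy LM head with tied embeddings: the same matrix embeds tokens
    and produces the output logits. Illustrative only; not the
    submission's actual model code."""

    def __init__(self, vocab, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.embed = rng.normal(size=(vocab, dim))  # single shared parameter

    def tokens_to_vectors(self, ids):
        # Input side: look up token embeddings.
        return self.embed[ids]

    def logits(self, h):
        # Output side: reuse the embedding matrix as the unembedding.
        return h @ self.embed.T
```

The tying halves the embedding parameter count, which also matters for the compressed artifact size reported above.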
  Depth recurrence: the model applies a block of layers recurrently (parameters: {"layers": [3, 4, 5]})
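The config records only that layers 3, 4, and 5 are involved in the recurrence, not the loop structure. One plausible reading is a looped contiguous block, sketched below (the loop count `n_loops=2` and the block-as-a-unit structure are assumptions):

```python
def run_stack(x, blocks, recur_start=3, recur_len=3, n_loops=2):
    """Apply a stack of layer functions with depth recurrence.

    Assumed scheme: blocks before the recurrent segment run once,
    the segment [recur_start, recur_start + recur_len) runs n_loops
    times as a unit, and the remaining blocks run once.
    """
    pre = blocks[:recur_start]
    mid = blocks[recur_start:recur_start + recur_len]
    post = blocks[recur_start + recur_len:]
    for b in pre:
        x = b(x)
    for _ in range(n_loops):          # reuse the same weights each pass
        for b in mid:
            x = b(x)
    for b in post:
        x = b(x)
    return x
```

Reusing the block's weights adds effective depth without adding parameters, consistent with the small artifact size.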
  Partial RoPE: only part of the head dimensions are rotated (parameters: {"dimensions": 16})
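With partial RoPE, the rotary position encoding is applied to only the first 16 head dimensions; the rest pass through unrotated. A NumPy sketch, assuming the usual paired-half rotation layout and base 10000 (neither is recorded in the config):

```python
import numpy as np

def partial_rope(x, positions, rot_dims=16, base=10000.0):
    """Rotate only the first rot_dims of the head dimension.

    x: (seq, head_dim) array; positions: (seq,) integer positions.
    Assumed convention: split the rotated slice into two halves and
    apply a per-pair 2D rotation, as in standard RoPE.
    """
    d = rot_dims // 2
    inv_freq = base ** (-np.arange(d) / d)            # (d,) frequencies
    ang = positions[:, None] * inv_freq[None, :]      # (seq, d) angles
    cos, sin = np.cos(ang), np.sin(ang)
    x_rot, x_pass = x[..., :rot_dims], x[..., rot_dims:]
    x1, x2 = x_rot[..., :d], x_rot[..., d:]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x_pass], axis=-1)  # unrotated dims pass through
```

Position 0 is left unchanged (all angles are zero), and the rotation preserves the norm of the rotated slice.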
Compression:
  Brotli (level: null)
Quantization:
  int6 (bits: 6, scope: model artifact)
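The artifact pipeline quantizes weights to 6 bits and then Brotli-compresses the result. A minimal sketch of one plausible int6 scheme, assuming symmetric per-tensor quantization to the range [-31, 31] (the actual scale granularity and range are not recorded; the Brotli step is omitted):

```python
import numpy as np

def quantize_int6(w):
    """Symmetric per-tensor 6-bit quantization (assumed scheme).

    Maps floats into the signed range [-31, 31] with a single scale;
    the int values are stored in int8 containers here for simplicity.
    """
    scale = float(np.abs(w).max()) / 31.0   # assumes w is not all zeros
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize_int6(q, scale):
    """Recover approximate float weights from the 6-bit codes."""
    return q.astype(np.float32) * scale
```

The worst-case rounding error of such a scheme is half the scale; in practice the codes would be bit-packed before the Brotli pass rather than kept one per byte.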

Novel Contributions

  • Pure hyperparameter tuning reduced local validation BPB by 0.0500, with the architecture and training config unchanged.
  • Tuned the matrix LR, Muon momentum, Muon momentum warmup start, and QK-Gain for small-batch MLX training.
  • Empirical A/B validation on MLX, comparing EXP-042 and EXP-048 over 5000-step runs.
  • Theoretical recommendations from three AI assistants (Claude, Gemini, Codex), each followed by local experimental confirmation.