- val_bpb: 1.3346
- Architecture: NanoGPT
- Optimizer: Parallel Muon
- Artifact Size: ~15.9 MB
Training Techniques

Optimizer: Parallel Muon
- weight_decay: null
- momentum: null
- matrix_lr: 0.05
- muon_backend_steps: 6
- muon_momentum_warmup_steps: 300
- grad_clip_norm: 1
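The Muon-specific hyperparameters above can be sketched as two small helpers: a linear momentum warmup over the first 300 steps and global gradient-norm clipping at 1. The momentum target of 0.95 is an assumption, since the record leaves momentum null.

```python
import math

def muon_momentum(step, warmup_steps=300, target=0.95):
    # Linearly warm momentum from 0 to its target over the first
    # warmup_steps optimizer steps. The target value is an assumption;
    # the record leaves momentum unspecified.
    return target * min(step / warmup_steps, 1.0)

def clip_grad_norm(grads, max_norm=1.0):
    # Globally rescale a flat list of gradient values so their L2 norm
    # does not exceed max_norm (grad_clip_norm: 1).
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return grads
    scale = max_norm / norm
    return [g * scale for g in grads]
```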
LR Schedule: warmdown
- warmdown_iters: 900
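A minimal sketch of a warmdown schedule, assuming the common constant-then-linear-decay reading and reusing matrix_lr (0.05) as the base LR; both assumptions go beyond the recorded warmdown_iters=900.

```python
def warmdown_lr(step, total_iters, warmdown_iters=900, base_lr=0.05):
    # Constant LR for most of training, then a linear ramp to zero over
    # the final warmdown_iters steps. base_lr reuses matrix_lr=0.05 as
    # an illustrative default.
    if step < total_iters - warmdown_iters:
        return base_lr
    remaining = total_iters - step
    return base_lr * remaining / warmdown_iters
```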
Architecture

BigramHash: hash-based bigram feature component.
- size: 1536
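One way a hash-based bigram feature can work is to hash each (previous token, current token) pair into a fixed table of buckets, each indexing a learned embedding row that is added to the token embedding. Only the table size (1536) comes from the record; the mixing constants below are illustrative.

```python
def bigram_bucket(prev_token, token, table_size=1536):
    # Map a (prev_token, token) pair to one of table_size buckets.
    # Multiply-xor mixing is a common cheap hash; the constants here
    # are an assumption, not taken from the record.
    h = (prev_token * 1000003 + token) & 0xFFFFFFFF
    h ^= h >> 16
    return h % table_size
```

Collisions are tolerated: colliding bigrams simply share an embedding row, which the model can learn around.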
XSA: attention-related component applied to the last 4 layers.
- layers: 4
Partial RoPE: rotary positional embeddings applied to a subset of dimensions.
- dimensions: 16/64
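A sketch of partial RoPE on a single per-head vector, rotating only the first 16 of 64 dimensions and passing the rest through. The adjacent-pair rotation scheme and base 10000 are the usual RoPE conventions, assumed rather than taken from the record.

```python
import math

def partial_rope(x, position, rotary_dims=16, base=10000.0):
    # Rotate the first rotary_dims entries of head vector x in adjacent
    # pairs by a position-dependent angle; leave dims [rotary_dims:]
    # untouched ("16/64").
    out = list(x)
    for i in range(0, rotary_dims, 2):
        theta = position / (base ** (i / rotary_dims))
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + 1]
        out[i] = a * c - b * s
        out[i + 1] = a * s + b * c
    return out
```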
MLP3x: three-layer MLP stack with LeakyReLU-squared activation.
- layers: 3
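A pure-Python sketch of the three-layer stack with the squared activation between layers. "LeakyReLU(0.5)^2" is read literally as the square of LeakyReLU with negative slope 0.5; some squared activations preserve sign instead, which the record does not settle. Layer shapes and activation placement are assumptions.

```python
def leaky_relu_sq(x, slope=0.5):
    # Square of LeakyReLU with negative slope 0.5 (literal reading of
    # "LeakyReLU(0.5)^2"); note the square makes the output nonnegative.
    y = x if x > 0 else slope * x
    return y * y

def mlp3x(x, weights, biases):
    # Three stacked linear layers (plain-Python matvec) with the squared
    # activation after all but the last layer.
    for i, (W, b) in enumerate(zip(weights, biases)):
        x = [sum(w * v for w, v in zip(row, x)) + bi
             for row, bi in zip(W, b)]
        if i < len(weights) - 1:
            x = [leaky_relu_sq(v) for v in x]
    return x
```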
Regularization

layerwise LN scale
- formula: 1/sqrt(layer+1)
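The per-layer scale is a one-liner; damping deeper layers' residual contributions by multiplying the LayerNorm output is the assumed application point, since the record only gives the formula.

```python
import math

def ln_scale(layer):
    # 1/sqrt(layer+1) with 0-indexed layers: layer 0 keeps scale 1.0,
    # deeper layers are progressively damped.
    return 1.0 / math.sqrt(layer + 1)
```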
Weight Averaging

EMA + Tight SWA
- ema_decay: 0.997
- swa_every: 50
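A sketch combining an exponential moving average of the weights (decay 0.997) with an SWA snapshot every 50 steps. Keeping both running averages in one helper, and how the two are combined at eval time, are assumptions beyond the recorded hyperparameters.

```python
class AveragedWeights:
    # Tracks an EMA and a stochastic weight average over a flat list of
    # parameter values (stand-in for real tensors).
    def __init__(self, params, ema_decay=0.997, swa_every=50):
        self.ema = list(params)
        self.swa_sum = [0.0] * len(params)
        self.swa_count = 0
        self.decay = ema_decay
        self.every = swa_every
        self.step = 0

    def update(self, params):
        self.step += 1
        d = self.decay
        self.ema = [d * e + (1 - d) * p for e, p in zip(self.ema, params)]
        if self.step % self.every == 0:  # "tight" SWA: frequent snapshots
            self.swa_sum = [s + p for s, p in zip(self.swa_sum, params)]
            self.swa_count += 1

    def swa(self):
        return [s / self.swa_count for s in self.swa_sum]
```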
Quantization

GPTQ-lite
- bits: 6
- scope: model weights
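A minimal symmetric 6-bit round-to-nearest quantizer, showing only the storage format; the error-compensated rounding that a GPTQ-style method presumably adds under "GPTQ-lite" is omitted, and per-tensor scaling is an assumption.

```python
def quantize6(weights):
    # Symmetric 6-bit quantization to integer levels -31..31 with a
    # single per-tensor scale.
    scale = max(abs(w) for w in weights) / 31 or 1.0  # avoid /0 for all-zero
    q = [max(-31, min(31, round(w / scale))) for w in weights]
    return q, scale

def dequantize6(q, scale):
    return [v * scale for v in q]
```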
Compression

lzma
- level: null
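Packing the artifact with Python's stdlib lzma at the library's default preset (consistent with level: null). Using pickle as the serializer is an assumption for illustration.

```python
import lzma
import pickle

def pack(obj):
    # Serialize, then lzma-compress with the default preset.
    return lzma.compress(pickle.dumps(obj))

def unpack(blob):
    return pickle.loads(lzma.decompress(blob))
```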
Evaluation

sliding window eval
- stride: 128
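Sliding-window evaluation can be sketched as a window generator: each window re-encodes up to the full context length, but only the last stride=128 positions are newly scored, so every token gets long left context. The context length of 1024 is an assumption.

```python
def sliding_windows(n_tokens, context_len=1024, stride=128):
    # Yield (window_start, window_end, score_start) triples; tokens in
    # [score_start, window_end) are scored, the rest are context only.
    start = 0
    while start < n_tokens:
        end = min(start + stride, n_tokens)
        w_start = max(0, end - context_len)
        yield w_start, end, start
        start = end
```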
online n-gram cache eval
- ngram_max_n: 5
- ngram_lambda: 0.15
- confidence_threshold: 0.5
- min_count: 3
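A sketch of the online n-gram cache: counts are updated strictly causally (only with tokens already scored), prediction backs off from 5-grams downward, a context is used only when it has at least min_count observations and its top continuation is confident enough, and the interpolated log-probability is kept only when it beats the model alone. Interpreting confidence_threshold as the top continuation's share of the context's counts is an assumption; the hyperparameter values follow the record.

```python
import math
from collections import defaultdict

class NgramCache:
    def __init__(self, max_n=5, lam=0.15, conf=0.5, min_count=3):
        self.max_n, self.lam = max_n, lam
        self.conf, self.min_count = conf, min_count
        self.counts = defaultdict(lambda: defaultdict(int))
        self.history = []

    def observe(self, token):
        # Strictly causal: called only after `token` has been scored,
        # so no target ever informs its own prediction.
        for n in range(1, self.max_n):
            if len(self.history) >= n:
                ctx = tuple(self.history[-n:])
                self.counts[ctx][token] += 1
        self.history.append(token)

    def predict(self, target):
        # Back off from the longest context; use the first one that is
        # both well-observed and confident.
        for n in range(self.max_n - 1, 0, -1):
            if len(self.history) < n:
                continue
            dist = self.counts.get(tuple(self.history[-n:]))
            if not dist:
                continue
            total = sum(dist.values())
            if total >= self.min_count and max(dist.values()) / total >= self.conf:
                return dist.get(target, 0) / total
        return None

    def gated_logprob(self, model_logp, target):
        # Interpolate (1-lam)*p_model + lam*p_ngram in probability space,
        # then keep the mix only if it improves the known target's NLL.
        p_ng = self.predict(target)
        if p_ng is None or p_ng <= 0.0:
            return model_logp
        mixed = math.log((1 - self.lam) * math.exp(model_logp) + self.lam * p_ng)
        return max(mixed, model_logp)  # safety gate: never hurt NLL
```

Everything here is plain CPU-side bookkeeping, which is why the eval-time gain costs no GPU compute.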
Other

LeakyReLU(0.5)^2 activation in the MLP.
Novel Contributions
- 5-gram eval cache with confidence gating
- Strictly causal online n-gram language model built during evaluation
- Safety-gated log-sum-exp interpolation that only applies n-gram predictions when they improve NLL
- Parallel Muon tuning on baseline NanoGPT
- LeakyReLU squared MLP and other architecture refinements from the base record
- Eval-time improvement with zero GPU cost from CPU-side n-gram lookups