- val_bpb: 1.3346
- Architecture: NanoGPT
- Optimizer: Parallel Muon
- Artifact Size: ~15.9 MB
Training Techniques

Optimizer: Parallel Muon
- weight_decay: null
- momentum: null
- matrix_lr: 0.05
- muon_backend_steps: 6
- muon_momentum_warmup_steps: 300
- grad_clip_norm: 1
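The Muon-specific hyperparameters above can be sketched as two small helpers: a linear momentum warmup over the first 300 steps and global gradient-norm clipping at 1. The momentum target of 0.95 is an assumption, since the record leaves momentum null.

```python
import math

def muon_momentum(step, warmup_steps=300, target=0.95):
    # Linearly warm momentum from 0 to its target over the first
    # warmup_steps optimizer steps. The target value is an assumption;
    # the record leaves momentum unspecified.
    return target * min(step / warmup_steps, 1.0)

def clip_grad_norm(grads, max_norm=1.0):
    # Globally rescale a flat list of gradient values so their L2 norm
    # does not exceed max_norm (grad_clip_norm: 1).
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return grads
    scale = max_norm / norm
    return [g * scale for g in grads]
```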
LR Schedule: warmdown
- warmdown_iters: 900
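A minimal sketch of a warmdown schedule, assuming the common constant-then-linear-decay reading and reusing matrix_lr (0.05) as the base LR; both assumptions go beyond the recorded warmdown_iters=900.

```python
def warmdown_lr(step, total_iters, warmdown_iters=900, base_lr=0.05):
    # Constant LR for most of training, then a linear ramp to zero over
    # the final warmdown_iters steps. base_lr reuses matrix_lr=0.05 as
    # an illustrative default.
    if step < total_iters - warmdown_iters:
        return base_lr
    remaining = total_iters - step
    return base_lr * remaining / warmdown_iters
```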
Architecture

BigramHash: hash-based bigram feature component.
- size: 1536
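One way a hash-based bigram feature can work is to hash each (previous token, current token) pair into a fixed table of buckets, each indexing a learned embedding row that is added to the token embedding. Only the table size (1536) comes from the record; the mixing constants below are illustrative.

```python
def bigram_bucket(prev_token, token, table_size=1536):
    # Map a (prev_token, token) pair to one of table_size buckets.
    # Multiply-xor mixing is a common cheap hash; the constants here
    # are an assumption, not taken from the record.
    h = (prev_token * 1000003 + token) & 0xFFFFFFFF
    h ^= h >> 16
    return h % table_size
```

Collisions are tolerated: colliding bigrams simply share an embedding row, which the model can learn around.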
XSA: attention-related component applied to the last 4 layers.
- layers: 4
Partial RoPE: rotary positional embeddings applied to a subset of dimensions.
- dimensions: 16/64
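A sketch of partial RoPE on a single per-head vector, rotating only the first 16 of 64 dimensions and passing the rest through. The adjacent-pair rotation scheme and base 10000 are the usual RoPE conventions, assumed rather than taken from the record.

```python
import math

def partial_rope(x, position, rotary_dims=16, base=10000.0):
    # Rotate the first rotary_dims entries of head vector x in adjacent
    # pairs by a position-dependent angle; leave dims [rotary_dims:]
    # untouched ("16/64").
    out = list(x)
    for i in range(0, rotary_dims, 2):
        theta = position / (base ** (i / rotary_dims))
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + 1]
        out[i] = a * c - b * s
        out[i + 1] = a * s + b * c
    return out
```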
MLP3x: three-layer MLP stack with LeakyReLU-squared activation.
- layers: 3
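A pure-Python sketch of the three-layer stack with the squared activation between layers. "LeakyReLU(0.5)^2" is read literally as the square of LeakyReLU with negative slope 0.5; some squared activations preserve sign instead, which the record does not settle. Layer shapes and activation placement are assumptions.

```python
def leaky_relu_sq(x, slope=0.5):
    # Square of LeakyReLU with negative slope 0.5 (literal reading of
    # "LeakyReLU(0.5)^2"); note the square makes the output nonnegative.
    y = x if x > 0 else slope * x
    return y * y

def mlp3x(x, weights, biases):
    # Three stacked linear layers (plain-Python matvec) with the squared
    # activation after all but the last layer.
    for i, (W, b) in enumerate(zip(weights, biases)):
        x = [sum(w * v for w, v in zip(row, x)) + bi
             for row, bi in zip(W, b)]
        if i < len(weights) - 1:
            x = [leaky_relu_sq(v) for v in x]
    return x
```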
Regularization

layerwise LN scale
- formula: 1/sqrt(layer+1)
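The per-layer scale is a one-liner; damping deeper layers' residual contributions by multiplying the LayerNorm output is the assumed application point, since the record only gives the formula.

```python
import math

def ln_scale(layer):
    # 1/sqrt(layer+1) with 0-indexed layers: layer 0 keeps scale 1.0,
    # deeper layers are progressively damped.
    return 1.0 / math.sqrt(layer + 1)
```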
Weight Averaging

EMA + Tight SWA
- ema_decay: 0.997
- swa_every: 50
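A sketch combining an exponential moving average of the weights (decay 0.997) with an SWA snapshot every 50 steps. Keeping both running averages in one helper, and how the two are combined at eval time, are assumptions beyond the recorded hyperparameters.

```python
class AveragedWeights:
    # Tracks an EMA and a stochastic weight average over a flat list of
    # parameter values (stand-in for real tensors).
    def __init__(self, params, ema_decay=0.997, swa_every=50):
        self.ema = list(params)
        self.swa_sum = [0.0] * len(params)
        self.swa_count = 0
        self.decay = ema_decay
        self.every = swa_every
        self.step = 0

    def update(self, params):
        self.step += 1
        d = self.decay
        self.ema = [d * e + (1 - d) * p for e, p in zip(self.ema, params)]
        if self.step % self.every == 0:  # "tight" SWA: frequent snapshots
            self.swa_sum = [s + p for s, p in zip(self.swa_sum, params)]
            self.swa_count += 1

    def swa(self):
        return [s / self.swa_count for s in self.swa_sum]
```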
Quantization

GPTQ-lite
- bits: 6
- scope: model weights
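A minimal symmetric 6-bit round-to-nearest quantizer, showing only the storage format; the error-compensated rounding that a GPTQ-style method presumably adds under "GPTQ-lite" is omitted, and per-tensor scaling is an assumption.

```python
def quantize6(weights):
    # Symmetric 6-bit quantization to integer levels -31..31 with a
    # single per-tensor scale.
    scale = max(abs(w) for w in weights) / 31 or 1.0  # avoid /0 for all-zero
    q = [max(-31, min(31, round(w / scale))) for w in weights]
    return q, scale

def dequantize6(q, scale):
    return [v * scale for v in q]
```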
Compression

lzma
- level: null
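Packing the artifact with Python's stdlib lzma at the library's default preset (consistent with level: null). Using pickle as the serializer is an assumption for illustration.

```python
import lzma
import pickle

def pack(obj):
    # Serialize, then lzma-compress with the default preset.
    return lzma.compress(pickle.dumps(obj))

def unpack(blob):
    return pickle.loads(lzma.decompress(blob))
```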
Evaluation

sliding window eval
- stride: 128
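Sliding-window evaluation can be sketched as a window generator: each window re-encodes up to the full context length, but only the last stride=128 positions are newly scored, so every token gets long left context. The context length of 1024 is an assumption.

```python
def sliding_windows(n_tokens, context_len=1024, stride=128):
    # Yield (window_start, window_end, score_start) triples; tokens in
    # [score_start, window_end) are scored, the rest are context only.
    start = 0
    while start < n_tokens:
        end = min(start + stride, n_tokens)
        w_start = max(0, end - context_len)
        yield w_start, end, start
        start = end
```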
online n-gram cache eval
- ngram_max_n: 5
- ngram_lambda: 0.15
- confidence_threshold: 0.5
- min_count: 3
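A sketch of the online n-gram cache: counts are updated strictly causally (only with tokens already scored), prediction backs off from 5-grams downward, a context is used only when it has at least min_count observations and its top continuation is confident enough, and the interpolated log-probability is kept only when it beats the model alone. Interpreting confidence_threshold as the top continuation's share of the context's counts is an assumption; the hyperparameter values follow the record.

```python
import math
from collections import defaultdict

class NgramCache:
    def __init__(self, max_n=5, lam=0.15, conf=0.5, min_count=3):
        self.max_n, self.lam = max_n, lam
        self.conf, self.min_count = conf, min_count
        self.counts = defaultdict(lambda: defaultdict(int))
        self.history = []

    def observe(self, token):
        # Strictly causal: called only after `token` has been scored,
        # so no target ever informs its own prediction.
        for n in range(1, self.max_n):
            if len(self.history) >= n:
                ctx = tuple(self.history[-n:])
                self.counts[ctx][token] += 1
        self.history.append(token)

    def predict(self, target):
        # Back off from the longest context; use the first one that is
        # both well-observed and confident.
        for n in range(self.max_n - 1, 0, -1):
            if len(self.history) < n:
                continue
            dist = self.counts.get(tuple(self.history[-n:]))
            if not dist:
                continue
            total = sum(dist.values())
            if total >= self.min_count and max(dist.values()) / total >= self.conf:
                return dist.get(target, 0) / total
        return None

    def gated_logprob(self, model_logp, target):
        # Interpolate (1-lam)*p_model + lam*p_ngram in probability space,
        # then keep the mix only if it improves the known target's NLL.
        p_ng = self.predict(target)
        if p_ng is None or p_ng <= 0.0:
            return model_logp
        mixed = math.log((1 - self.lam) * math.exp(model_logp) + self.lam * p_ng)
        return max(mixed, model_logp)  # safety gate: never hurt NLL
```

Everything here is plain CPU-side bookkeeping, which is why the eval-time gain costs no GPU compute.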
Other

LeakyReLU(0.5)^2 activation in the MLP.
Novel Contributions
- 5-gram eval cache with confidence gating
- Strictly causal online n-gram language model built during evaluation
- Safety-gated log-sum-exp interpolation that only applies n-gram predictions when they improve NLL
- Parallel Muon tuning on baseline NanoGPT
- LeakyReLU squared MLP and other architecture refinements from the base record
- Eval-time improvement with zero GPU cost from CPU-side n-gram lookups