PR #703

open

Record: PR549 + MiLe decay + 8-bit Muon + 1.04x LR + Cache+Backout — val_bpb 1.1176

by Gusanidas
val_bpb
1.1176
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
~15.95 MB

Training Techniques

Quantization
GPTQ
bits: 6
scope: all weights
int8
bits: 8
scope: Muon momentum buffers
Optimizer
Parallel Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"muon_momentum_warmup_start":0.92,"muon_momentum_warmup_steps":1500,"warmdown_iters":3500}
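The momentum warmup parameters above could be read as a ramp from 0.92 to the final momentum of 0.99 over the first 1500 steps. A minimal sketch, assuming a linear ramp (the summary does not state the interpolation shape) and a hypothetical helper name `muon_momentum`:

```python
def muon_momentum(step: int,
                  start: float = 0.92,
                  end: float = 0.99,
                  warmup_steps: int = 1500) -> float:
    """Momentum at a given step, assuming linear warmup from start to end."""
    # Fraction of the warmup completed, clamped to 1.0 once warmup is over.
    frac = min(step / warmup_steps, 1.0)
    return (1.0 - frac) * start + frac * end
```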
Weight Averaging
EMA
parameters: {"decay":0.997}
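For reference, an EMA with decay 0.997 is conventionally updated as below. This is a sketch of the standard rule on plain Python lists, not the PR's code; `ema_update` is a hypothetical name.

```python
def ema_update(avg: list[float], params: list[float],
               decay: float = 0.997) -> list[float]:
    """One EMA step: keep `decay` of the running average, blend in the rest."""
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg, params)]
```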
SWA
parameters: {"every":50}
Architecture
LeakyReLU²
MLP activation uses LeakyReLU(0.5) squared
parameters: {"slope":0.5}
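A scalar sketch of the activation as described, assuming the square is a plain elementwise square of the LeakyReLU output (so negative inputs produce positive outputs). The function name is illustrative.

```python
def leaky_relu_squared(x: float, slope: float = 0.5) -> float:
    """LeakyReLU with the given negative slope, followed by squaring."""
    y = x if x >= 0.0 else slope * x
    return y * y
```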
BigramHash
BigramHash embedding/component used in the model
parameters: {"size":1536}
XSA
XSA applied to the last 4 layers
parameters: {"layers":4}
Partial RoPE
Rotary positional embeddings applied to only 16 dimensions
parameters: {"dimensions":16}
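A rough sketch of partial rotary embeddings on a single vector: only the first `rot_dims` entries are rotated and the rest pass through unchanged. The adjacent-pair layout, the frequency base of 10000, and the choice of the leading dimensions are assumptions here; the summary only gives the dimension count.

```python
import math

def partial_rope(vec: list[float], pos: int,
                 rot_dims: int = 16, base: float = 10000.0) -> list[float]:
    """Rotate the first rot_dims entries pairwise by position-dependent angles."""
    out = list(vec)
    for i in range(0, rot_dims, 2):
        # Pair (i, i+1) uses frequency base^(-i / rot_dims).
        theta = pos * base ** (-i / rot_dims)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out
```

Because each pair undergoes a pure rotation, the norm of the rotated slice is preserved.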
LN Scale
LayerNorm scale set to 1/sqrt(layer+1)
parameters: null
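The scale rule can be written directly. This sketch assumes `layer` is a zero-based index, so the first layer gets a scale of 1.0; the summary does not say which indexing the PR uses.

```python
import math

def ln_scale(layer: int) -> float:
    """Per-layer LayerNorm scale 1/sqrt(layer + 1), with layer zero-based."""
    return 1.0 / math.sqrt(layer + 1)
```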
VE128
VE enabled in layers 9-10 with dimension 128
parameters: {"layers":[9,10],"dimension":128}
Cache+Backout
Caches hidden states after layer 7; later attention reads from cached clean context and applies a learned backout term
parameters: {"cache_after_layer":7,"backout_init":0.1}
Other
MiLe loss
Entropy-weighted token loss that decays to standard cross-entropy during warmdown
parameters: {"gamma":1.1}
LR Schedule
warmdown
parameters: {"warmdown_iters":3500,"lr_multiplier":1.04}
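One plausible reading of these parameters is a constant learning rate at 1.04x the base value, followed by a decay over the final 3500 iterations. The sketch below assumes a linear decay to zero; `base_lr` and `total_steps` are hypothetical inputs not given in the summary.

```python
def lr_at(step: int, total_steps: int, base_lr: float,
          warmdown_iters: int = 3500, lr_multiplier: float = 1.04) -> float:
    """Learning rate at `step`: boosted plateau, then linear warmdown to zero."""
    peak = base_lr * lr_multiplier
    remaining = total_steps - step
    if remaining >= warmdown_iters:
        return peak  # still on the plateau
    return peak * max(remaining, 0) / warmdown_iters
```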
Regularization
layerwise LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
Evaluation
sliding window eval
parameters: {"stride":64}
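Sliding-window evaluation with stride 64 typically means each window advances by 64 tokens and only the tokens not yet scored contribute to the loss, so every token is scored once with as much left context as the window allows. A sketch of the window bookkeeping, assuming `stride <= context_len`; `context_len` and the tuple layout are illustrative choices.

```python
def sliding_windows(n_tokens: int, context_len: int,
                    stride: int = 64) -> list[tuple[int, int, int]]:
    """Return (start, end, score_from) triples.

    The model sees tokens[start:end]; only positions in [score_from, end)
    are scored, so each token is counted exactly once.
    """
    windows = []
    scored_upto = 0  # tokens [0, scored_upto) are already scored
    start = 0
    while scored_upto < n_tokens:
        end = min(start + context_len, n_tokens)
        windows.append((start, end, scored_upto))
        scored_upto = end
        start += stride
    return windows
```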

Novel Contributions

  • MiLe loss with entropy-weighted token loss and decay during warmdown
  • Blockwise symmetric int8 quantization of Muon momentum buffers
  • 1.04x learning-rate boost
  • Cache+Backout mechanism using hidden states cached after layer 7 and a learned backout scalar
  • Full Hessian GPTQ with Hessian-based column ordering and Cholesky error compensation
  • GPTQ quantization adapted for banked weights by collecting Hessians on a temporary unbanked model
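To illustrate the blockwise symmetric int8 quantization listed above for the Muon momentum buffers, here is a minimal sketch on a flat list of floats. Each block stores one scale (absmax / 127) and signed codes in [-127, 127], with no zero point. The block size of 4 and both function names are hypothetical; the PR's actual block size and storage layout are not given in this summary.

```python
def quantize_blockwise_int8(values: list[float], block_size: int = 4):
    """Symmetric per-block quantization: codes in [-127, 127], one scale per block."""
    codes, scales = [], []
    for i in range(0, len(values), block_size):
        block = values[i:i + block_size]
        absmax = max(abs(v) for v in block)
        # An all-zero block gets an arbitrary scale of 1.0 to avoid dividing by zero.
        scale = absmax / 127.0 if absmax > 0 else 1.0
        scales.append(scale)
        codes.extend(max(-127, min(127, round(v / scale))) for v in block)
    return codes, scales

def dequantize_blockwise_int8(codes: list[int], scales: list[float],
                              block_size: int = 4) -> list[float]:
    """Reconstruct floats by multiplying each code by its block's scale."""
    return [c * scales[i // block_size] for i, c in enumerate(codes)]
```

Per-block scales keep the rounding error of each element within half a quantization step of its own block, so one large entry does not degrade precision elsewhere in the buffer.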