PR #1381

open

Cautious Muon + SP4096 + Depth Recurrence — val_bpb 1.1604 (non-record)

by X-Abhishek-X
val_bpb
1.1604
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,170,732 B

Training Techniques

Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"cautious_masking":true,"nesterov":true,"newton_schulz_steps":5,"muoneq_r":true}
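The two optimizer modifications above can be sketched as follows. This is a hedged reading, not the submission's code: cautious masking is assumed to zero update components whose sign disagrees with the raw gradient (with rescaling), and `muoneq_r` is assumed to mean per-row L2 normalization of the momentum buffer before the Newton-Schulz orthogonalization. Function names are illustrative.

```python
import numpy as np

def cautious_mask(update, grad, eps=1e-8):
    # Cautious masking (assumed form): keep only update entries whose sign
    # agrees with the gradient, then rescale to preserve the mean magnitude.
    mask = (update * grad > 0).astype(update.dtype)
    return update * mask * (mask.size / (mask.sum() + eps))

def row_normalize(g, eps=1e-8):
    # MuonEq-R-style step (assumed): scale each row to unit L2 norm before
    # the 5-step Newton-Schulz orthogonalization reported in the PR.
    return g / (np.linalg.norm(g, axis=1, keepdims=True) + eps)
```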
Architecture
depth recurrence
Adds virtual layers via recurrence on selected layers.
parameters: {"physical_layers":11,"virtual_layers":13,"recurrence_layers":[4,5],"start_step":3000}
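The parameters above are self-consistent: running blocks 4 and 5 a second time turns 11 physical layers into 13 virtual ones. A minimal sketch of that schedule, assuming each recurrence layer is applied exactly twice once training passes `start_step` (the `blocks` callables are placeholders):

```python
# Depth-recurrence schedule sketch: 11 physical blocks, blocks 4 and 5
# reused once each after start_step, yielding 13 virtual layers.
def forward(x, blocks, recurrence_layers=(4, 5), step=0, start_step=3000):
    virtual = 0
    for i, block in enumerate(blocks):
        x = block(x)
        virtual += 1
        if step >= start_step and i in recurrence_layers:
            x = block(x)  # reuse the same weights: one extra virtual layer
            virtual += 1
    return x, virtual
```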
parallel residuals
Uses separate attention and MLP residual lanes with a learnable merge.
parameters: {"start_layer":7}
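One plausible reading of "separate attention and MLP residual lanes with a learnable merge", sketched below with illustrative names (the PR does not specify the merge; a sigmoid-gated scalar is assumed here):

```python
import numpy as np

# Parallel-residual block sketch (from layer 7 onward in the submission):
# attention and MLP both read the same input, each keeps its own residual
# lane, and a learnable scalar gate merges the two lanes.
def parallel_residual_block(x, attn, mlp, merge_weight):
    attn_lane = x + attn(x)                  # attention residual lane
    mlp_lane = x + mlp(x)                    # MLP residual lane
    a = 1.0 / (1.0 + np.exp(-merge_weight))  # sigmoid gate in [0, 1]
    return a * attn_lane + (1.0 - a) * mlp_lane
```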
weight tying
Not mentioned explicitly in the submission.
parameters: null
Weight Averaging
EMA
parameters: {"decay":0.997}
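EMA weight averaging with the reported decay of 0.997 reduces to one update per step: `ema = 0.997 * ema + 0.003 * current`. A minimal sketch over a parameter dict:

```python
# EMA weight-averaging step with the submission's decay of 0.997.
def ema_update(ema_params, params, decay=0.997):
    for k in params:
        ema_params[k] = decay * ema_params[k] + (1.0 - decay) * params[k]
    return ema_params
```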
Quantization
GPTQ
bits: 6
scope: full model
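For intuition on the 6-bit value range, a plain symmetric round-to-nearest sketch is shown below. This is not GPTQ itself: GPTQ additionally performs error-compensating, column-by-column rounding against a Hessian estimate of the layer inputs.

```python
import numpy as np

# Symmetric 6-bit quantization sketch (illustrative, not the GPTQ algorithm).
def quantize_int6(w):
    qmax = 2 ** 5 - 1  # signed 6-bit range: [-32, 31]
    scale = np.abs(w).max() / qmax
    if scale == 0:
        scale = 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```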
Compression
lzma
level: null
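Since the compression level is null (unspecified) in the submission, the final artifact stage amounts to the stdlib default, sketched here:

```python
import lzma

# Final artifact compression: LZMA with the default preset, since the
# submission does not report a level.
def compress(blob: bytes, preset=None):
    return lzma.compress(blob, preset=preset)
```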
Other
other
SP4096 BPE tokenizer.
parameters: null
other
QK-Gain 5.0 per-head query-key scaling.
parameters: {"qk_gain":5}
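An assumed reading of QK-Gain, sketched for a single head: the query-key logits are multiplied by a gain of 5.0 on top of the usual 1/sqrt(d_head) scaling. The PR does not give the exact placement, so this is illustrative.

```python
import numpy as np

# QK-Gain sketch (per head): scale attention logits by qk_gain = 5.0
# in addition to the standard 1/sqrt(d_head) factor.
def attention_logits(q, k, qk_gain=5.0):
    d_head = q.shape[-1]
    return qk_gain * (q @ k.T) / np.sqrt(d_head)
```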

Novel Contributions

  • Cautious Muon masking applied to Muon updates
  • MuonEq-R row normalization before Newton-Schulz orthogonalization
  • Depth recurrence with 13 virtual layers from 11 physical layers
  • Parallel residual lanes from layer 7
  • SP4096 tokenizer integration
  • Full GPTQ INT6 quantization with LZMA compression
  • EMA 0.997 weight averaging