PR #1381
Cautious Muon + SP4096 + Depth Recurrence — val_bpb 1.1604 (non-record)
by X-Abhishek-X
val_bpb
1.1604
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,170,732 B
Training Techniques
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"cautious_masking":true,"nesterov":true,"newton_schulz_steps":5,"muoneq_r":true}
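A minimal sketch of the two optimizer tweaks named above. All function names here are hypothetical; the cautious mask follows the usual "zero entries whose sign disagrees with the gradient, then rescale" recipe, `row_normalize` is an assumed reading of MuonEq-R (unit-L2 rows before orthogonalization), and the Newton-Schulz loop is the classical cubic iteration rather than Muon's tuned quintic:

```python
import math

def matmul(a, b):
    # naive dense matrix multiply for small demo matrices
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(col) for col in zip(*m)]

def row_normalize(m, eps=1e-8):
    # assumed MuonEq-R step: scale each row to unit L2 norm
    # before the Newton-Schulz orthogonalization
    return [[x / (math.sqrt(sum(v * v for v in row)) + eps) for x in row]
            for row in m]

def newton_schulz(x, steps=5):
    # classical cubic Newton-Schulz orthogonalization (sketch only;
    # Muon itself uses a tuned quintic iteration)
    for _ in range(steps):
        xxt = matmul(x, transpose(x))
        x = [[1.5 * x[i][j]
              - 0.5 * sum(xxt[i][k] * x[k][j] for k in range(len(x)))
              for j in range(len(x[0]))] for i in range(len(x))]
    return x

def cautious_mask(update, grad):
    # cautious masking: zero update entries whose sign disagrees with
    # the gradient, then rescale to preserve mean update magnitude
    mask = [[1.0 if u * g > 0 else 0.0 for u, g in zip(ur, gr)]
            for ur, gr in zip(update, grad)]
    total = sum(sum(r) for r in mask)
    n = sum(len(r) for r in mask)
    scale = n / total if total else 0.0
    return [[u * m * scale for u, m in zip(ur, mr)]
            for ur, mr in zip(update, mask)]
```

The full update would row-normalize the momentum buffer, orthogonalize it with Newton-Schulz, and then apply the cautious mask against the raw gradient before the weight step.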
Architecture
depth recurrence
Adds virtual layers via recurrence on selected layers.
parameters: {"physical_layers":11,"virtual_layers":13,"recurrence_layers":[4,5],"start_step":3000}
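The recurrence schedule above can be sketched as follows. `run_stack` is a hypothetical name; the assumption, matching the parameters, is that layers 4 and 5 are each applied twice once training passes `start_step`, turning 11 physical layers into 13 virtual ones:

```python
def run_stack(x, layers, recurrence_layers, step, start_step=3000):
    # replay selected physical layers to add virtual depth;
    # recurrence only activates after start_step
    for i, layer in enumerate(layers):
        x = layer(x)
        if step >= start_step and i in recurrence_layers:
            x = layer(x)  # second pass through the same weights
    return x
```

With 11 layers and recurrence on {4, 5}, a forward pass makes 11 layer calls before step 3000 and 13 afterwards, with no extra parameters.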
parallel residuals
Uses separate attention and MLP residual lanes with a learnable merge.
parameters: {"start_layer":7}
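A minimal sketch of one parallel-residual block, assuming the "learnable merge" is a scalar sigmoid gate (the PR does not specify the merge form, so this is an illustration, not the submission's implementation):

```python
import math

def parallel_residual_block(x, attn, mlp, merge_logit):
    # separate residual lanes: attention and MLP each read the same input
    attn_lane = x + attn(x)
    mlp_lane = x + mlp(x)
    g = 1.0 / (1.0 + math.exp(-merge_logit))  # learnable gate in [0, 1]
    return g * attn_lane + (1.0 - g) * mlp_lane
```

Per the parameters, blocks before layer 7 would keep the standard sequential residual path and only layers 7 onward use the two-lane merge.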
weight tying
Not mentioned explicitly in the submission.
parameters: null
Weight Averaging
EMA
parameters: {"decay":0.997}
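The EMA rule with decay 0.997 is the standard shadow-weight update; `ema_update` is a hypothetical helper name:

```python
def ema_update(shadow, params, decay=0.997):
    # shadow <- decay * shadow + (1 - decay) * params, per parameter
    return [decay * s + (1.0 - decay) * p for s, p in zip(shadow, params)]
```

Evaluation then uses the shadow weights; after n steps toward a constant target the shadow has closed a fraction 1 - 0.997^n of the gap.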
Quantization
GPTQ
bits: 6
scope: full model
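For orientation, a plain round-to-nearest symmetric 6-bit quantizer is sketched below. This is explicitly a stand-in, not GPTQ: real GPTQ quantizes column by column and compensates each column's rounding error using second-order (Hessian) information, but the bit range and scale handling are the same:

```python
def quantize_rtn(row, bits=6):
    # symmetric round-to-nearest per row; INT6 range is [-32, 31]
    qmax = 2 ** (bits - 1) - 1
    amax = max(abs(v) for v in row)
    scale = amax / qmax if amax > 0 else 1.0
    q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in row]
    return q, scale

def dequantize(q, scale):
    # reconstruct approximate weights from integers and scale
    return [v * scale for v in q]
```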
Compression
lzma
level: null
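Since the submission does not state the LZMA level, the sketch below uses Python's stdlib `lzma` with its default xz container; the preset is an assumption:

```python
import lzma

def compress_artifact(blob: bytes, preset: int = 9) -> bytes:
    # xz/LZMA via the stdlib; higher presets trade time for ratio
    return lzma.compress(blob, preset=preset)
```

The reported 15,170,732 B artifact size would be measured on the compressed output.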
Other
other
SP4096 BPE tokenizer.
parameters: null
other
QK-Gain 5.0 per-head query-key scaling.
parameters: {"qk_gain":5}
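A sketch of where a QK gain of 5.0 plausibly enters attention: as a multiplier on the query-key logits on top of the usual 1/sqrt(d) scaling. The exact placement (and whether the gain is learned per head) is not specified in the submission, so this single-head version is an assumption:

```python
import math

def qk_scores(q, k, qk_gain=5.0):
    # query-key logits with an extra gain factor; assumed multiplicative
    # on the standard 1/sqrt(d) dot-product scaling
    d = len(q)
    return [qk_gain * sum(qi * ki for qi, ki in zip(q, kv)) / math.sqrt(d)
            for kv in k]
```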
Novel Contributions
- Cautious masking applied to Muon updates
- MuonEq-R row normalization before Newton-Schulz orthogonalization
- Depth recurrence with 13 virtual layers from 11 physical layers
- Parallel residual lanes from layer 7
- SP4096 tokenizer integration
- Full-model GPTQ INT6 quantization with LZMA compression
- EMA 0.997 weight averaging