PR #1279
Record: MuonEq-R + Depth Recurrence + N61 Mixed GPTQ — val_bpb 1.0924 (3-seed mean)
by dexhunter
val_bpb: 1.0924
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.98 MB
Training Techniques
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"variant":"MuonEq-R","row_normalization_before_ns5":true}
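A minimal sketch of what `row_normalization_before_ns5` might mean (an assumption; the record's actual MuonEq-R code is not shown here): each gradient row is rescaled to unit L2 norm before the quintic Newton-Schulz orthogonalization that Muon applies to its update matrix.

```python
import numpy as np

def newton_schulz5(G, steps=5, eps=1e-7):
    """Quintic Newton-Schulz iteration (NS5), as used by Muon, pushing
    G toward an orthogonal matrix with all singular values near 1."""
    a, b, c = 3.4445, -4.7750, 2.0315   # standard Muon NS5 coefficients
    X = G / (np.linalg.norm(G) + eps)   # Frobenius normalization bounds the spectrum
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

def muon_eq_r_update(grad, steps=5, eps=1e-7):
    """Hypothetical MuonEq-R step: rescale each gradient row to unit L2
    norm BEFORE the NS5 orthogonalization (assumed reading of
    row_normalization_before_ns5)."""
    grad = grad / (np.linalg.norm(grad, axis=1, keepdims=True) + eps)
    return newton_schulz5(grad, steps, eps)
```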
Architecture
depth recurrence
Layers 4 and 5 are re-applied once after the initial forward pass, with their MLP weights fully shared across passes.
parameters: {"layers":[4,5],"repeat_count":1}
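A minimal sketch of this recurrence scheme, treating layers as plain callables (the helper name and structure are illustrative, not the record's code):

```python
def forward_with_depth_recurrence(layers, x, recur_layers=(4, 5), repeat_count=1):
    """Run the full layer stack once, then re-apply the chosen layers
    `repeat_count` extra times with the SAME (shared) weights: extra
    effective depth at zero extra parameter cost."""
    for layer in layers:           # initial forward pass
        x = layer(x)
    for _ in range(repeat_count):  # recurrence: reuse layers 4 and 5
        for i in recur_layers:
            x = layers[i](x)
    return x

# Toy usage: six "layers" that each add a constant.
layers = [lambda x, k=k: x + k for k in range(6)]
out = forward_with_depth_recurrence(layers, 0)  # adds 0..5, then 4 and 5 again
```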
BigramHash
BigramHash token embedding.
parameters: {"dimensions":[2816,160]}
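The record does not spell out the hashing scheme; a plausible sketch of a bigram-hash embedding, reading the `[2816, 160]` parameters as bucket count × embedding dimension (an assumption, as is the mixing constant):

```python
import numpy as np

NUM_BUCKETS, DIM = 2816, 160   # taken from the record's parameters (assumed meaning)
rng = np.random.default_rng(0)
bigram_table = rng.standard_normal((NUM_BUCKETS, DIM)).astype(np.float32)

def bigram_hash_embed(prev_tok, tok, mult=1_000_003):
    """Look up an embedding for the (prev_tok, tok) bigram by hashing it
    into a fixed bucket; the mixing constant is an arbitrary choice here.
    In a model this vector would typically be added to the usual
    per-token embedding."""
    bucket = (prev_tok * mult + tok) % NUM_BUCKETS
    return bigram_table[bucket]
```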
GQA
Uses 8 attention heads with 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
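With 8 query heads and 4 KV heads, each KV head is shared by two query heads. A NumPy sketch of grouped-query attention (illustrative, not the record's implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gqa_attention(q, k, v):
    """Grouped-query attention. q: (H, T, d); k, v: (H_kv, T, d) with
    H % H_kv == 0. Each KV head is broadcast to H // H_kv query heads."""
    group = q.shape[0] // k.shape[0]   # 8 // 4 = 2 in this record
    k = np.repeat(k, group, axis=0)    # share each KV head across `group` query heads
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v
```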
Quantization
GPTQ (mixed int6/int5)
bits: null
scope: all
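Real GPTQ corrects rounding error column-by-column using second-order statistics of the layer inputs; as a stand-in, the sketch below shows plain symmetric round-to-nearest quantization plus a hypothetical per-layer bit assignment matching the 61×int6 / 5×int5 split reported below:

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Per-row symmetric round-to-nearest quantization (a simplified
    stand-in for GPTQ, which additionally compensates rounding error
    using a Hessian approximation of the layer inputs)."""
    qmax = 2 ** (bits - 1) - 1   # 31 for int6, 15 for int5
    scale = np.maximum(np.abs(w).max(axis=1, keepdims=True), 1e-12) / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def layer_bits(layer_idx, int5_layers):
    """Hypothetical mixed assignment: a handful of layers get int5,
    the rest int6 (the PR reports a 61 int6 / 5 int5 split)."""
    return 5 if layer_idx in int5_layers else 6
```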
Weight Averaging
EMA
parameters: {"decay":0.997}
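EMA weight averaging with decay 0.997 keeps a shadow copy of the parameters that drifts slowly toward the live weights; a minimal sketch over a parameter dict:

```python
def ema_update(ema, params, decay=0.997):
    """One EMA step over a dict of parameters:
    ema <- decay * ema + (1 - decay) * params."""
    return {k: decay * ema[k] + (1 - decay) * params[k] for k in params}

shadow = {"w": 1.0}
shadow = ema_update(shadow, {"w": 0.0})  # moves 0.3% of the way toward 0.0
```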
Regularization
weight decay
parameters: {"value":0.085}
Evaluation
sliding window eval
parameters: null
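Since `parameters` is null, the exact window and stride are unknown; a generic sliding-window chunking scheme (the window/stride values below are placeholders), in which each chunk re-feeds a full window of context but scores only its trailing new tokens:

```python
def sliding_window_chunks(seq_len, window=512, stride=256):
    """Plan (start, end, n_scored) evaluation chunks: the first chunk
    scores everything it covers, each later chunk re-feeds a full
    `window` of context but scores only its last `stride` new tokens,
    so every scored token keeps near-full left context."""
    end = min(window, seq_len)
    chunks = [(0, end, end)]
    while end < seq_len:
        new_end = min(end + stride, seq_len)
        chunks.append((new_end - window, new_end, new_end - end))
        end = new_end
    return chunks
```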
Compression
brotli
level: 11
Novel Contributions
- MuonEq-R optimizer variant with row normalization before Newton-Schulz orthogonalization
- Depth recurrence on layers 4 and 5 with fully shared MLP weights
- Mixed GPTQ quantization using 61 int6 layers and 5 int5 layers
- Smaller self-extracting mini runner to free artifact budget for one additional int6 layer
- Three-seed verified record submission under the 16MB artifact limit