PR #1260
openRecord: MuonEq-R + Depth Recurrence + Mixed Int5/Int6 GPTQ — val_bpb 1.0929 (3-seed mean)
by dexhunter
val_bpb
1.0929
Architecture
Transformer
Optimizer
MuonEq-R
Artifact Size
~15.96 MB
Training Techniques
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"variant":"MuonEq-R","row_normalization_before_ns":true}
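The `other_params` field records `row_normalization_before_ns: true`. A minimal NumPy sketch of what such a step could look like, assuming the standard quintic Newton-Schulz iteration used by Muon and per-row L2 normalization of the update matrix beforehand; `muoneq_r_update` and its details are illustrative, not the PR's actual code:

```python
import numpy as np

def newton_schulz(G, steps=5, eps=1e-7):
    # Quintic Newton-Schulz iteration (Muon's standard coefficients) that
    # approximately orthogonalizes a 2D update matrix.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)  # Frobenius-normalize so spectral norm <= 1
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

def muoneq_r_update(G, steps=5, eps=1e-7):
    # Hypothetical MuonEq-R step: normalize each row to unit L2 norm
    # *before* Newton-Schulz, per "row_normalization_before_ns": true.
    G = G / (np.linalg.norm(G, axis=1, keepdims=True) + eps)
    return newton_schulz(G, steps, eps)

rng = np.random.default_rng(0)
O = muoneq_r_update(rng.standard_normal((8, 16)))
# After a few iterations, all singular values are pulled toward 1.
print(np.linalg.svd(O, compute_uv=False).round(2))
```

With these coefficients the singular values oscillate in a band around 1 rather than converging exactly, which is the usual Muon trade-off of speed over precision.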
Architecture
depth recurrence
Layers 4 and 5 are repeated once (one extra pass per forward), with the MLP weights fully shared across the recurrence.
parameters: {"layers":[4,5],"repeat_count":1,"shared_mlp":true}
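A small sketch of the execution schedule this parameterization implies, assuming the extra pass runs in place (immediately after the block's first pass); the in-place placement is an assumption, not stated in the record:

```python
def layer_schedule(n_layers, recur_layers, repeat_count):
    """Execution order when a contiguous block of layers is re-run
    (1 + repeat_count) times in place, sharing weights each pass."""
    order = []
    i = 0
    while i < n_layers:
        if i == recur_layers[0]:
            # Re-run the recurrent block; parameters are shared, so the
            # extra passes add depth but no new weights.
            order.extend(list(recur_layers) * (1 + repeat_count))
            i = recur_layers[-1] + 1
        else:
            order.append(i)
            i += 1
    return order

print(layer_schedule(8, [4, 5], 1))  # [0, 1, 2, 3, 4, 5, 4, 5, 6, 7]
```

Because the MLP weights are shared, the extra pass costs compute but zero bytes in the artifact, which matters under a 16MB limit.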
BigramHash
BigramHash token embedding.
parameters: {"shape":"2816x160"}
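The 2816x160 shape suggests a hashed table of bigram (previous-token, current-token) embeddings added alongside the unigram embedding. A hedged sketch under that assumption; the hash function and the zero treatment of the first position are illustrative choices:

```python
import numpy as np

TABLE_ROWS, EMB_DIM = 2816, 160  # matches the reported 2816x160 shape

def bigram_hash(prev_id, cur_id, n_rows=TABLE_ROWS):
    # Hypothetical mixing hash; any cheap pair hash works, since the table
    # size (not the hash) bounds the parameter cost.
    return (prev_id * 1000003 + cur_id) % n_rows

rng = np.random.default_rng(0)
bigram_table = rng.standard_normal((TABLE_ROWS, EMB_DIM)) * 0.02

def bigram_embed(token_ids):
    # Look up a hashed-bigram embedding per position; the first position
    # has no predecessor, so it stays zero here.
    out = np.zeros((len(token_ids), EMB_DIM))
    for t in range(1, len(token_ids)):
        out[t] = bigram_table[bigram_hash(token_ids[t - 1], token_ids[t])]
    return out

emb = bigram_embed([5, 17, 17, 42])
print(emb.shape)  # (4, 160)
```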
XSA
XSA-all-11 attention pattern.
parameters: {"pattern":"all-11"}
MLP4x
4.0x MLP multiplier with sigmoid-gated activation.
parameters: {"multiplier":4}
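"Sigmoid-gated activation" could mean either a plain SiLU or a GLU-style gate; a GLU-style reading with the 4x hidden width is sketched below, purely as an assumption about what the entry describes:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_mlp(x, W_gate, W_val, W_out):
    # GLU-style sigmoid gate: hidden = value * sigmoid(gate), with the
    # hidden width set by the 4x multiplier from the record.
    return (sigmoid(x @ W_gate) * (x @ W_val)) @ W_out

d, mult = 32, 4
rng = np.random.default_rng(0)
h = gated_mlp(rng.standard_normal((2, d)),
              rng.standard_normal((d, mult * d)) * 0.1,
              rng.standard_normal((d, mult * d)) * 0.1,
              rng.standard_normal((mult * d, d)) * 0.1)
print(h.shape)  # (2, 32)
```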
Quantization
mixed int5/int6 GPTQ
bits: null
scope: all layers
QAT
bits: null
scope: all
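Both the GPTQ pass and QAT ultimately round weights onto a fixed integer grid. The sketch below shows only that shared primitive, symmetric per-tensor fake quantization at 5 vs 6 bits, not the GPTQ error-compensation solver itself:

```python
import numpy as np

def fake_quantize(w, bits):
    # Symmetric per-tensor fake quantization: round onto a signed
    # 2^bits-level grid, then dequantize. In QAT the round is paired with
    # a straight-through estimator so gradients flow.
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(1000)
err5 = np.mean((w - fake_quantize(w, 5)) ** 2)
err6 = np.mean((w - fake_quantize(w, 6)) ** 2)
print(err6 < err5)  # True: the int6 grid is finer, so less rounding error
```

One extra bit roughly quarters the mean squared rounding error, which is the trade the mixed int5/int6 split is balancing against artifact size.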
Compression
brotli
level: 11
Weight Averaging
EMA
parameters: {"decay":0.997}
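The EMA update itself is standard; with the recorded decay of 0.997 it averages over roughly the last 1/(1-0.997) ≈ 333 steps:

```python
def ema_update(ema, params, decay=0.997):
    # Exponential moving average of weights: ema <- decay*ema + (1-decay)*param.
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema, params)]

# Starting from 0 and averaging a constant parameter of 1.0:
ema = [0.0]
for step in range(3):
    ema = ema_update(ema, [1.0])
print(round(ema[0], 6))  # 0.008973, i.e. 1 - 0.997**3
```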
Regularization
weight decay
parameters: {"value":0.085}
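The record gives the weight-decay value but not its form; assuming decoupled (AdamW-style) decay, which is the common pairing with Muon-family optimizers, the per-step shrink looks like:

```python
def apply_decoupled_weight_decay(params, lr, wd=0.085):
    # Decoupled weight decay: shrink weights directly, independent of the
    # gradient-based update. 0.085 is the value from the record; decoupling
    # is an assumption.
    return [p * (1.0 - lr * wd) for p in params]

decayed = apply_decoupled_weight_decay([1.0, -2.0], lr=0.1)
print(decayed)  # each weight scaled by 1 - 0.1 * 0.085 = 0.9915
```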
Novel Contributions
- MuonEq-R optimizer with row normalization before Newton-Schulz orthogonalization
- Depth recurrence on layers 4 and 5 with fully shared MLP weights
- Mixed int5/int6 GPTQ using Hessian sensitivity ranking
- 3-seed mean validation score of 1.0929 bpb under the 16MB limit
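For the third contribution, "Hessian sensitivity ranking" presumably scores each layer (e.g. via a diagonal-Hessian proxy) and spends the higher bit-width where rounding hurts most. A hypothetical allocator under that reading; the scoring inputs and budget scheme are illustrative:

```python
def allocate_bits(sensitivity, budget_bits, low=5, high=6):
    # Hypothetical mixed-precision allocator: give the high bit-width to the
    # most sensitive layers until the average-bit budget is spent.
    n = len(sensitivity)
    n_high = int(n * (budget_bits - low) / (high - low))
    ranked = sorted(range(n), key=lambda i: -sensitivity[i])
    bits = [low] * n
    for i in ranked[:n_high]:
        bits[i] = high
    return bits

# e.g. four layers, average budget 5.5 bits -> the two most sensitive get int6
print(allocate_bits([0.9, 0.1, 0.5, 0.3], 5.5))  # [6, 5, 6, 5]
```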