PR #1260

open

Record: MuonEq-R + Depth Recurrence + Mixed Int5/Int6 GPTQ — val_bpb 1.0929 (3-seed mean)

by dexhunter
val_bpb: 1.0929
Architecture: Transformer
Optimizer: MuonEq-R
Artifact Size: ~15.96 MB

Training Techniques

Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"variant":"MuonEq-R","row_normalization_before_ns":true}
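The entry above says only that rows are normalized before Newton-Schulz orthogonalization. A minimal NumPy sketch of that ordering follows. It assumes the classic cubic iteration X <- 1.5 X - 0.5 X X^T X and omits momentum; the PR's actual coefficients and step count are not listed here, and the function names are hypothetical.

```python
import numpy as np

def row_normalize(g, eps=1e-8):
    """Scale every row of the update matrix to unit L2 norm."""
    norms = np.linalg.norm(g, axis=1, keepdims=True)
    return g / (norms + eps)

def newton_schulz_orthogonalize(g, steps=30, eps=1e-8):
    """Approximate the nearest semi-orthogonal matrix to g.

    Assumption: the classic cubic iteration; the PR may use other coefficients.
    """
    # Dividing by the Frobenius norm keeps every singular value at or below 1,
    # which is inside the convergence region of the cubic iteration.
    x = g / (np.linalg.norm(g) + eps)
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T  # work in the wide orientation so x @ x.T is the small Gram matrix
    for _ in range(steps):
        x = 1.5 * x - 0.5 * (x @ x.T) @ x
    return x.T if transposed else x

def muoneq_r_direction(grad):
    """Row normalization first, then orthogonalization."""
    return newton_schulz_orthogonalize(row_normalize(grad))

rng = np.random.default_rng(0)
grad = rng.standard_normal((4, 8))  # toy gradient, not from the PR
update = muoneq_r_direction(grad)
```

For a wide matrix such as this one, the rows of `update` come out approximately orthonormal, so `update @ update.T` is close to the identity.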
Architecture
depth recurrence
Repeated layers 4 and 5 with fully shared MLP weights during recurrence.
parameters: {"layers":[4,5],"repeat_count":1,"shared_mlp":true}
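One way to read these parameters is as an execution schedule over layer indices. The sketch below assumes an 11-layer stack (inferred from the "all-11" pattern, not stated directly) and assumes the repeated block runs again immediately after its first pass. The function name is hypothetical.

```python
def recurrence_schedule(num_layers, repeated, repeat_count):
    """Return the order in which layer indices are executed.

    The repeated block runs once normally, then repeat_count extra times.
    """
    schedule = []
    for layer in range(num_layers):
        schedule.append(layer)
        if layer == repeated[-1]:
            # Re-run the whole block; reusing an index means reusing its weights.
            schedule.extend(list(repeated) * repeat_count)
    return schedule

order = recurrence_schedule(num_layers=11, repeated=[4, 5], repeat_count=1)
print(order)  # [0, 1, 2, 3, 4, 5, 4, 5, 6, 7, 8, 9, 10]
```

In this sketch a repeated index reuses all of that layer's weights. The entry confirms sharing only for the MLP weights, so whether attention weights are shared as well is left open.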
BigramHash
BigramHash token embedding.
parameters: {"shape":"2816x160"}
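A rough illustration of how a hashed bigram table of this shape could be indexed, assuming 2816 is the number of hash buckets and 160 is the embedding width. The hash function and names are hypothetical, since the PR's actual hash is not given here.

```python
NUM_BUCKETS = 2816  # rows of the table, from the listed shape
EMBED_DIM = 160     # columns of the table, from the listed shape

def bigram_bucket(prev_token, token, num_buckets=NUM_BUCKETS):
    """Map an ordered token pair to a table row (hypothetical hash)."""
    return (prev_token * 1_000_003 + token) % num_buckets

def bigram_buckets(tokens, num_buckets=NUM_BUCKETS):
    """Bucket index per position; position 0 has no predecessor and uses bucket 0."""
    if not tokens:
        return []
    buckets = [0]
    for prev_token, token in zip(tokens, tokens[1:]):
        buckets.append(bigram_bucket(prev_token, token, num_buckets))
    return buckets

# Toy token ids: the pair (17, 4) occurs twice and lands in the same bucket.
ids = bigram_buckets([17, 4, 17, 4, 99])
```

Each index would then select one row of width `EMBED_DIM` from the table. Distinct pairs can collide in the same bucket, which is the trade-off that keeps the table small.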
XSA
XSA-all-11 attention pattern.
parameters: {"pattern":"all-11"}
MLP4x
4.0x MLP multiplier with sigmoid-gated activation.
parameters: {"multiplier":4}
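"Sigmoid-gated" admits more than one reading. The NumPy sketch below takes one of them, in which a separate gate projection is passed through a sigmoid and multiplied into the up projection, with a hidden width of 4x the model width. The weight names and toy sizes are assumptions, not taken from the PR.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_mlp(x, w_up, w_gate, w_down):
    """Up projection modulated elementwise by a sigmoid gate, then projected back."""
    hidden = (x @ w_up) * sigmoid(x @ w_gate)
    return hidden @ w_down

d_model = 8              # toy width for illustration
d_hidden = 4 * d_model   # the 4.0x multiplier from the entry above
rng = np.random.default_rng(0)
w_up = rng.standard_normal((d_model, d_hidden)) * 0.1
w_gate = rng.standard_normal((d_model, d_hidden)) * 0.1
w_down = rng.standard_normal((d_hidden, d_model)) * 0.1

x = rng.standard_normal((3, d_model))  # three toy token vectors
y = gated_mlp(x, w_up, w_gate, w_down)
```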
Quantization
mixed int5/int6 GPTQ
bits: null
scope: all layers
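A sketch of how a sensitivity ranking could drive the int5/int6 split. It assumes a per-tensor sensitivity score is already available and that a fixed fraction of tensors gets the higher precision; the scores, tensor names, and the 50% split are hypothetical. The quantizer shown is plain round-to-nearest, a simpler technique than GPTQ's error-compensated rounding.

```python
def assign_bit_widths(sensitivity, high_bits=6, low_bits=5, high_fraction=0.5):
    """Give the most sensitive tensors high_bits and the rest low_bits."""
    ranked = sorted(sensitivity, key=sensitivity.get, reverse=True)
    num_high = round(len(ranked) * high_fraction)
    return {name: (high_bits if rank < num_high else low_bits)
            for rank, name in enumerate(ranked)}

def quantize_symmetric(values, bits):
    """Round-to-nearest symmetric quantization to signed integer codes."""
    qmax = 2 ** (bits - 1) - 1          # 15 for int5, 31 for int6
    scale = max(abs(v) for v in values) / qmax or 1.0  # avoid a zero scale
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale

# Hypothetical sensitivity scores, for illustration only.
scores = {"layer0.mlp": 0.9, "layer1.mlp": 0.1,
          "layer2.attn": 0.5, "layer3.attn": 0.05}
bits = assign_bit_widths(scores)
codes, scale = quantize_symmetric([0.5, -1.0, 0.25], bits["layer0.mlp"])
```

Here the two highest-scoring tensors receive 6 bits and the other two receive 5. Dequantization multiplies each code by `scale`.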
QAT
bits: null
scope: all
Compression
brotli
level: 11
Weight Averaging
EMA
parameters: {"decay":0.997}
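For reference, the standard EMA update with this decay is sketched below. It assumes the average is updated once per optimizer step and has no warm-up or bias correction, neither of which is stated above.

```python
def ema_update(ema_weights, weights, decay=0.997):
    """Blend the running average toward the current weights."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]

# Toy example: the average starts at [0, 1] and the weights stay fixed at [1, 1].
ema = [0.0, 1.0]
for _ in range(3):
    ema = ema_update(ema, [1.0, 1.0])
```

With a decay of 0.997 each step moves the average 0.3% of the way toward the current weights, so the first entry equals 1 - 0.997**3 after three steps.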
Regularization
weight decay
parameters: {"value":0.085}

Novel Contributions

  • MuonEq-R optimizer with row normalization before Newton-Schulz orthogonalization
  • Depth recurrence on layers 4 and 5 with fully shared MLP weights
  • Mixed int5/int6 GPTQ using Hessian sensitivity ranking
  • 3-seed mean validation score of 1.0929 bpb under the 16MB limit