PR #1285

RECORD (open)

Record: MuonEq-R + Depth Recurrence + WD=0.090 + All-Int6 GPTQ — val_bpb 1.0912 (3-seed mean)

by dexhunter
val_bpb: 1.0912
Architecture: Transformer
Optimizer: MuonEq-R
Artifact Size: ~15.96 MB

Training Techniques

Optimizer
Muon
weight_decay: 0.09
momentum: null
other_params: {"variant":"MuonEq-R","row_normalization":true}
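
The record only states that MuonEq-R is a Muon variant with row_normalization enabled; it does not say where in the update the normalization is applied. As a minimal sketch of one plausible reading, the snippet below rescales each row of a 2-D update matrix to unit L2 norm. The function name row_normalize, the eps value, and the example matrix are illustrative assumptions, not taken from the PR.

```python
import math

def row_normalize(update, eps=1e-8):
    """Scale each row of a 2-D update (list of lists) to roughly unit L2 norm.

    Assumption: this is only one plausible meaning of "row_normalization";
    the PR summary does not specify the exact operation or its placement.
    """
    normalized = []
    for row in update:
        norm = math.sqrt(sum(x * x for x in row))
        # eps guards against division by zero for all-zero rows.
        normalized.append([x / (norm + eps) for x in row])
    return normalized

# Hypothetical 2x2 update; after normalization every row has norm close to 1.
update = [[3.0, 4.0], [0.0, 0.5]]
for row in row_normalize(update):
    print(row, math.sqrt(sum(x * x for x in row)))
```

In a Muon-style optimizer, a step like this would presumably sit next to the orthogonalization of the momentum update, but that placement is a guess here.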
Architecture
depth recurrence
Layers 4 and 5 are repeated, with the MLP weights fully shared across repetitions.
parameters: {"layers":[4,5]}
BigramHash
BigramHash token embedding.
parameters: {"dimensions":[2816,160]}
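
The depth recurrence entry above says layers 4 and 5 are repeated, but not how often or in what order. The sketch below shows one possible execution schedule in which the recurrent span runs twice back to back and reuses the same layer indices (and therefore the same weights). The total layer count of 8, the repeats value, and the function name layer_schedule are hypothetical.

```python
def layer_schedule(num_layers, recur_layers, repeats=2):
    """Return the order in which layer indices are applied in a forward pass.

    Assumptions: recur_layers is a contiguous, sorted span, and the whole
    span is repeated as a block. The PR summary does not confirm either.
    """
    first, last = recur_layers[0], recur_layers[-1]
    schedule = list(range(first))            # layers before the span, once
    for _ in range(repeats):
        schedule.extend(recur_layers)        # same indices, hence shared weights
    schedule.extend(range(last + 1, num_layers))  # layers after the span, once
    return schedule

# Hypothetical 8-layer model with layers 4 and 5 run twice.
print(layer_schedule(num_layers=8, recur_layers=[4, 5]))
# -> [0, 1, 2, 3, 4, 5, 4, 5, 6, 7]
```

The effective depth grows from 8 to 10 applications while the parameter count stays that of 8 layers, which is the usual motivation for depth recurrence under an artifact size limit.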
Quantization
GPTQ
bits: 6
scope: all
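
To make the 6-bit setting concrete, here is a small sketch of symmetric per-row int6 quantization using plain round-to-nearest. This is deliberately not GPTQ: GPTQ additionally uses second-order information from calibration data to compensate rounding error in the weights that have not yet been quantized. The per-row scale, the symmetric range [-31, 31], and the function names are assumptions for illustration only.

```python
def quantize_int6(row):
    """Round-to-nearest symmetric int6 quantization of one weight row.

    Not GPTQ: no calibration data and no error compensation are used here.
    """
    max_abs = max(abs(x) for x in row)
    scale = max_abs / 31 if max_abs > 0 else 1.0
    # Clamp to the symmetric 6-bit range [-31, 31].
    codes = [max(-31, min(31, round(x / scale))) for x in row]
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]

# Hypothetical weight row.
row = [1.0, -0.5, 0.25, 0.0]
codes, scale = quantize_int6(row)
restored = dequantize(codes, scale)
print(codes, scale)
print(max(abs(a - b) for a, b in zip(row, restored)))  # at most scale / 2
```

Each weight then occupies 6 bits plus a shared scale per row, before any entropy coding such as brotli is applied to the packed artifact.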
Regularization
weight decay
parameters: {"weight_decay":0.09}
Weight Averaging
EMA
parameters: {"decay":0.997}
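
With decay 0.997, an exponential moving average keeps 99.7% of the running average and mixes in 0.3% of the current weights at each update. A minimal sketch of that update rule follows; the flat list representation and the name ema_update are illustrative, and how often the PR applies the update is not stated.

```python
def ema_update(ema, params, decay=0.997):
    """One EMA step: ema <- decay * ema + (1 - decay) * params."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema, params)]

# Hypothetical two-parameter model whose raw weights jump from 1.0 to 0.0.
ema = [1.0, 1.0]
for _ in range(3):
    ema = ema_update(ema, [0.0, 0.0])
print(ema)  # each entry equals 0.997 ** 3
```

A decay of 0.997 corresponds to an averaging horizon of roughly 1 / (1 - 0.997), or about 333 updates.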
Evaluation
sliding window eval
parameters: null
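
The record lists no parameters for the sliding window evaluation. As a general sketch of the technique, the helper below splits a token sequence into overlapping windows so that every token is scored exactly once, with later tokens seeing earlier context from the overlap. The window and stride values and the name sliding_windows are hypothetical.

```python
def sliding_windows(num_tokens, window, stride):
    """Return (begin, end, score_from) spans for sliding window evaluation.

    The model reads tokens[begin:end]; only tokens[score_from:end] count
    toward the loss, so no token is scored twice.
    """
    assert 0 < stride <= window, "stride must not exceed the window length"
    spans = []
    scored_upto = 0
    for begin in range(0, num_tokens, stride):
        end = min(begin + window, num_tokens)
        spans.append((begin, end, scored_upto))
        scored_upto = end
        if end == num_tokens:
            break
    return spans

# Hypothetical 10-token sequence, window of 4, stride of 2.
print(sliding_windows(num_tokens=10, window=4, stride=2))
# -> [(0, 4, 0), (2, 6, 4), (4, 8, 6), (6, 10, 8)]
```

A smaller stride gives each scored token more preceding context at the cost of more forward passes.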
Compression
brotli
level: 11

Novel Contributions

  • MuonEq-R optimizer variant with row normalization
  • Depth recurrence on layers 4 and 5
  • Higher weight decay (0.090), which leaves more compression headroom
  • All 66 layers quantized to int6 with GPTQ
  • WD-quantization synergy to improve artifact efficiency