PR #1279 (open)

Record: MuonEq-R + Depth Recurrence + N61 Mixed GPTQ — val_bpb 1.0924 (3-seed mean)

by dexhunter
val_bpb
1.0924
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.98 MB

Training Techniques

Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"variant":"MuonEq-R","row_normalization_before_ns5":true}
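A minimal sketch of what the `row_normalization_before_ns5` flag could mean: each row of the update matrix is normalized to unit norm before the usual 5-step Newton-Schulz orthogonalization. The Newton-Schulz coefficients below are the ones from the public Muon implementation; the `muon_eq_r_update` name and the exact placement of the row normalization are assumptions, not the record's code.

```python
import numpy as np

def newton_schulz5(G, steps=5):
    # Quintic Newton-Schulz iteration approximating orthogonalization,
    # with the coefficients used in the public Muon implementation.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)
    transposed = G.shape[0] > G.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    return X.T if transposed else X

def muon_eq_r_update(grad):
    # Hypothetical MuonEq-R step: normalize each row of the (momentum)
    # gradient matrix to unit norm before the NS5 orthogonalization.
    row_norms = np.linalg.norm(grad, axis=1, keepdims=True)
    return newton_schulz5(grad / (row_norms + 1e-7))
```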
Architecture
depth recurrence
Applies layers 4 and 5 one extra time after the initial pass through the stack, with MLP weights fully shared between passes.
parameters: {"layers":[4,5],"repeat_count":1}
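As a sketch, the recurrence schedule described above amounts to re-invoking the same block modules, so the repeated layers add depth without adding parameters. The function name and the callable-block abstraction are illustrative:

```python
def forward_with_recurrence(blocks, x, recur_layers=(4, 5), repeat_count=1):
    """Depth-recurrence sketch: run every block once, then re-run the
    listed blocks `repeat_count` more times, reusing the same modules
    (so their weights are fully shared with the first pass)."""
    for block in blocks:
        x = block(x)
    for _ in range(repeat_count):
        for i in recur_layers:
            x = blocks[i](x)
    return x
```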
BigramHash
BigramHash token embedding.
parameters: {"dimensions":[2816,160]}
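A hashed bigram embedding with these dimensions could look like the sketch below: the (previous, current) token pair is hashed into a fixed 2816-row table of 160-dim vectors that is added alongside the regular token embedding. The hash function, initialization scale, and names are assumptions for illustration only.

```python
import numpy as np

TABLE_ROWS, EMB_DIM = 2816, 160  # the listed dimensions
bigram_table = np.random.randn(TABLE_ROWS, EMB_DIM).astype(np.float32) * 0.02

def bigram_hash(prev_tok: int, cur_tok: int) -> int:
    # Cheap multiplicative hash of the token pair into the table
    # (hash constants are illustrative).
    return ((prev_tok * 1000003) ^ cur_tok) % TABLE_ROWS

def bigram_embed(tokens):
    # Position 0 has no preceding token, so its bigram embedding is zero.
    out = np.zeros((len(tokens), EMB_DIM), dtype=np.float32)
    for t in range(1, len(tokens)):
        out[t] = bigram_table[bigram_hash(tokens[t - 1], tokens[t])]
    return out
```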
GQA
Uses 8 attention heads with 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
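With 8 query heads over 4 KV heads, each KV head is shared by a group of 2 query heads, roughly halving KV-cache size versus full multi-head attention. A minimal single-sequence sketch (not the record's code; shapes and head dim are illustrative):

```python
import numpy as np

def gqa(q, k, v, heads=8, kv_heads=4):
    """Grouped-query attention sketch: q is (T, heads*hd), k/v are
    (T, kv_heads*hd). KV heads are repeated so consecutive query
    heads share one KV head. Causal masking, single sequence."""
    T, hd = q.shape[0], q.shape[1] // heads
    group = heads // kv_heads
    q = q.reshape(T, heads, hd)
    k = k.reshape(T, kv_heads, hd).repeat(group, axis=1)  # share KV per group
    v = v.reshape(T, kv_heads, hd).repeat(group, axis=1)
    scores = np.einsum("thd,shd->hts", q, k) / np.sqrt(hd)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)      # causal mask
    scores[:, mask] = -1e9
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    out = np.einsum("hts,shd->thd", w, v)
    return out.reshape(T, heads * hd)
```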
Quantization
GPTQ
bits: null
scope: all
mixed int6/int5
bits: null
scope: all
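The mixed bit budget (61 layers at int6, 5 at int5, per the contributions below) can be sketched with plain round-to-nearest quantization as a stand-in; real GPTQ rounds column-by-column with Hessian-based error compensation, which this sketch does not attempt.

```python
import numpy as np

def fake_quant(w, bits):
    """Symmetric per-tensor round-to-nearest sketch. Illustrates the
    bit budget only; not GPTQ's error-compensated rounding."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

# Per-layer bit assignment as described in the record:
bits_per_layer = [6] * 61 + [5] * 5
```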
Weight Averaging
EMA
parameters: {"decay":0.997}
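One EMA step with the listed decay of 0.997 averages the live weights with an effective horizon of about 1/(1-decay) ≈ 333 steps. Plain dicts of floats here for brevity; real code would walk parameter tensors:

```python
def ema_update(ema, params, decay=0.997):
    # Exponential moving average of weights at the listed decay.
    for k in params:
        ema[k] = decay * ema[k] + (1.0 - decay) * params[k]
    return ema
```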
Regularization
weight decay
parameters: {"value":0.085}
Evaluation
sliding window eval
parameters: null
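Since no parameters are recorded, here is a generic sliding-window evaluation sketch: the validation stream is scored in overlapping windows so every token is counted exactly once while keeping at least `window - stride` tokens of left context. Window/stride values and the `logprob_fn` interface (per-token natural-log probs for `ctx[1:]`) are assumptions.

```python
import math

def sliding_window_bpb(logprob_fn, tokens, window=8, stride=4):
    """Score a long token stream in overlapping windows; only tokens
    not covered by the previous window are counted. Returns bits per
    token (bpb up to the bytes-per-token factor)."""
    total_nll, n = 0.0, 0
    start = 0
    while start + 1 < len(tokens):
        end = min(start + window, len(tokens))
        lps = logprob_fn(tokens[start:end])          # len == end - start - 1
        keep = lps if start == 0 else lps[window - stride - 1:]
        total_nll += -sum(keep)
        n += len(keep)
        if end == len(tokens):
            break
        start += stride
    return total_nll / n / math.log(2)               # nats -> bits
```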
Compression
brotli
level: 11

Novel Contributions

  • MuonEq-R optimizer variant with row normalization before Newton-Schulz orthogonalization
  • Depth recurrence on layers 4 and 5 with fully shared MLP weights
  • Mixed GPTQ quantization using 61 int6 layers and 5 int5 layers
  • Smaller self-extracting mini runner to free artifact budget for one additional int6 layer
  • Three-seed verified record submission under the 16MB artifact limit