PR #1334

RECORD (open)

Record: SP4096 + Depth Recurrence + Parallel Residuals + MuonEq-R + QK-Gain 5.0 — val_bpb 1.0897 (3-seed mean)

by aryanbhosale
val_bpb: 1.0897
Architecture: Transformer
Optimizer: MuonEq-R
Artifact Size: ~15.99 MB

Training Techniques

Architecture
MLP4x
4096-vocab model with widened MLP blocks
parameters: {"vocab_size":4096}
depth recurrence
Recurrent reuse of layers to form a deeper virtual network from fewer physical layers
parameters: {"layers":[4,5],"physical_layers":11,"virtual_layers":13}
parallel residuals
Separate attention and MLP residual lanes with learned merge
parameters: {"start_layer":7}
QK-Gain
Scaled query-key gain
parameters: {"gain":5}
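
The depth recurrence entry above turns 11 physical layers into 13 virtual layers by reusing layers 4 and 5. A minimal sketch of one schedule that matches those numbers follows. The function name, the 0-based indexing, and the back-to-back ordering (4, 5, 4, 5) are assumptions for illustration; the summary does not state how the repeated pass is interleaved.

```python
def virtual_layer_schedule(physical_layers, recurrent_layers):
    """Return the order in which physical layer indices are executed.

    Assumptions (not stated in the PR summary): layers are 0-indexed, the
    recurrent layers are contiguous, and the shared block is run a second
    time immediately after its first pass.
    """
    first, last = min(recurrent_layers), max(recurrent_layers)
    schedule = []
    for layer in range(physical_layers):
        schedule.append(layer)
        if layer == last:
            # Second pass through the shared block reuses the same weights.
            schedule.extend(range(first, last + 1))
    return schedule


schedule = virtual_layer_schedule(physical_layers=11, recurrent_layers=[4, 5])
print(schedule)       # [0, 1, 2, 3, 4, 5, 4, 5, 6, 7, 8, 9, 10]
print(len(schedule))  # 13
```

The schedule has 13 entries but only 11 distinct indices, so the extra depth adds no parameters to the artifact.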
Regularization
weight decay
parameters: {"weight_decay":0.09}
Optimizer
MuonEq-R
weight_decay: null
momentum: null
other_params: {"row_normalized":true}
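
The row_normalized flag above suggests that each row of a weight update is rescaled to a common norm. The sketch below shows only that row-normalization step on a plain nested list. The function name and the eps value are hypothetical, and where this step sits inside MuonEq-R is not stated in the summary; the rest of the optimizer is not shown.

```python
import math


def row_normalize(matrix, eps=1e-8):
    """Scale each row of a 2-D list to (approximately) unit L2 norm."""
    normalized = []
    for row in matrix:
        norm = math.sqrt(sum(v * v for v in row))
        # eps guards against division by zero for an all-zero row.
        normalized.append([v / (norm + eps) for v in row])
    return normalized


update = [[3.0, 4.0], [0.0, 0.5]]  # illustrative values only
normalized_update = row_normalize(update)
row_norms = [math.sqrt(sum(v * v for v in row)) for row in normalized_update]
print(row_norms)
```

After normalization both rows have the same norm, so a row with a large raw magnitude no longer dominates the update.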
Quantization
GPTQ
bits: 6
scope: full model
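
To make the 6-bit setting above concrete, here is a sketch of plain symmetric round-to-nearest quantization to int6. This is deliberately not GPTQ: GPTQ additionally uses second-order information from calibration data to compensate for rounding error, which is omitted here. The function names and the single per-tensor scale are assumptions for illustration.

```python
def quantize_int6(weights):
    """Symmetric round-to-nearest quantization to 6-bit signed codes."""
    qmax = 31  # signed 6-bit range is [-32, 31]; use the symmetric part
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Clamp after rounding so floating-point noise cannot leave the range.
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]


weights = [0.31, -0.12, 0.05, -0.31, 0.0]  # illustrative values only
codes, scale = quantize_int6(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

Each weight is stored as one of 63 integer levels plus a shared scale, and the round-trip error per weight is at most half a quantization step.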
Compression
brotli + lzma
level: null
Evaluation
sliding window eval
parameters: {"stride":64}
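
Sliding window evaluation with stride 64 can be sketched as a window planner: each window supplies up to a full context of tokens, but only the tokens not yet scored count toward the loss. The function name and the context length of 256 are hypothetical; only the stride of 64 comes from the entry above.

```python
def sliding_windows(num_tokens, context_len, stride):
    """Return (start, end, score_from) triples covering [0, num_tokens).

    Tokens in [score_from, end) are scored; tokens in [start, score_from)
    only provide left context for the model.
    """
    windows = []
    scored_up_to = 0
    while scored_up_to < num_tokens:
        # The first window scores a full context; later ones advance by stride.
        end = min(num_tokens, max(context_len, scored_up_to + stride))
        start = max(0, end - context_len)
        windows.append((start, end, scored_up_to))
        scored_up_to = end
    return windows


windows = sliding_windows(num_tokens=1000, context_len=256, stride=64)
times_scored = [0] * 1000
for start, end, score_from in windows:
    for position in range(score_from, end):
        times_scored[position] += 1
print(len(windows), windows[:3])
```

A smaller stride gives every scored token more left context at the cost of more forward passes.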

Novel Contributions

  • 4096-vocab model with widened MLP blocks and weight decay 0.090
  • Depth recurrence on layers 4 and 5 to create a deeper virtual network
  • Parallel residuals starting at layer 7 with separate attention and MLP lanes
  • MuonEq-R optimizer variant
  • QK-Gain 5.0
  • Full GPTQ int6 quantization with Brotli and LZMA compressed wrapper
  • Fixed-predictor submission with no eval-time adaptation
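
One way to read the QK-Gain 5.0 item is as a multiplier on query-key similarity before the softmax. The sketch below assumes queries and keys are L2-normalized first, so the logits are cosine similarities scaled by the gain. That normalization, the function names, and the toy vectors are assumptions; the summary only describes the technique as a scaled query-key gain.

```python
import math


def unit(vector, eps=1e-8):
    """Return the vector scaled to (approximately) unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / (norm + eps) for x in vector]


def attention_weights(query, keys, gain=5.0):
    """Softmax over gain-scaled cosine similarities (assumed formulation)."""
    q = unit(query)
    logits = [gain * sum(a * b for a, b in zip(q, unit(k))) for k in keys]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
sharp = attention_weights(query, keys, gain=5.0)
soft = attention_weights(query, keys, gain=1.0)
print(sharp, soft)
```

Under this assumption the logits are bounded by the gain, so a larger gain lets attention concentrate more strongly on the best-matching key.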