PR #1334
RECORD | open
Record: SP4096 + Depth Recurrence + Parallel Residuals + MuonEq-R + QK-Gain 5.0 — val_bpb 1.0897 (3-seed mean)
by aryanbhosale
val_bpb: 1.0897
Architecture: Transformer
Optimizer: MuonEq-R
Artifact Size: ~15.99 MB
Training Techniques
Architecture
MLP4x
4096-vocab model with widened MLP blocks
parameters: {"vocab_size":4096}
depth recurrence
Recurrent reuse of layers to form a deeper virtual network from fewer physical layers
parameters: {"layers":[4,5],"physical_layers":11,"virtual_layers":13}
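
One way to read these parameters: the 11 physical layers are executed in an order that visits layers 4 and 5 twice, which gives 11 + 2 = 13 virtual layers. The sketch below builds such an execution order. The function name `virtual_schedule`, the zero-based indexing, and the choice to repeat the two layers as a block (4, 5, 4, 5) rather than one at a time (4, 4, 5, 5) are assumptions; the source does not state the ordering.

```python
def virtual_schedule(physical_layers, recurrent_layers, repeats=2):
    """Return the order in which physical layer indices are executed.

    Assumption: the recurrent layers form one contiguous block that is
    run `repeats` times in a row before the remaining layers.
    """
    first, last = min(recurrent_layers), max(recurrent_layers)
    block = list(range(first, last + 1))
    schedule = list(range(first))                         # layers before the block
    schedule += block * repeats                           # the block, repeated
    schedule += list(range(last + 1, physical_layers))    # layers after the block
    return schedule


schedule = virtual_schedule(physical_layers=11, recurrent_layers=[4, 5])
print(schedule)       # execution order over physical layer indices
print(len(schedule))  # number of virtual layers
```

Because the repeated layers share weights, the virtual depth grows without adding parameters to the stored artifact.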
parallel residuals
Separate attention and MLP residual lanes with learned merge
parameters: {"start_layer":7}
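
The source describes the parallel lanes only in one line, so the following is a simplified sketch rather than the PR's implementation. It contrasts a standard sequential residual block with a parallel one in which attention and MLP read the same input and their outputs are merged with two scalar weights. The names `sequential_block`, `parallel_block`, `run_block`, `w_attn`, and `w_mlp` are hypothetical, the merge weights would be learned in a real model, and zero-based layer indexing is assumed.

```python
PARALLEL_START_LAYER = 7  # "start_layer" from the parameters above


def sequential_block(x, attn_fn, mlp_fn):
    """Standard residual block: the MLP reads the attention-updated stream."""
    x = [xi + ai for xi, ai in zip(x, attn_fn(x))]
    return [xi + mi for xi, mi in zip(x, mlp_fn(x))]


def parallel_block(x, attn_fn, mlp_fn, w_attn, w_mlp):
    """Parallel block: both branches read the same input, and their outputs
    are merged with two scalars (learned in a real model, fixed here)."""
    a, m = attn_fn(x), mlp_fn(x)
    return [xi + w_attn * ai + w_mlp * mi for xi, ai, mi in zip(x, a, m)]


def run_block(layer_idx, x, attn_fn, mlp_fn, w_attn=1.0, w_mlp=1.0):
    """Use the parallel form from PARALLEL_START_LAYER onward."""
    if layer_idx >= PARALLEL_START_LAYER:
        return parallel_block(x, attn_fn, mlp_fn, w_attn, w_mlp)
    return sequential_block(x, attn_fn, mlp_fn)


# Toy stand-ins for the real attention and MLP sublayers.
def double(v):
    return [2.0 * c for c in v]


def negate(v):
    return [-c for c in v]


early = run_block(3, [1.0, 2.0], double, negate)
late = run_block(7, [1.0, 2.0], double, negate, w_attn=0.5, w_mlp=0.25)
```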
QK-Gain
Scaled query-key gain
parameters: {"gain":5}
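
The source gives only the gain value. A common way to apply such a gain, assumed here, is to normalize the query and key vectors and multiply their dot product by the gain before the softmax, so the gain sets how sharp the attention distribution can become. The helper names and the use of L2 normalization are assumptions, not details taken from the PR.

```python
import math


def l2_normalize(v, eps=1e-8):
    """Scale a vector to (approximately) unit L2 norm."""
    norm = math.sqrt(sum(c * c for c in v)) + eps
    return [c / norm for c in v]


def qk_gain_logit(q, k, gain=5.0):
    """Attention logit for one query/key pair: gain * cosine similarity."""
    q_hat, k_hat = l2_normalize(q), l2_normalize(k)
    return gain * sum(a * b for a, b in zip(q_hat, k_hat))


aligned = qk_gain_logit([3.0, 4.0], [3.0, 4.0])     # same direction
orthogonal = qk_gain_logit([1.0, 0.0], [0.0, 1.0])  # perpendicular
```

With normalized vectors the logit is bounded by the gain in magnitude, so under this assumption a gain of 5 caps each logit at roughly ±5.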
Regularization
weight decay
parameters: {"weight_decay":0.09}
Optimizer
MuonEq-R
weight_decay: null
momentum: null
other_params: {"row_normalized":true}
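
The only MuonEq-R detail recorded here is `row_normalized: true`. The sketch below shows a row normalization step in isolation: each row of a 2-D update matrix is rescaled to unit L2 norm. Whether the PR normalizes the gradient, the momentum buffer, or the final update, and where this sits relative to the Newton-Schulz orthogonalization that Muon uses, is not stated in the source; `row_normalize` and `eps` are hypothetical names.

```python
import math


def row_normalize(matrix, eps=1e-8):
    """Rescale every row of a 2-D list to (approximately) unit L2 norm."""
    out = []
    for row in matrix:
        norm = math.sqrt(sum(v * v for v in row))
        out.append([v / (norm + eps) for v in row])
    return out


update = [[3.0, 4.0], [0.0, 0.5]]   # toy update with rows of very different scale
normalized = row_normalize(update)
row_norms = [math.sqrt(sum(v * v for v in row)) for row in normalized]
```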
Quantization
GPTQ
bits: 6
scope: full model
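
To make the 6-bit budget concrete, the sketch below quantizes one row of weights with plain symmetric round-to-nearest. This is not GPTQ: GPTQ additionally uses second-order statistics from calibration data to compensate the rounding error of each weight, which is omitted here. The per-row scale, the symmetric range [-31, 31], and the function names are assumptions for illustration.

```python
def quantize_row_int6(row):
    """Symmetric round-to-nearest quantization of one row to integer codes."""
    qmax = 31  # largest magnitude used from the signed 6-bit range [-32, 31]
    max_abs = max(abs(v) for v in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in row]
    return codes, scale


def dequantize_row(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]


weights = [0.62, -0.31, 0.05, -0.004]   # toy weight row
codes, scale = quantize_row_int6(weights)
restored = dequantize_row(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

With round-to-nearest the reconstruction error per weight is at most half a quantization step, that is, `scale / 2`.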
Compression
brotli + lzma
level: null
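
The source names Brotli and LZMA but records no compression level and does not say how the two are combined. The sketch below shows only an LZMA round trip with Python's standard `lzma` module; the Brotli stage is left out because it needs a third-party package. The helper names, the toy payload, and `preset=9` are assumptions.

```python
import lzma


def pack(blob: bytes) -> bytes:
    """Compress a byte string with LZMA (xz container) at the highest preset."""
    return lzma.compress(blob, preset=9)


def unpack(packed: bytes) -> bytes:
    """Invert pack()."""
    return lzma.decompress(packed)


payload = bytes(i % 7 for i in range(10_000))  # repetitive toy payload
packed = pack(payload)
print(len(payload), len(packed))               # raw size vs compressed size
```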
Evaluation
sliding window eval
parameters: {"stride":64}
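
A sliding-window evaluation with stride 64 can be sketched as follows: each window advances by 64 tokens, and only the tokens not yet scored by an earlier window contribute to the loss, so every token is scored exactly once with as much left context as the window allows. The window length `context_len` is a hypothetical parameter (the source gives only the stride), and the conversion helper assumes that val_bpb denotes validation bits per byte.

```python
import math


def sliding_windows(num_tokens, context_len, stride=64):
    """Return (start, score_from, end) triples.

    The model sees tokens [start, end); only [score_from, end) is scored.
    """
    windows = []
    prev_end = 0
    start = 0
    while prev_end < num_tokens:
        end = min(start + context_len, num_tokens)
        windows.append((start, prev_end, end))
        prev_end = end
        start += stride
    return windows


def bits_per_byte(total_nll_nats, total_bytes):
    """Convert a summed negative log-likelihood in nats to bits per byte."""
    return total_nll_nats / (math.log(2) * total_bytes)


windows = sliding_windows(num_tokens=300, context_len=128, stride=64)
for start, score_from, end in windows:
    print(f"context [{start}, {end})  scored [{score_from}, {end})")
```

A smaller stride gives each scored token more context on average at the cost of more forward passes.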
Novel Contributions
- 4096-vocab model with widened MLP blocks and weight decay 0.090
- Depth recurrence on layers 4 and 5 to create a deeper virtual network
- Parallel residuals starting at layer 7 with separate attention and MLP lanes
- MuonEq-R optimizer variant
- QK-Gain 5.0
- Full-model GPTQ int6 quantization with a Brotli- and LZMA-compressed wrapper
- Fixed-predictor submission with no eval-time adaptation