PR #1296

open

Record: SP4096 + Depth Recurrence + MuonEq-R + Full GPTQ — val_bpb 1.0926 (3-seed mean)

by aryanbhosale
val_bpb: 1.0926
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.98 MB

Training Techniques

Architecture
  • MLP4x: wider MLP with 4x expansion; parameters: null
  • Depth recurrence: virtual deeper network via recurrent reuse of layers; parameters: {"layers":[4,5],"start_step":3000}
  • LeakyReLU: LeakyReLU squared activation in the MLP; parameters: {"slope":0.5}
  • XSA: XSA attention variant used across all layers; parameters: null
  • Partial RoPE: rotary positional encoding applied to a subset of dimensions; parameters: {"dimensions":16}
  • VE128: VE128 enabled on layers 9-10; parameters: {"layers":[9,10]}
  • U-Net skip connections: sigmoid-gated U-Net style skip connections; parameters: null
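The depth-recurrence entry reuses layers 4 and 5 with the same weights to deepen the network virtually. A minimal sketch of the resulting layer execution order, assuming an 11-layer base model and a single extra pass through the recurrent block (the exact schedule and the `start_step` gating are assumptions, not spelled out in this PR):

```python
def layer_schedule(n_layers, recur_layers, n_extra=1):
    """Execution order under depth recurrence: after the first pass
    through the recurrent block, replay it n_extra times with the
    same weights (virtual depth, zero extra parameters)."""
    order = []
    for i in range(n_layers):
        order.append(i)
        if i == recur_layers[-1]:  # finished the recurrent block
            order.extend(list(recur_layers) * n_extra)
    return order

# Hypothetical 11-layer base; recurring layers 4-5 once yields 13 virtual layers.
print(layer_schedule(11, [4, 5]))  # [0, 1, 2, 3, 4, 5, 4, 5, 6, 7, 8, 9, 10]
```

With an 11-layer base the schedule has 13 entries, consistent with the "virtual 13-layer network" claim below.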
Regularization
  • Weight decay; parameters: {"weight_decay":0.09}
  • LN scale; parameters: null
Other
  • MuonEq-R optimizer variant with row-normalized gradients before Newton-Schulz; parameters: null
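MuonEq-R is described only as "row-normalized gradients before Newton-Schulz". A hedged sketch of that pre-step on top of Muon's standard quintic Newton-Schulz orthogonalization (the coefficients are the ones used by the reference Muon implementation; the MuonEq-R specifics are an assumption):

```python
import numpy as np

def newton_schulz(X, steps=5):
    # Quintic Newton-Schulz iteration (Muon's usual coefficients),
    # driving the singular values of X toward 1.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = X / (np.linalg.norm(X) + 1e-7)  # Frobenius-normalize first
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    return X

def muoneq_r_update(G, steps=5):
    # Assumed MuonEq-R pre-step: scale each gradient row to unit
    # L2 norm *before* orthogonalizing, equalizing per-row magnitude.
    G = G / (np.linalg.norm(G, axis=1, keepdims=True) + 1e-7)
    return newton_schulz(G, steps)
```

The update is then applied with momentum and learning rate as in standard Muon.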
Quantization
  • GPTQ; bits: 6; scope: all
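For orientation, a symmetric 6-bit quantization grid looks like the following. Note this is plain round-to-nearest per output channel; GPTQ proper additionally redistributes rounding error column by column using second-order (Hessian) information, which is omitted here:

```python
import numpy as np

def quantize_int6(W):
    # Symmetric per-output-channel 6-bit grid: integer levels in [-32, 31].
    # Round-to-nearest only; not the full GPTQ error-compensation loop.
    scale = np.abs(W).max(axis=1, keepdims=True) / 31.0
    q = np.clip(np.round(W / scale), -32, 31).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```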
Compression
  • lzma; level: null
  • brotli; level: 11
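The artifact is packed with Brotli at quality 11 plus an LZMA wrapper. A stdlib-only roundtrip sketch using LZMA (Brotli requires the third-party `brotli` package, e.g. `brotli.compress(data, quality=11)`):

```python
import lzma

def pack_artifact(raw: bytes) -> bytes:
    # LZMA at maximum preset; the PR additionally applies Brotli(11).
    return lzma.compress(raw, preset=9 | lzma.PRESET_EXTREME)

def unpack_artifact(blob: bytes) -> bytes:
    return lzma.decompress(blob)
```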
Evaluation
  • Sliding window eval; parameters: {"stride":64}
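Sliding-window evaluation with stride 64 re-feeds overlapping context windows but scores each token exactly once. A minimal sketch of the window bookkeeping (the context length and exact scoring rule are assumptions):

```python
def sliding_windows(n_tokens, ctx, stride=64):
    # Each window re-feeds up to `ctx` tokens of context, but only the
    # tokens not scored by a previous window contribute to the loss,
    # so every token is scored exactly once.
    spans, pos = [], 0
    while pos < n_tokens:
        end = min(pos + (ctx if pos == 0 else stride), n_tokens)
        start = max(0, end - ctx)
        spans.append((start, end, end - pos))  # (start, end, tokens scored)
        pos = end
    return spans
```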
Weight Averaging
  • EMA; parameters: {"decay":0.997}
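EMA weight averaging with decay 0.997 maintains a shadow copy of the parameters that is used at evaluation time. A one-line sketch of the update, applied after each optimizer step (dict-of-floats stand-in for real parameter tensors):

```python
def ema_update(ema, params, decay=0.997):
    # Shadow weights: ema <- decay * ema + (1 - decay) * params.
    return {k: decay * ema[k] + (1.0 - decay) * params[k] for k in ema}
```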

Novel Contributions

  • SP4096 tokenizer with wider MLP and higher weight decay
  • Depth recurrence on layers 4 and 5 to create a virtual 13-layer network without extra parameters
  • MuonEq-R optimizer variant with row-normalized gradients
  • Full GPTQ int6 quantization of all 66 layers
  • Artifact compression using Brotli and an LZMA self-extracting wrapper