PR #1290 (open)

Record: Depth Recurrence + MuonEq-R + AR Self-Gen GPTQ — val_bpb 1.1104 (3-seed mean)

by aryanbhosale
val_bpb: 1.1104
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.97 MB

Training Techniques

Architecture
depth recurrence
Layers 4 and 5 are repeated during the forward pass using the same physical parameters, creating a deeper virtual network (13 virtual layers from 11 physical ones).
parameters: {"layers":[4,5],"virtual_layers":13,"physical_layers":11,"start_step":3000}
XSA
XSA attention is used across all 11 layers.
parameters: {"layers":11}
BigramHash
Hashed bigram embedding features (3072-entry hash table, dimension 112).
parameters: {"vocab_size":3072,"dim":112}
LeakyReLU
Squared LeakyReLU activation in the MLP blocks.
parameters: {"squared":true}
U-Net skip connections
Encoder-decoder style skip connections in the network.
parameters: null
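The depth-recurrence scheme above can be sketched in a few lines; `blocks` and the helper name are illustrative, not the record's actual module names:

```python
def forward_with_recurrence(blocks, x, repeat_layers=(4, 5)):
    """Apply 11 physical blocks; blocks 4 and 5 run twice with the
    same weights, yielding 13 virtual layers at no parameter cost."""
    repeats = set(repeat_layers)
    for i, block in enumerate(blocks):
        x = block(x)
        if i in repeats:
            x = block(x)  # second pass reuses the same parameters
    return x
```

With 11 blocks and two repeated indices, each forward pass executes 13 block applications while the checkpoint still stores only 11 layers' worth of weights.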
Optimizer
Muon
weight_decay: 0.04
momentum: 0.92
other_params: {"row_normalized":true,"newton_schulz_steps":5,"parallel":true}
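A minimal sketch of the row-normalized ("MuonEq-R") preprocessing before Newton-Schulz orthogonalization; the cubic iteration below is a common choice for Muon-style updates, and the record may use a different polynomial:

```python
import numpy as np

def muon_eq_r_update(grad, steps=5, eps=1e-7):
    """Sketch of a MuonEq-R style update: normalize each gradient row
    to unit norm, then orthogonalize with a Newton-Schulz iteration
    (cubic variant; 5 steps as in the record's settings)."""
    g = grad / (np.linalg.norm(grad, axis=1, keepdims=True) + eps)  # row normalization
    x = g / (np.linalg.norm(g) + eps)  # Frobenius scaling so the iteration converges
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x  # drives singular values toward 1
    return x
```

After the iteration, the update direction is approximately semi-orthogonal (all singular values near 1), which is the property Muon relies on.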
Quantization
GPTQ
bits: 6
scope: all
STE QAT
bits: null
scope: final 15% wallclock
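The STE QAT entry can be illustrated with a uniform fake-quantizer (shown at 6 bits to match the GPTQ setting, since the QAT bit-width is unspecified); in a training framework the straight-through estimator would copy gradients unchanged through the rounding:

```python
import numpy as np

def fake_quantize(w, bits=6):
    """Uniform fake-quantization as used in STE-style QAT: the forward
    pass snaps weights to a (2**bits - 1)-step grid between the tensor's
    min and max, while backward would pass gradients straight through."""
    levels = 2 ** bits - 1
    scale = (w.max() - w.min()) / levels
    q = np.round((w - w.min()) / scale)  # integer grid indices
    return q * scale + w.min()           # dequantized values
```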
Weight Averaging
EMA + SWA
parameters: {"ema_decay":0.997}
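The EMA side of the weight averaging, with the record's decay of 0.997, is a one-line update per step; the SWA component (uniform checkpoint averaging) is omitted here:

```python
def ema_update(ema_weights, weights, decay=0.997):
    """One EMA step over a dict of parameter tensors/scalars:
    ema <- decay * ema + (1 - decay) * current."""
    return {k: decay * ema_weights[k] + (1 - decay) * weights[k]
            for k in weights}
```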
Compression
lzma
level: 9
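Artifact compression with LZMA at level 9 maps directly onto Python's standard library (`preset=9` corresponds to level 9):

```python
import lzma

def compress_artifact(raw: bytes) -> bytes:
    """Compress a serialized checkpoint with LZMA at the record's
    level-9 setting."""
    return lzma.compress(raw, preset=9)
```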
Evaluation
sliding window eval
parameters: null
Regularization
LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
magnitude pruning
parameters: {"type":"selective +/-1","criterion":"reconstruction error"}
LR Schedule
warmdown
parameters: {"warmdown_steps":4000}
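A warmdown schedule with 4000 steps can be sketched as a constant learning rate followed by a linear decay to zero over the final steps; `total_steps` and `base_lr` are illustrative inputs, not values from the record:

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_steps=4000):
    """Constant LR, then a linear 'warmdown' to zero over the last
    warmdown_steps of training."""
    steps_left = total_steps - step
    if steps_left >= warmdown_steps:
        return base_lr
    return base_lr * steps_left / warmdown_steps
```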

Novel Contributions

  • Depth recurrence that repeats layers 4 and 5 to create a virtual 13-layer network from an 11-layer parameter budget
  • MuonEq-R: Muon optimization with row-normalized updates applied before Newton-Schulz orthogonalization
  • Autoregressive self-generated full-Hessian GPTQ calibration without external data
  • Combination of depth recurrence, MuonEq-R, and AR self-generated GPTQ achieving 1.1104 val_bpb
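The autoregressive self-generated calibration idea can be sketched as follows; `sample_next` is a hypothetical hook standing in for the model's own sampling loop, and the returned sequences would feed GPTQ's Hessian estimation in place of an external calibration set:

```python
def self_generate_calibration(sample_next, bos_token=0, n_seqs=8, seq_len=32):
    """Sketch of AR self-generated calibration: the model samples its
    own token sequences, which then serve as GPTQ calibration inputs
    (no external data). sample_next(prefix) -> next token id is a
    hypothetical interface, not the record's actual API."""
    seqs = []
    for _ in range(n_seqs):
        seq = [bos_token]
        for _ in range(seq_len - 1):
            seq.append(sample_next(seq))
        seqs.append(seq)
    return seqs
```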