PR #1290 (open)

Record: Depth Recurrence + MuonEq-R + AR Self-Gen GPTQ — val_bpb 1.1104 (3-seed mean)

by aryanbhosale
val_bpb: 1.1104
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.97 MB

Training Techniques

Architecture
depth recurrence
Layers 4 and 5 are repeated during the forward pass using the same physical parameters, creating a deeper virtual network (13 virtual layers from 11 physical ones).
parameters: {"layers":[4,5],"virtual_layers":13,"physical_layers":11,"start_step":3000}
XSA
XSA attention is used across all 11 layers.
parameters: {"layers":11}
BigramHash
Hashed bigram embedding features (3072-entry hash table, dimension 112).
parameters: {"vocab_size":3072,"dim":112}
LeakyReLU
Squared LeakyReLU activation in the MLP blocks.
parameters: {"squared":true}
U-Net skip connections
Encoder-decoder style skip connections in the network.
parameters: null
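The depth-recurrence scheme above can be sketched in a few lines; `blocks` and the helper name are illustrative, not the record's actual module names:

```python
def forward_with_recurrence(blocks, x, repeat_layers=(4, 5)):
    """Apply 11 physical blocks; blocks 4 and 5 run twice with the
    same weights, yielding 13 virtual layers at no parameter cost."""
    repeats = set(repeat_layers)
    for i, block in enumerate(blocks):
        x = block(x)
        if i in repeats:
            x = block(x)  # second pass reuses the same parameters
    return x
```

With 11 blocks and two repeated indices, each forward pass executes 13 block applications while the checkpoint still stores only 11 layers' worth of weights.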
Optimizer
Muon
weight_decay: 0.04
momentum: 0.92
other_params: {"row_normalized":true,"newton_schulz_steps":5,"parallel":true}
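A minimal sketch of the row-normalized ("MuonEq-R") preprocessing before Newton-Schulz orthogonalization; the cubic iteration below is a common choice for Muon-style updates, and the record may use a different polynomial:

```python
import numpy as np

def muon_eq_r_update(grad, steps=5, eps=1e-7):
    """Sketch of a MuonEq-R style update: normalize each gradient row
    to unit norm, then orthogonalize with a Newton-Schulz iteration
    (cubic variant; 5 steps as in the record's settings)."""
    g = grad / (np.linalg.norm(grad, axis=1, keepdims=True) + eps)  # row normalization
    x = g / (np.linalg.norm(g) + eps)  # Frobenius scaling so the iteration converges
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x  # drives singular values toward 1
    return x
```

After the iteration, the update direction is approximately semi-orthogonal (all singular values near 1), which is the property Muon relies on.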
Quantization
GPTQ
bits: 6
scope: all
STE QAT
bits: null
scope: final 15% wallclock
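The STE QAT entry can be illustrated with a uniform fake-quantizer (shown at 6 bits to match the GPTQ setting, since the QAT bit-width is unspecified); in a training framework the straight-through estimator would copy gradients unchanged through the rounding:

```python
import numpy as np

def fake_quantize(w, bits=6):
    """Uniform fake-quantization as used in STE-style QAT: the forward
    pass snaps weights to a (2**bits - 1)-step grid between the tensor's
    min and max, while backward would pass gradients straight through."""
    levels = 2 ** bits - 1
    scale = (w.max() - w.min()) / levels
    q = np.round((w - w.min()) / scale)  # integer grid indices
    return q * scale + w.min()           # dequantized values
```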
Weight Averaging
EMA + SWA
parameters: {"ema_decay":0.997}
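The EMA side of the weight averaging, with the record's decay of 0.997, is a one-line update per step; the SWA component (uniform checkpoint averaging) is omitted here:

```python
def ema_update(ema_weights, weights, decay=0.997):
    """One EMA step over a dict of parameter tensors/scalars:
    ema <- decay * ema + (1 - decay) * current."""
    return {k: decay * ema_weights[k] + (1 - decay) * weights[k]
            for k in weights}
```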
Compression
lzma
level: 9
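Artifact compression with LZMA at level 9 maps directly onto Python's standard library (`preset=9` corresponds to level 9):

```python
import lzma

def compress_artifact(raw: bytes) -> bytes:
    """Compress a serialized checkpoint with LZMA at the record's
    level-9 setting."""
    return lzma.compress(raw, preset=9)
```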
Evaluation
sliding window eval
parameters: null
Regularization
LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
magnitude pruning
parameters: {"type":"selective +/-1","criterion":"reconstruction error"}
LR Schedule
warmdown
parameters: {"warmdown_steps":4000}
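A warmdown schedule with 4000 steps can be sketched as a constant learning rate followed by a linear decay to zero over the final steps; `total_steps` and `base_lr` are illustrative inputs, not values from the record:

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_steps=4000):
    """Constant LR, then a linear 'warmdown' to zero over the last
    warmdown_steps of training."""
    steps_left = total_steps - step
    if steps_left >= warmdown_steps:
        return base_lr
    return base_lr * steps_left / warmdown_steps
```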

Novel Contributions

  • Depth recurrence that repeats layers 4 and 5 to create a virtual 13-layer network from an 11-layer parameter budget
  • MuonEq-R: Muon optimization with row-normalized updates applied before Newton-Schulz orthogonalization
  • Autoregressive self-generated full-Hessian GPTQ calibration without external data
  • Combination of depth recurrence, MuonEq-R, and AR self-generated GPTQ achieving 1.1104 val_bpb
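The autoregressive self-generated calibration idea can be sketched as follows; `sample_next` is a hypothetical hook standing in for the model's own sampling loop, and the returned sequences would feed GPTQ's Hessian estimation in place of an external calibration set:

```python
def self_generate_calibration(sample_next, bos_token=0, n_seqs=8, seq_len=32):
    """Sketch of AR self-generated calibration: the model samples its
    own token sequences, which then serve as GPTQ calibration inputs
    (no external data). sample_next(prefix) -> next token id is a
    hypothetical interface, not the record's actual API."""
    seqs = []
    for _ in range(n_seqs):
        seq = [bos_token]
        for _ in range(seq_len - 1):
            seq.append(sample_next(seq))
        seqs.append(seq)
    return seqs
```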