PR #1296

open

Record: SP4096 + Depth Recurrence + MuonEq-R + Full GPTQ — val_bpb 1.0926 (3-seed mean)

by aryanbhosale
val_bpb: 1.0926
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.98 MB

Training Techniques

Architecture
  • MLP4x: wider MLP with 4x expansion; parameters: null
  • Depth recurrence: virtual deeper network via recurrent reuse of layers; parameters: {"layers":[4,5],"start_step":3000}
  • LeakyReLU: LeakyReLU squared activation in the MLP; parameters: {"slope":0.5}
  • XSA: XSA attention variant used across all layers; parameters: null
  • Partial RoPE: rotary positional encoding applied to a subset of dimensions; parameters: {"dimensions":16}
  • VE128: VE128 enabled on layers 9-10; parameters: {"layers":[9,10]}
  • U-Net skip connections: sigmoid-gated U-Net style skip connections; parameters: null
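The depth-recurrence entry reuses layers 4 and 5 with the same weights to deepen the network virtually. A minimal sketch of the resulting layer execution order, assuming an 11-layer base model and a single extra pass through the recurrent block (the exact schedule and the `start_step` gating are assumptions, not spelled out in this PR):

```python
def layer_schedule(n_layers, recur_layers, n_extra=1):
    """Execution order under depth recurrence: after the first pass
    through the recurrent block, replay it n_extra times with the
    same weights (virtual depth, zero extra parameters)."""
    order = []
    for i in range(n_layers):
        order.append(i)
        if i == recur_layers[-1]:  # finished the recurrent block
            order.extend(list(recur_layers) * n_extra)
    return order

# Hypothetical 11-layer base; recurring layers 4-5 once yields 13 virtual layers.
print(layer_schedule(11, [4, 5]))  # [0, 1, 2, 3, 4, 5, 4, 5, 6, 7, 8, 9, 10]
```

With an 11-layer base the schedule has 13 entries, consistent with the "virtual 13-layer network" claim below.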
Regularization
  • Weight decay; parameters: {"weight_decay":0.09}
  • LN scale; parameters: null
Other
  • MuonEq-R optimizer variant with row-normalized gradients before Newton-Schulz; parameters: null
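MuonEq-R is described only as "row-normalized gradients before Newton-Schulz". A hedged sketch of that pre-step on top of Muon's standard quintic Newton-Schulz orthogonalization (the coefficients are the ones used by the reference Muon implementation; the MuonEq-R specifics are an assumption):

```python
import numpy as np

def newton_schulz(X, steps=5):
    # Quintic Newton-Schulz iteration (Muon's usual coefficients),
    # driving the singular values of X toward 1.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = X / (np.linalg.norm(X) + 1e-7)  # Frobenius-normalize first
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    return X

def muoneq_r_update(G, steps=5):
    # Assumed MuonEq-R pre-step: scale each gradient row to unit
    # L2 norm *before* orthogonalizing, equalizing per-row magnitude.
    G = G / (np.linalg.norm(G, axis=1, keepdims=True) + 1e-7)
    return newton_schulz(G, steps)
```

The update is then applied with momentum and learning rate as in standard Muon.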
Quantization
  • GPTQ; bits: 6; scope: all
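For orientation, a symmetric 6-bit quantization grid looks like the following. Note this is plain round-to-nearest per output channel; GPTQ proper additionally redistributes rounding error column by column using second-order (Hessian) information, which is omitted here:

```python
import numpy as np

def quantize_int6(W):
    # Symmetric per-output-channel 6-bit grid: integer levels in [-32, 31].
    # Round-to-nearest only; not the full GPTQ error-compensation loop.
    scale = np.abs(W).max(axis=1, keepdims=True) / 31.0
    q = np.clip(np.round(W / scale), -32, 31).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```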
Compression
  • lzma; level: null
  • brotli; level: 11
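The artifact is packed with Brotli at quality 11 plus an LZMA wrapper. A stdlib-only roundtrip sketch using LZMA (Brotli requires the third-party `brotli` package, e.g. `brotli.compress(data, quality=11)`):

```python
import lzma

def pack_artifact(raw: bytes) -> bytes:
    # LZMA at maximum preset; the PR additionally applies Brotli(11).
    return lzma.compress(raw, preset=9 | lzma.PRESET_EXTREME)

def unpack_artifact(blob: bytes) -> bytes:
    return lzma.decompress(blob)
```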
Evaluation
  • Sliding window eval; parameters: {"stride":64}
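Sliding-window evaluation with stride 64 re-feeds overlapping context windows but scores each token exactly once. A minimal sketch of the window bookkeeping (the context length and exact scoring rule are assumptions):

```python
def sliding_windows(n_tokens, ctx, stride=64):
    # Each window re-feeds up to `ctx` tokens of context, but only the
    # tokens not scored by a previous window contribute to the loss,
    # so every token is scored exactly once.
    spans, pos = [], 0
    while pos < n_tokens:
        end = min(pos + (ctx if pos == 0 else stride), n_tokens)
        start = max(0, end - ctx)
        spans.append((start, end, end - pos))  # (start, end, tokens scored)
        pos = end
    return spans
```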
Weight Averaging
  • EMA; parameters: {"decay":0.997}
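EMA weight averaging with decay 0.997 maintains a shadow copy of the parameters that is used at evaluation time. A one-line sketch of the update, applied after each optimizer step (dict-of-floats stand-in for real parameter tensors):

```python
def ema_update(ema, params, decay=0.997):
    # Shadow weights: ema <- decay * ema + (1 - decay) * params.
    return {k: decay * ema[k] + (1.0 - decay) * params[k] for k in ema}
```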

Novel Contributions

  • SP4096 tokenizer with wider MLP and higher weight decay
  • Depth recurrence on layers 4 and 5 to create a virtual 13-layer network without extra parameters
  • MuonEq-R optimizer variant with row-normalized gradients
  • Full GPTQ int6 quantization of all 66 layers
  • Artifact compression using Brotli and an LZMA self-extracting wrapper