PR #1260

open

Record: MuonEq-R + Depth Recurrence + Mixed Int5/Int6 GPTQ — val_bpb 1.0929 (3-seed mean)

val_bpb

1.0929

Architecture

Transformer

Optimizer

MuonEq-R

Artifact Size

~15.96 MB

Training Techniques

Optimizer

Muon

weight_decay: null

momentum: null

other_params: {"variant":"MuonEq-R","row_normalization_before_ns":true}
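A minimal sketch of what "row normalization before Newton-Schulz" could look like, written in NumPy. The function names, the `eps` value, the step count, and the use of the quintic coefficients from the public Muon reference code are all assumptions; the PR's actual MuonEq-R implementation may differ.

```python
import numpy as np

def row_normalize(g, eps=1e-7):
    """Scale each row of g to unit L2 norm (the assumed 'R' step)."""
    return g / (np.linalg.norm(g, axis=1, keepdims=True) + eps)

def newton_schulz(g, steps=5, eps=1e-7):
    """Approximately orthogonalize g with a quintic Newton-Schulz iteration."""
    # Coefficients as used in the public Muon reference code (assumed here).
    a, b, c = 3.4445, -4.7750, 2.0315
    # Dividing by the Frobenius norm keeps the spectral norm at or below 1.
    x = g / (np.linalg.norm(g) + eps)
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T  # iterate on the wide orientation so x @ x.T is the small Gram matrix
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * (s @ s)) @ x
    return x.T if transposed else x

def muoneq_r_direction(momentum):
    """Hypothetical helper: row-normalize first, then orthogonalize."""
    return newton_schulz(row_normalize(momentum))

rng = np.random.default_rng(0)
momentum = rng.standard_normal((8, 16))  # stand-in for a momentum buffer
update = muoneq_r_direction(momentum)
singular_values = np.linalg.svd(update, compute_uv=False)
```

After a few iterations the singular values of `update` cluster near 1, which is the sense in which the update is "orthogonalized"; the row normalization only changes what the iteration starts from.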

Architecture

depth recurrence

Repeated layers 4 and 5 with fully shared MLP weights during recurrence.

parameters: {"layers":[4,5],"repeat_count":1,"shared_mlp":true}
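One way to read these parameters is as a layer execution schedule in which the recurrent block runs once more right after its first pass. The sketch below assumes an 11-layer stack (a guess based on the "all-11" label elsewhere in this card) and assumes the block is repeated as a unit; `recurrent_schedule` is a hypothetical helper, not code from the PR.

```python
def recurrent_schedule(num_layers, recur_layers, repeat_count):
    """Return the order in which layer indices are executed."""
    schedule = []
    last = max(recur_layers)
    for i in range(num_layers):
        schedule.append(i)
        if i == last:
            # Re-run the recurrent block; the same indices mean the same weights.
            for _ in range(repeat_count):
                schedule.extend(recur_layers)
    return schedule

# num_layers=11 is an assumption, not stated in the parameters above.
schedule = recurrent_schedule(num_layers=11, recur_layers=[4, 5], repeat_count=1)
print(schedule)  # [0, 1, 2, 3, 4, 5, 4, 5, 6, 7, 8, 9, 10]
```

Because the second pass indexes the same layers, the shared MLP weights add compute depth without adding parameters to the artifact.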

BigramHash

BigramHash token embedding.

parameters: {"shape":"2816x160"}
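A hashed bigram embedding can be sketched as a lookup into a 2816-row, 160-column table, indexed by a hash of the previous and current token IDs. The hash constants, the placeholder predecessor for position 0, and the random table below are illustrative assumptions; the PR does not state its hash function or how the result is combined with the regular token embedding.

```python
import numpy as np

NUM_BUCKETS, EMBED_DIM = 2816, 160  # from the "2816x160" shape above

def bigram_bucket_ids(token_ids, num_buckets=NUM_BUCKETS):
    """Map each (previous, current) token pair to a bucket index."""
    tokens = np.asarray(token_ids, dtype=np.int64)
    # Position 0 has no predecessor; 0 is used as a placeholder (assumption).
    prev = np.concatenate(([0], tokens[:-1]))
    # Hypothetical multiplicative hash; the constants are illustrative only.
    return (prev * 1000003 + tokens * 31) % num_buckets

rng = np.random.default_rng(0)
table = (rng.standard_normal((NUM_BUCKETS, EMBED_DIM)) * 0.02).astype(np.float32)

tokens = [17, 942, 5, 5, 1023]        # made-up token IDs
ids = bigram_bucket_ids(tokens)
bigram_features = table[ids]          # one 160-dim vector per position
```

Different bigrams can collide in the same bucket; the small table trades that ambiguity for a much smaller parameter count than a full vocabulary-squared bigram table.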

XSA

XSA-all-11 attention pattern.

parameters: {"pattern":"all-11"}

MLP4x

4.0x MLP multiplier with sigmoid-gated activation.

parameters: {"multiplier":4}

Quantization

mixed int5/int6 GPTQ

bits: null

scope: all layers
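The mixed-precision idea can be sketched as two steps: rank tensors by a sensitivity score, then give the more sensitive ones 6 bits and the rest 5. The sketch below uses made-up sensitivity values and tensor names, and plain round-to-nearest quantization as a stand-in; it does not implement GPTQ's Hessian-based error compensation, and the 50/50 split is an arbitrary assumption.

```python
import numpy as np

def assign_bits(sensitivity, int6_fraction):
    """Give int6 to the most sensitive tensors and int5 to the rest."""
    ranked = sorted(sensitivity, key=sensitivity.get, reverse=True)
    num_int6 = round(len(ranked) * int6_fraction)
    return {name: (6 if i < num_int6 else 5) for i, name in enumerate(ranked)}

def quantize_rows(w, bits):
    """Symmetric per-row round-to-nearest quantization (not GPTQ itself)."""
    qmax = 2 ** (bits - 1) - 1                      # 15 for int5, 31 for int6
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)        # avoid division by zero
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

# Hypothetical tensor names and sensitivity scores, for illustration only.
sensitivity = {"blocks.0.mlp": 0.9, "blocks.1.mlp": 0.2,
               "blocks.2.mlp": 0.5, "blocks.3.mlp": 0.1}
bits = assign_bits(sensitivity, int6_fraction=0.5)

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 32))
q5, s5 = quantize_rows(w, 5)
q6, s6 = quantize_rows(w, 6)
```

With this scheme the per-element reconstruction error is bounded by half the row scale, so the int6 tensors carry roughly half the worst-case error of the int5 ones at the cost of one extra bit per weight.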

QAT

bits: null

scope: all

Compression

brotli

level: 11

Weight Averaging

EMA

parameters: {"decay":0.997}
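For reference, an exponential moving average with decay 0.997 updates a shadow copy of the weights after each optimizer step. The sketch below uses plain Python floats and a hypothetical `ema_update` helper; when and how often the PR applies the update is not stated here.

```python
def ema_update(ema, params, decay=0.997):
    """Blend the current parameters into the running average."""
    return {k: decay * ema[k] + (1.0 - decay) * params[k] for k in params}

# Toy example: the live weight sits at 1.0 while the average starts at 0.0.
ema = {"w": 0.0}
for _ in range(1000):
    ema = ema_update(ema, {"w": 1.0})
```

A decay of 0.997 corresponds to an averaging horizon of roughly 1 / (1 - 0.997) ≈ 333 steps, so the averaged weights mostly reflect the last few hundred updates.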

Regularization

weight decay

parameters: {"value":0.085}

MuonEq-R optimizer with row normalization before Newton-Schulz orthogonalization
Depth recurrence on layers 4 and 5 with fully shared MLP weights
Mixed int5/int6 GPTQ using Hessian sensitivity ranking
3-seed mean validation score of 1.0929 bpb under the 16 MB limit