PR #1279 (open)

Record: MuonEq-R + Depth Recurrence + N61 Mixed GPTQ — val_bpb 1.0924 (3-seed mean)

by dexhunter
val_bpb
1.0924
Architecture
Transformer
Optimizer
Muon
Artifact Size
~15.98 MB

Training Techniques

Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"variant":"MuonEq-R","row_normalization_before_ns5":true}
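A minimal sketch of what the `row_normalization_before_ns5` flag could mean: each row of the update matrix is normalized to unit norm before the usual 5-step Newton-Schulz orthogonalization. The Newton-Schulz coefficients below are the ones from the public Muon implementation; the `muon_eq_r_update` name and the exact placement of the row normalization are assumptions, not the record's code.

```python
import numpy as np

def newton_schulz5(G, steps=5):
    # Quintic Newton-Schulz iteration approximating orthogonalization,
    # with the coefficients used in the public Muon implementation.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)
    transposed = G.shape[0] > G.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    return X.T if transposed else X

def muon_eq_r_update(grad):
    # Hypothetical MuonEq-R step: normalize each row of the (momentum)
    # gradient matrix to unit norm before the NS5 orthogonalization.
    row_norms = np.linalg.norm(grad, axis=1, keepdims=True)
    return newton_schulz5(grad / (row_norms + 1e-7))
```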
Architecture
depth recurrence
Applies layers 4 and 5 one extra time after the initial pass through the stack, with MLP weights fully shared between passes.
parameters: {"layers":[4,5],"repeat_count":1}
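As a sketch, the recurrence schedule described above amounts to re-invoking the same block modules, so the repeated layers add depth without adding parameters. The function name and the callable-block abstraction are illustrative:

```python
def forward_with_recurrence(blocks, x, recur_layers=(4, 5), repeat_count=1):
    """Depth-recurrence sketch: run every block once, then re-run the
    listed blocks `repeat_count` more times, reusing the same modules
    (so their weights are fully shared with the first pass)."""
    for block in blocks:
        x = block(x)
    for _ in range(repeat_count):
        for i in recur_layers:
            x = blocks[i](x)
    return x
```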
BigramHash
BigramHash token embedding.
parameters: {"dimensions":[2816,160]}
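A hashed bigram embedding with these dimensions could look like the sketch below: the (previous, current) token pair is hashed into a fixed 2816-row table of 160-dim vectors that is added alongside the regular token embedding. The hash function, initialization scale, and names are assumptions for illustration only.

```python
import numpy as np

TABLE_ROWS, EMB_DIM = 2816, 160  # the listed dimensions
bigram_table = np.random.randn(TABLE_ROWS, EMB_DIM).astype(np.float32) * 0.02

def bigram_hash(prev_tok: int, cur_tok: int) -> int:
    # Cheap multiplicative hash of the token pair into the table
    # (hash constants are illustrative).
    return ((prev_tok * 1000003) ^ cur_tok) % TABLE_ROWS

def bigram_embed(tokens):
    # Position 0 has no preceding token, so its bigram embedding is zero.
    out = np.zeros((len(tokens), EMB_DIM), dtype=np.float32)
    for t in range(1, len(tokens)):
        out[t] = bigram_table[bigram_hash(tokens[t - 1], tokens[t])]
    return out
```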
GQA
Uses 8 attention heads with 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
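With 8 query heads over 4 KV heads, each KV head is shared by a group of 2 query heads, roughly halving KV-cache size versus full multi-head attention. A minimal single-sequence sketch (not the record's code; shapes and head dim are illustrative):

```python
import numpy as np

def gqa(q, k, v, heads=8, kv_heads=4):
    """Grouped-query attention sketch: q is (T, heads*hd), k/v are
    (T, kv_heads*hd). KV heads are repeated so consecutive query
    heads share one KV head. Causal masking, single sequence."""
    T, hd = q.shape[0], q.shape[1] // heads
    group = heads // kv_heads
    q = q.reshape(T, heads, hd)
    k = k.reshape(T, kv_heads, hd).repeat(group, axis=1)  # share KV per group
    v = v.reshape(T, kv_heads, hd).repeat(group, axis=1)
    scores = np.einsum("thd,shd->hts", q, k) / np.sqrt(hd)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)      # causal mask
    scores[:, mask] = -1e9
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    out = np.einsum("hts,shd->thd", w, v)
    return out.reshape(T, heads * hd)
```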
Quantization
GPTQ
bits: null
scope: all
mixed int6/int5
bits: null
scope: all
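The mixed bit budget (61 layers at int6, 5 at int5, per the contributions below) can be sketched with plain round-to-nearest quantization as a stand-in; real GPTQ rounds column-by-column with Hessian-based error compensation, which this sketch does not attempt.

```python
import numpy as np

def fake_quant(w, bits):
    """Symmetric per-tensor round-to-nearest sketch. Illustrates the
    bit budget only; not GPTQ's error-compensated rounding."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

# Per-layer bit assignment as described in the record:
bits_per_layer = [6] * 61 + [5] * 5
```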
Weight Averaging
EMA
parameters: {"decay":0.997}
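One EMA step with the listed decay of 0.997 averages the live weights with an effective horizon of about 1/(1-decay) ≈ 333 steps. Plain dicts of floats here for brevity; real code would walk parameter tensors:

```python
def ema_update(ema, params, decay=0.997):
    # Exponential moving average of weights at the listed decay.
    for k in params:
        ema[k] = decay * ema[k] + (1.0 - decay) * params[k]
    return ema
```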
Regularization
weight decay
parameters: {"value":0.085}
Evaluation
sliding window eval
parameters: null
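Since no parameters are recorded, here is a generic sliding-window evaluation sketch: the validation stream is scored in overlapping windows so every token is counted exactly once while keeping at least `window - stride` tokens of left context. Window/stride values and the `logprob_fn` interface (per-token natural-log probs for `ctx[1:]`) are assumptions.

```python
import math

def sliding_window_bpb(logprob_fn, tokens, window=8, stride=4):
    """Score a long token stream in overlapping windows; only tokens
    not covered by the previous window are counted. Returns bits per
    token (bpb up to the bytes-per-token factor)."""
    total_nll, n = 0.0, 0
    start = 0
    while start + 1 < len(tokens):
        end = min(start + window, len(tokens))
        lps = logprob_fn(tokens[start:end])          # len == end - start - 1
        keep = lps if start == 0 else lps[window - stride - 1:]
        total_nll += -sum(keep)
        n += len(keep)
        if end == len(tokens):
            break
        start += stride
    return total_nll / n / math.log(2)               # nats -> bits
```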
Compression
brotli
level: 11

Novel Contributions

  • MuonEq-R optimizer variant with row normalization before Newton-Schulz orthogonalization
  • Depth recurrence on layers 4 and 5 with fully shared MLP weights
  • Mixed GPTQ quantization using 61 int6 layers and 5 int5 layers
  • Smaller self-extracting mini runner to free artifact budget for one additional int6 layer
  • Three-seed verified record submission under the 16MB artifact limit