PR #1901
Record: 0.8335 BPB — DualHash + AdaMuon + MoE + SDClip (3-seed mean)
by Karen042009
val_bpb: 0.8335
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.13 MiB
Training Techniques
Architecture
BigramHash
Dual-token hash skip connection using two hash tables for bigram-style skip features.
parameters: {"tables":2,"table_shape":"2048x16","multipliers":[8191,104729]}
depth recurrence
Recurrent layer structure with a repeated loop over layers and learnable LayerScale coefficients.
parameters: {"pattern":[0,1,2,3,4,5,3,4,5]}
MoE
Hybrid mixture-of-experts combining one always-active shared expert with specialized experts; each token is routed top-1 among the specialized experts, and the shared expert's output is added unconditionally.
parameters: {"shared_experts":1,"specialized_experts":3,"top_k":1}
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"variant":"AdaMuon","rms_preconditioning":true,"riemannian_newton_schulz_orthogonalization":true}
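For orientation, here is the quintic Newton-Schulz orthogonalization used by standard Muon (coefficients from the Muon reference implementation), plus a hypothetical update step showing one way the RMS pre-conditioning could compose with it. The composition order is an assumption, and the Riemannian variant named above is not reproduced here:

```python
import torch

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Quintic Newton-Schulz iteration that pushes the singular values of
    a gradient matrix toward 1 (approximate orthogonalization)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def adamuon_step(p, momentum_buf, v_buf, lr=0.02,
                 beta=0.95, beta2=0.999, eps=1e-8):
    """Hypothetical AdaMuon update: momentum, then orthogonalize, then
    divide elementwise by a running RMS of the orthogonalized update."""
    momentum_buf.mul_(beta).add_(p.grad)
    o = newton_schulz_orthogonalize(momentum_buf)
    v_buf.mul_(beta2).addcmul_(o, o, value=1 - beta2)
    p.data.add_(o / (v_buf.sqrt() + eps), alpha=-lr)
```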
Quantization
int6
bits: 6
scope: artifact export
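A sketch of symmetric INT6 export, assuming per-tensor scaling; the clip threshold is where the SDClip search from the contributions list plugs in:

```python
import torch

def quantize_int6(w, clip=None):
    """Symmetric 6-bit quantization: scale into [-31, 31], round, clamp.
    Dequantize with q.float() * scale."""
    qmax = 2 ** (6 - 1) - 1                      # 31
    max_abs = w.abs().max() if clip is None else clip
    scale = max_abs / qmax
    q = torch.clamp(torch.round(w / scale), -qmax, qmax).to(torch.int8)
    return q, scale
```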
Test-Time Training
score-first TTT
parameters: {"passes":2}
Compression
lzma
level: null
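The final artifact is LZMA-compressed; since no level is recorded, a sketch using Python's stdlib with its default preset:

```python
import lzma
import os

def export_artifact(payload: bytes, path: str) -> int:
    """Write the serialized (quantized) weights through LZMA and return
    the on-disk size, the number that counts toward the ~15.13 MiB artifact."""
    with lzma.open(path, "wb") as f:             # default preset; none recorded
        f.write(payload)
    return os.path.getsize(path)
```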
Regularization
layerwise LN scale
parameters: {"learnable_layerscale":true,"main_branch_init":1,"recurrent_branch_init":0.1}
Novel Contributions
- DualTokenHashSkip with dual hash tables for bigram skip connections
- LayerScale recurrence with a repeated layer loop
- SharedMoE with one shared expert and three specialized experts
- AdaMuon optimizer with RMS pre-conditioning and Newton-Schulz orthogonalization
- Dynamic MSE SDClip search for optimal INT6 export (see the sketch after this list)
- Score-first two-pass test-time training
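Reading "SDClip" as clipping at k standard deviations of the weight tensor, the dynamic MSE search presumably sweeps candidate k values, quantizes with each clip, and keeps the one minimizing round-trip error. That reading, and the search grid, are assumptions; a sketch pairing with quantize_int6 from the Quantization entry:

```python
import torch

def sdclip_search(w, bits=6, k_grid=None):
    """Sweep clip thresholds k * std(w) and return the k (and its MSE)
    minimizing quantization round-trip error at the given bit width."""
    if k_grid is None:
        k_grid = [2.0 + 0.25 * i for i in range(16)]  # 2.0 .. 5.75, assumed grid
    qmax = 2 ** (bits - 1) - 1
    std = w.std().item()
    best_k, best_mse = None, float("inf")
    for k in k_grid:
        scale = k * std / qmax
        q = torch.clamp(torch.round(w / scale), -qmax, qmax)
        mse = ((q * scale - w) ** 2).mean().item()
        if mse < best_mse:
            best_k, best_mse = k, mse
    return best_k, best_mse
```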