PR #1787

RECORD (open)

Record: PR #1736 + Polar Express NS + MIN_LR + Sparse Attn Gate + Fused CE — val_bpb 1.06378

by nprime06
val_bpb: 1.0638
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.94 MB

Training Techniques

Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"MUON_BACKEND_STEPS":5}
Initialization
OrthoInit
Polar Express Newton-Schulz coefficients used to improve the polar factor produced by Muon's zeropower_via_newtonschulz5.
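A minimal NumPy sketch of the quintic Newton-Schulz iteration behind `zeropower_via_newtonschulz5`, which orthogonalizes the update by approximating its polar factor. The `(a, b, c)` coefficients shown are Muon's widely published defaults; the Polar Express variant swaps in per-iteration optimized coefficients, which are not reproduced here.

```python
import numpy as np

def newton_schulz_polar(G, coeffs, eps=1e-7):
    """Approximate the polar factor (orthogonalization) of G via the
    quintic Newton-Schulz iteration X <- a*X + (b*A + c*A@A) @ X with
    A = X @ X^T. `coeffs` is one (a, b, c) tuple per iteration; the
    Polar Express scheme uses per-step optimized coefficients instead
    of repeating one tuple (its actual constants are not shown here)."""
    X = G / (np.linalg.norm(G) + eps)  # scale so singular values are <= 1
    transpose = X.shape[0] > X.shape[1]
    if transpose:                      # work with the smaller Gram matrix
        X = X.T
    for a, b, c in coeffs:
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transpose else X

# Muon's default quintic coefficients, repeated for 5 backend steps
# (matching MUON_BACKEND_STEPS: 5 above)
DEFAULT_COEFFS = [(3.4445, -4.7750, 2.0315)] * 5
```

The iteration is deliberately loose: singular values of the result oscillate near 1 rather than converging tightly, which is known to be sufficient for Muon's update.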
LR Schedule
warmdown
parameters: {"min_lr":0.1}
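A sketch of what a warmdown schedule with a `min_lr` floor looks like: instead of decaying to zero, the LR decays linearly to `min_lr * base_lr` (here 10% of peak, matching `min_lr: 0.1`). The `warmdown_frac` split is an assumed parameter, not taken from the record.

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_frac=0.3, min_lr=0.1):
    """Hold base_lr, then decay linearly to a floor of min_lr * base_lr.
    min_lr is a fraction of base_lr (min_lr=0.1 per the record);
    warmdown_frac is an assumed schedule parameter, not from the record."""
    warmdown_steps = int(total_steps * warmdown_frac)
    hold_steps = total_steps - warmdown_steps
    if step < hold_steps:
        return base_lr
    frac = (total_steps - step) / warmdown_steps  # 1 -> 0 over warmdown
    return base_lr * (min_lr + (1.0 - min_lr) * frac)
```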
Architecture
Gated Attention
Sparse attention head-output gate with narrow gate_window input, replacing dense gated attention while preserving attn_gate_w routing.
parameters: {"gate_window":12,"gate_params_per_layer":96}
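A hedged sketch of how a sparse head-output gate with a narrow input window could work. The shapes are assumptions chosen to match the stated parameters: with `gate_window=12` and a hypothetical 8 attention heads, the gate costs 12 × 8 = 96 parameters per layer, versus `d_model × n_heads` for a dense gate.

```python
import numpy as np

def sparse_head_gate(x, attn_out, W_gate):
    """Per-head output gate fed by a narrow slice of the residual stream.
    x: (T, d_model); attn_out: (T, n_heads, d_head);
    W_gate: (gate_window, n_heads). Only the first gate_window channels
    of x drive the gate, so the gate costs gate_window * n_heads params
    per layer instead of d_model * n_heads. (Which channels feed the
    gate is an assumption; the record only gives the parameter counts.)"""
    gate_in = x[:, : W_gate.shape[0]]                  # (T, gate_window)
    gate = 1.0 / (1.0 + np.exp(-(gate_in @ W_gate)))   # sigmoid, (T, n_heads)
    return attn_out * gate[:, :, None]                 # scale each head
```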
Regularization
logit softcap
parameters: {"training_only":true}
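Logit softcapping squashes logits smoothly into a bounded range rather than clipping them. A minimal sketch, with the cap value assumed (the record does not state it):

```python
import numpy as np

def softcap(logits, cap=15.0):
    """Soft-cap logits into (-cap, cap): z -> cap * tanh(z / cap).
    Near zero this is approximately the identity; large logits saturate
    smoothly. Applied only during training per the record
    (training_only: true); cap=15.0 is an assumed value."""
    return cap * np.tanh(logits / cap)
```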
Other
Fused softcapped cross-entropy Triton kernel for training-time forward/backward efficiency.
parameters: {"training_only":true}
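An unfused NumPy reference for the computation the Triton kernel fuses: softcap the logits, then a numerically stable cross-entropy. Fusing these into one kernel avoids materializing the vocab-sized capped-logit and log-softmax intermediates in global memory between launches; the cap value here is an assumption.

```python
import numpy as np

def softcapped_cross_entropy(logits, targets, cap=15.0):
    """Reference (unfused) softcapped cross-entropy.
    logits: (N, vocab); targets: (N,) integer class ids."""
    z = cap * np.tanh(logits / cap)          # softcap
    z = z - z.max(axis=-1, keepdims=True)    # stabilize log-sum-exp
    logZ = np.log(np.exp(z).sum(axis=-1))
    nll = logZ - z[np.arange(len(targets)), targets]
    return nll.mean()
```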
Test-Time Training
score-first TTT
parameters: {"phased":true,"lora":true}
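The record only flags that the TTT path is phased and LoRA-based, so the following is a loose sketch of the LoRA side alone: a low-rank delta on a frozen base weight, where test-time training would update only the small `A`, `B` factors. All shapes and the `alpha` scale are hypothetical.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=1.0):
    """LoRA-adapted linear layer: y = x W^T + alpha * (x A^T) B^T.
    W: (d_out, d_in) frozen base weight; A: (r, d_in), B: (d_out, r)
    rank-r adapters, the only parameters updated during test-time
    training. 'phased' presumably means TTT runs in scheduled phases
    rather than on every step (not specified in the record)."""
    return x @ W.T + alpha * (x @ A.T) @ B.T
```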
Quantization
int8
bits: 8
scope: gate weights
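A minimal sketch of symmetric per-tensor int8 quantization, as might be applied to the gate weights for the stored artifact (the record gives only bits and scope, so the scheme is an assumption):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization:
    q = round(w / scale), scale = max|w| / 127."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    """Reconstruct float weights; roundtrip error is at most scale / 2."""
    return q.astype(np.float32) * scale
```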

Novel Contributions

  • Polar Express Newton-Schulz coefficients ported from PR #1344
  • MIN_LR=0.10 warmdown floor
  • Sparse attention head-output gate with much smaller parameter footprint
  • Fused softcapped cross-entropy Triton kernel
  • TTT path mirroring fix for sparse gate consistency
  • BOS-fix patch for prepare_caseops_data.py