PR #1787

RECORD (open)

Record: PR #1736 + Polar Express NS + MIN_LR + Sparse Attn Gate + Fused CE — val_bpb 1.06378

by nprime06
val_bpb: 1.0638
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.94 MB

Training Techniques

Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"MUON_BACKEND_STEPS":5}
Initialization
OrthoInit
Polar Express Newton-Schulz coefficients used to improve the polar factor produced by Muon's zeropower_via_newtonschulz5.
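A minimal NumPy sketch of the quintic Newton-Schulz iteration behind `zeropower_via_newtonschulz5`, which orthogonalizes the update by approximating its polar factor. The `(a, b, c)` coefficients shown are Muon's widely published defaults; the Polar Express variant swaps in per-iteration optimized coefficients, which are not reproduced here.

```python
import numpy as np

def newton_schulz_polar(G, coeffs, eps=1e-7):
    """Approximate the polar factor (orthogonalization) of G via the
    quintic Newton-Schulz iteration X <- a*X + (b*A + c*A@A) @ X with
    A = X @ X^T. `coeffs` is one (a, b, c) tuple per iteration; the
    Polar Express scheme uses per-step optimized coefficients instead
    of repeating one tuple (its actual constants are not shown here)."""
    X = G / (np.linalg.norm(G) + eps)  # scale so singular values are <= 1
    transpose = X.shape[0] > X.shape[1]
    if transpose:                      # work with the smaller Gram matrix
        X = X.T
    for a, b, c in coeffs:
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transpose else X

# Muon's default quintic coefficients, repeated for 5 backend steps
# (matching MUON_BACKEND_STEPS: 5 above)
DEFAULT_COEFFS = [(3.4445, -4.7750, 2.0315)] * 5
```

The iteration is deliberately loose: singular values of the result oscillate near 1 rather than converging tightly, which is known to be sufficient for Muon's update.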
LR Schedule
warmdown
parameters: {"min_lr":0.1}
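A sketch of what a warmdown schedule with a `min_lr` floor looks like: instead of decaying to zero, the LR decays linearly to `min_lr * base_lr` (here 10% of peak, matching `min_lr: 0.1`). The `warmdown_frac` split is an assumed parameter, not taken from the record.

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_frac=0.3, min_lr=0.1):
    """Hold base_lr, then decay linearly to a floor of min_lr * base_lr.
    min_lr is a fraction of base_lr (min_lr=0.1 per the record);
    warmdown_frac is an assumed schedule parameter, not from the record."""
    warmdown_steps = int(total_steps * warmdown_frac)
    hold_steps = total_steps - warmdown_steps
    if step < hold_steps:
        return base_lr
    frac = (total_steps - step) / warmdown_steps  # 1 -> 0 over warmdown
    return base_lr * (min_lr + (1.0 - min_lr) * frac)
```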
Architecture
Gated Attention
Sparse attention head-output gate with narrow gate_window input, replacing dense gated attention while preserving attn_gate_w routing.
parameters: {"gate_window":12,"gate_params_per_layer":96}
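A hedged sketch of how a sparse head-output gate with a narrow input window could work. The shapes are assumptions chosen to match the stated parameters: with `gate_window=12` and a hypothetical 8 attention heads, the gate costs 12 × 8 = 96 parameters per layer, versus `d_model × n_heads` for a dense gate.

```python
import numpy as np

def sparse_head_gate(x, attn_out, W_gate):
    """Per-head output gate fed by a narrow slice of the residual stream.
    x: (T, d_model); attn_out: (T, n_heads, d_head);
    W_gate: (gate_window, n_heads). Only the first gate_window channels
    of x drive the gate, so the gate costs gate_window * n_heads params
    per layer instead of d_model * n_heads. (Which channels feed the
    gate is an assumption; the record only gives the parameter counts.)"""
    gate_in = x[:, : W_gate.shape[0]]                  # (T, gate_window)
    gate = 1.0 / (1.0 + np.exp(-(gate_in @ W_gate)))   # sigmoid, (T, n_heads)
    return attn_out * gate[:, :, None]                 # scale each head
```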
Regularization
logit softcap
parameters: {"training_only":true}
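Logit softcapping squashes logits smoothly into a bounded range rather than clipping them. A minimal sketch, with the cap value assumed (the record does not state it):

```python
import numpy as np

def softcap(logits, cap=15.0):
    """Soft-cap logits into (-cap, cap): z -> cap * tanh(z / cap).
    Near zero this is approximately the identity; large logits saturate
    smoothly. Applied only during training per the record
    (training_only: true); cap=15.0 is an assumed value."""
    return cap * np.tanh(logits / cap)
```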
Other
Fused softcapped cross-entropy Triton kernel for training-time forward/backward efficiency.
parameters: {"training_only":true}
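An unfused NumPy reference for the computation the Triton kernel fuses: softcap the logits, then a numerically stable cross-entropy. Fusing these into one kernel avoids materializing the vocab-sized capped-logit and log-softmax intermediates in global memory between launches; the cap value here is an assumption.

```python
import numpy as np

def softcapped_cross_entropy(logits, targets, cap=15.0):
    """Reference (unfused) softcapped cross-entropy.
    logits: (N, vocab); targets: (N,) integer class ids."""
    z = cap * np.tanh(logits / cap)          # softcap
    z = z - z.max(axis=-1, keepdims=True)    # stabilize log-sum-exp
    logZ = np.log(np.exp(z).sum(axis=-1))
    nll = logZ - z[np.arange(len(targets)), targets]
    return nll.mean()
```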
Test-Time Training
score-first TTT
parameters: {"phased":true,"lora":true}
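The record only flags that the TTT path is phased and LoRA-based, so the following is a loose sketch of the LoRA side alone: a low-rank delta on a frozen base weight, where test-time training would update only the small `A`, `B` factors. All shapes and the `alpha` scale are hypothetical.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=1.0):
    """LoRA-adapted linear layer: y = x W^T + alpha * (x A^T) B^T.
    W: (d_out, d_in) frozen base weight; A: (r, d_in), B: (d_out, r)
    rank-r adapters, the only parameters updated during test-time
    training. 'phased' presumably means TTT runs in scheduled phases
    rather than on every step (not specified in the record)."""
    return x @ W.T + alpha * (x @ A.T) @ B.T
```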
Quantization
int8
bits: 8
scope: gate weights
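A minimal sketch of symmetric per-tensor int8 quantization, as might be applied to the gate weights for the stored artifact (the record gives only bits and scope, so the scheme is an assumption):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization:
    q = round(w / scale), scale = max|w| / 127."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    """Reconstruct float weights; roundtrip error is at most scale / 2."""
    return q.astype(np.float32) * scale
```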

Novel Contributions

  • Polar Express Newton-Schulz coefficients ported from PR #1344
  • MIN_LR=0.10 warmdown floor
  • Sparse attention head-output gate with much smaller parameter footprint
  • Fused softcapped cross-entropy Triton kernel
  • TTT path mirroring fix for sparse gate consistency
  • BOS-fix patch for prepare_caseops_data.py