- val_bpb: 1.0577
- Architecture: Transformer
- Optimizer: Muon
- Artifact Size: ~15.06 MB
## Training Techniques

### Regularization

- logit softcap (value: 15)
- weight decay (Muon: 0.012, Adam: 0.012)
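A minimal sketch of logit softcapping with the cap value of 15 from above; the helper name `softcap` is illustrative, not from the submission:

```python
import numpy as np

def softcap(logits: np.ndarray, cap: float = 15.0) -> np.ndarray:
    """Smoothly bound logits to (-cap, cap) via tanh.

    Near zero this is approximately the identity; large logits
    saturate at +/-cap, which regularizes the loss surface.
    """
    return cap * np.tanh(logits / cap)
```

Small logits pass through almost unchanged, while extreme logits are squashed toward the cap.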
### Quantization

- int6 (bits: 6, scope: all)
- late QAT (bits: 6, scope: all)
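One common way to implement the int6 fake-quantization step behind QAT (a sketch, not the submission's actual code): symmetric per-tensor quantization with round-to-nearest. During training the rounding would typically be bypassed in the backward pass with a straight-through estimator.

```python
import numpy as np

def fake_quant_int6(w: np.ndarray) -> np.ndarray:
    """Simulate int6 storage: quantize to 6-bit signed ints, then dequantize.

    Symmetric per-tensor scheme: integer levels in [-31, 31]
    (dropping -32 keeps the grid symmetric around zero).
    """
    scale = np.max(np.abs(w)) / 31.0
    if scale == 0.0:
        return w
    q = np.clip(np.round(w / scale), -31, 31)
    return q * scale
```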
### Compression

- zstd (level: null)
## Optimizer

- Muon: weight_decay 0.012, momentum unspecified (matrix parameters)
- Adam: weight_decay 0.012, momentum unspecified (scalar/embed parameters)
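The Muon/Adam split is usually implemented by partitioning parameters by shape: 2-D weight matrices go to Muon, while embeddings, scalars, and 1-D gains/biases go to Adam. A minimal grouping sketch; the name-based `is_embedding` check is an illustrative convention, not the submission's code:

```python
import numpy as np

def split_param_groups(named_params: dict) -> tuple[list, list]:
    """Partition parameters: 2-D non-embedding matrices -> Muon,
    everything else (embeddings, scalars, vectors) -> Adam."""
    muon, adam = [], []
    for name, p in named_params.items():
        is_embedding = "embed" in name  # illustrative naming convention
        if p.ndim == 2 and not is_embedding:
            muon.append(name)
        else:
            adam.append(name)
    return muon, adam
```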
## Architecture

### Weight tying

Tied input and output embeddings.
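Weight tying reuses the token-embedding matrix as the output projection, a large fraction of the parameter savings in a small model. A sketch (sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model = 50, 16
embed = rng.standard_normal((vocab, d_model)) * 0.02  # the one shared matrix

def embed_tokens(ids: np.ndarray) -> np.ndarray:
    return embed[ids]        # input side: row lookup

def logits(hidden: np.ndarray) -> np.ndarray:
    return hidden @ embed.T  # output side: project with the same matrix

h = embed_tokens(np.array([3, 7]))  # (2, d_model)
out = logits(h)                     # (2, vocab)
```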
### GQA

Grouped query attention with fewer KV heads than attention heads (heads: 8, kv_heads: 4).
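With 8 query heads and 4 KV heads, each KV head is shared by two query heads, and the K/V projections are half-width. A sketch of the forward pass (causal masking omitted for brevity; this is a generic GQA illustration, not the submission's code):

```python
import numpy as np

def gqa(x, wq, wk, wv, n_heads=8, n_kv_heads=4):
    """Grouped-query attention: each KV head serves
    n_heads // n_kv_heads query heads (here 2).

    x: (T, D); wq: (D, D); wk, wv: (D, D // 2) since
    n_kv_heads is half of n_heads. No causal mask, for brevity.
    """
    T, D = x.shape
    hd = D // n_heads
    q = (x @ wq).reshape(T, n_heads, hd)
    k = (x @ wk).reshape(T, n_kv_heads, hd)
    v = (x @ wv).reshape(T, n_kv_heads, hd)
    # Broadcast each KV head across its group of query heads.
    rep = n_heads // n_kv_heads
    k = np.repeat(k, rep, axis=1)
    v = np.repeat(v, rep, axis=1)
    out = np.empty_like(q)
    for h in range(n_heads):
        att = q[:, h] @ k[:, h].T / np.sqrt(hd)
        att = np.exp(att - att.max(axis=-1, keepdims=True))
        att /= att.sum(axis=-1, keepdims=True)
        out[:, h] = att @ v[:, h]
    return out.reshape(T, D)
```

The savings come from the KV cache and the K/V projections, both halved relative to standard multi-head attention.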
### ReLU²

Squared ReLU activation in the MLP.
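Squared ReLU is a drop-in replacement for the usual MLP nonlinearity:

```python
import numpy as np

def relu2(x: np.ndarray) -> np.ndarray:
    """Squared ReLU: zero for negatives, x**2 for positives."""
    return np.maximum(x, 0.0) ** 2

def mlp(x, w_in, w_out):
    """Two-layer MLP block with ReLU² activation."""
    return relu2(x @ w_in) @ w_out
```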
### Depthwise Conv1D

Local token mixing before transformer blocks.
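A depthwise Conv1D mixes each channel independently over a short window of recent tokens. A sketch assuming a causal filter (the kernel size of 3 is illustrative; the submission does not state one):

```python
import numpy as np

def depthwise_conv1d(x: np.ndarray, kernels: np.ndarray) -> np.ndarray:
    """Causal depthwise convolution over the sequence axis.

    x: (T, C) token activations; kernels: (K, C), one length-K
    filter per channel. Position t mixes tokens t-K+1 .. t,
    with zero padding on the left so no future tokens leak in.
    """
    K, C = kernels.shape
    xp = np.concatenate([np.zeros((K - 1, C)), x])  # left-pad: causal
    return sum(kernels[i] * xp[i : i + len(x)] for i in range(K))
```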
### Residual mixing

Learned mixing between the current state and the initial embedding, with per-channel scaling.
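Residual mixing re-injects the original token embedding into the running hidden state with learned per-channel scales. A sketch under that reading; the parameter names `alpha` and `beta` are illustrative:

```python
import numpy as np

def residual_mix(x, x0, alpha, beta):
    """Blend current hidden state x with the initial embedding x0.

    alpha, beta: learned per-channel scales of shape (C,), so each
    channel chooses how much of the original embedding to re-inject.
    """
    return alpha * x + beta * x0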
LR Schedule
warmdown
parameters: {"warmdown_frac":0.35,"wallclock_aligned":true}
Novel Contributions
- P2 loss ((1-p)^2) for difficulty-aware training
- Wallclock-aware LR warmdown aligned to the 10-minute cap
- Residual mixing plus convolutional token mixing
- Muon optimizer for matrix parameters with Adam for scalar/embed parameters
- Compression-aware training with int6 quantization and late QAT