PR #1180

Status: open

SR-CM-P2Loss: 1.0577 bpb (~15.06MB)

by estesryan
val_bpb: 1.0577
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.06 MB

Training Techniques

Regularization
  • logit softcap (value: 15)
  • weight decay (muon: 0.012, adam: 0.012)
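A minimal sketch of the logit softcap above (cap value 15, applied elementwise). The card records only the cap value; the tanh form below is an assumption based on common softcap implementations:

```python
import math

def softcap(logit, cap=15.0):
    """Squash a logit into (-cap, cap): near-identity for small values,
    saturating smoothly toward +/- cap for large ones."""
    return cap * math.tanh(logit / cap)

small = softcap(0.5)    # stays close to 0.5
large = softcap(100.0)  # pinned just under the cap of 15
```

Capping keeps extreme logits from dominating the softmax without the hard clipping that would zero their gradients.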
Quantization
  • int6 (bits: 6, scope: all)
  • late QAT (bits: 6, scope: all)
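A hedged sketch of symmetric int6 quantization over all weights. The card specifies only the bit width and scope; the per-tensor scale and rounding scheme below are assumptions:

```python
def quantize_int6(weights):
    """Symmetric per-tensor quantization to 6 bits: integer range [-31, 31]."""
    qmax = 2 ** (6 - 1) - 1  # 31 levels on each side of zero
    scale = max(abs(w) for w in weights) / qmax
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    deq = [v * scale for v in q]  # dequantized values seen by the forward pass
    return q, deq, scale

q, deq, scale = quantize_int6([0.5, -1.0, 0.25, 0.0])
```

Late QAT in this setup would run the final stretch of training against the dequantized values so the network adapts to the 6-bit grid before the artifact is packed and zstd-compressed.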
Compression
  • zstd (level: null)
Optimizer
  • Muon (weight_decay: 0.012, momentum: null, other_params: null)
  • Adam (weight_decay: 0.012, momentum: null, role: scalar/embed params)
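The Muon/Adam split can be sketched as a parameter-grouping rule. The card says only that Muon takes matrix parameters and Adam takes scalar/embedding parameters; the name-based routing below is an assumed heuristic, not the run's actual code:

```python
def split_param_groups(named_shapes):
    """Route >=2-D matrix parameters to Muon; scalars, vectors (norm gains,
    biases), and embedding tables to Adam. Both groups share weight_decay=0.012."""
    muon, adam = [], []
    for name, shape in named_shapes:
        # Embedding tables are 2-D but stay with Adam in this sketch.
        if len(shape) >= 2 and "embed" not in name:
            muon.append(name)
        else:
            adam.append(name)
    return muon, adam

params = [
    ("embed.weight", (50257, 768)),        # embedding table -> Adam
    ("blocks.0.attn.w_qkv", (768, 2304)),  # weight matrix   -> Muon
    ("blocks.0.norm.gain", (768,)),        # vector          -> Adam
]
muon_names, adam_names = split_param_groups(params)
```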
Architecture
  • weight tying: tied input and output embeddings
  • GQA (heads: 8, kv_heads: 4): grouped-query attention with fewer KV heads than attention heads
  • ReLU²: squared activation in the MLP
  • depthwise Conv1D: local token mixing before the transformer blocks
  • residual mixing: learned mixing between the current state and the initial embedding, with per-channel scaling
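The GQA head sharing (heads: 8, kv_heads: 4) means each KV head serves heads // kv_heads = 2 query heads. A minimal numpy sketch, with illustrative tensor shapes:

```python
import numpy as np

def expand_kv_heads(kv, n_heads=8, n_kv_heads=4):
    """Repeat each KV head so every query head has a matching K/V slice (GQA)."""
    group = n_heads // n_kv_heads  # 2 query heads share one KV head here
    return np.repeat(kv, group, axis=0)

kv = np.random.randn(4, 16)      # (kv_heads, head_dim) toy projection
kv_full = expand_kv_heads(kv)    # (8, 16): one K/V slice per query head
```

Halving the KV heads halves the KV-cache and the K/V projection weights, which matters for a ~15 MB artifact budget.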
LR Schedule
  • warmdown (warmdown_frac: 0.35, wallclock_aligned: true)
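The wallclock-aligned warmdown can be sketched as below. Only warmdown_frac=0.35 is taken from the card; the flat-then-linear-to-zero shape and the 600 s budget (the 10-minute cap) are assumptions:

```python
def warmdown_lr(base_lr, elapsed_s, budget_s=600.0, warmdown_frac=0.35):
    """Hold base_lr until the last warmdown_frac of the wallclock budget,
    then decay linearly to zero at the budget boundary."""
    start = budget_s * (1.0 - warmdown_frac)  # warmdown begins at 390 s
    if elapsed_s <= start:
        return base_lr
    remaining = max(0.0, budget_s - elapsed_s)
    return base_lr * remaining / (budget_s - start)
```

Keying the schedule to elapsed wallclock rather than step count means the decay always finishes exactly at the cap, regardless of per-step throughput.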

Novel Contributions

  • P2 loss ((1-p)^2) for difficulty-aware training
  • Wallclock-aware LR warmdown aligned to the 10-minute cap
  • Residual mixing plus convolutional token mixing
  • Muon optimizer for matrix parameters with Adam for scalar/embed parameters
  • Compression-aware training with int6 quantization and late QAT
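The first bullet, P2 loss, can be sketched per token as follows. Applying (1 - p)^2 to the softmax probability of the correct token is the reading implied by the bullet; any further training-time details (e.g. mixing with cross-entropy) are not specified on this card:

```python
import math

def p2_loss(logits, target):
    """P2 loss for one token: (1 - p_target)^2, where p_target is the
    softmax probability of the correct class. Unlike -log p it is bounded
    in [0, 1] and vanishes on tokens the model already predicts confidently."""
    m = max(logits)                         # subtract max for numerical stability
    z = [math.exp(v - m) for v in logits]
    p = z[target] / sum(z)
    return (1.0 - p) ** 2

easy = p2_loss([10.0, 0.0, 0.0], 0)  # confident and correct -> near 0
hard = p2_loss([0.0, 0.0, 0.0], 0)   # uniform over 3 classes -> (2/3)^2
```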