PR #1496

open

Restore non-record submission: 2026-04-08 Vocab1792 FlashMuon LinearScaleInit XSA5LastGated RReLU2 Int6AWQ

by shram86
val_bpb
1.1920
Architecture
Transformer
Optimizer
Muon
Artifact Size
under 16 MB

Training Techniques

Architecture
XSA
Enabled XSA on the last 5 layers, with only the final XSA layer gated.
parameters: {"layers":5,"last_gated":true}
ReLU²
Used RReLU2 as the MLP activation.
parameters: null
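The RReLU2 activation named above is most plausibly squared ReLU (matching the ReLU² heading); a minimal NumPy sketch under that reading:

```python
import numpy as np

def relu2(x: np.ndarray) -> np.ndarray:
    # Squared ReLU: max(x, 0) ** 2, used in place of GELU in the MLP.
    return np.square(np.maximum(x, 0.0))
```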
Optimizer
Muon
weight_decay: 0.01
momentum: null
other_params: null
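At Muon's core is an approximate orthogonalization of each 2D gradient via a Newton-Schulz iteration; a NumPy sketch using the widely circulated quintic coefficients (the submission only pins weight_decay=0.01, so everything else here is illustrative):

```python
import numpy as np

def newton_schulz_orthogonalize(g: np.ndarray, steps: int = 5) -> np.ndarray:
    # Quintic Newton-Schulz iteration that drives singular values toward 1,
    # as used inside the Muon optimizer update.
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)  # normalize so the iteration converges
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T
    for _ in range(steps):
        A = x @ x.T
        B = b * A + c * (A @ A)
        x = a * x + B @ x
    return x.T if transposed else x
```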
Quantization
int6_awq
bits: 6
scope: weights
Compression
lzma
level: null
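The int6 quantization plus lzma pipeline can be sketched as below. This shows plain symmetric per-row 6-bit quantization; AWQ proper additionally rescales weights using activation statistics (the val_tail calibration mentioned later), which is omitted here. lzma uses the default preset since the submission reports level: null.

```python
import lzma
import numpy as np

def quantize_int6(w: np.ndarray):
    # Symmetric per-row quantization into the 6-bit range [-31, 31].
    # AWQ's activation-aware scale search is omitted in this sketch.
    qmax = 2 ** 5 - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0.0, 1.0, scale)
    q = np.clip(np.rint(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale.astype(np.float32)

def pack_artifact(q: np.ndarray, scale: np.ndarray) -> bytes:
    # int6 values stored one per byte; lzma squeezes out the slack bits.
    return lzma.compress(q.tobytes() + scale.tobytes())
```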
Evaluation
sliding window eval
parameters: {"stride":64}
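Sliding-window eval with stride 64 scores each token with near-full left context by advancing the window in small steps and only scoring the newly exposed tokens. A sketch of the span bookkeeping (the window size here is assumed to match the 2048 training length):

```python
def sliding_window_spans(n_tokens: int, window: int = 2048, stride: int = 64):
    # Each span is (ctx_start, score_start, end): tokens [score_start, end)
    # are scored using left context from ctx_start; windows advance by stride.
    spans = []
    prev_end = 0
    for ctx_start in range(0, n_tokens, stride):
        end = min(ctx_start + window, n_tokens)
        spans.append((ctx_start, prev_end, end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

Every token is scored exactly once, so per-token losses sum directly into val_bpb.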
Weight Averaging
EMA
parameters: {"start":"late"}
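A sketch of the late-start EMA: the shadow weights only begin accumulating past a chosen fraction of training. The submission gives only start: "late", so the decay and start fraction below are illustrative placeholders:

```python
import numpy as np

class LateStartEMA:
    # EMA of parameters that activates only late in training.
    # decay and start_frac are assumed knobs, not values from the submission.
    def __init__(self, decay: float = 0.999, start_frac: float = 0.9,
                 total_steps: int = 1000):
        self.decay = decay
        self.start_step = int(start_frac * total_steps)
        self.shadow = None

    def update(self, params, step: int):
        if step < self.start_step:
            return  # EMA not yet started
        if self.shadow is None:
            self.shadow = [p.copy() for p in params]
        else:
            for s, p in zip(self.shadow, params):
                s *= self.decay
                s += (1.0 - self.decay) * p
```

Post-training, the EMA shadow and the raw weights can both be evaluated and the better candidate kept, matching the "post-train candidate selection" noted below.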
Initialization
linear phase initialization
Used simple linear phase initialization for residual / phase setup.
depth-aware constant scale init
Initialized later layers with stronger constant attn_scale and mlp_scale values.
parameters: {"attn_scale":{"early":1,"mid":1.75,"late":2.5},"mlp_scale":{"early":1,"mid":1.15,"late":1.3}}
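The early/mid/late constants above can be mapped to per-layer scales as sketched below; the submission does not state the bucket boundaries, so splitting the stack into thirds is an assumption:

```python
def depth_scales(n_layers: int):
    # Constants taken from the submission's parameters; bucket boundaries
    # (equal thirds of the stack) are an assumed convention.
    attn = {"early": 1.0, "mid": 1.75, "late": 2.5}
    mlp = {"early": 1.0, "mid": 1.15, "late": 1.3}
    scales = []
    for i in range(n_layers):
        if i < n_layers / 3:
            bucket = "early"
        elif i < 2 * n_layers / 3:
            bucket = "mid"
        else:
            bucket = "late"
        scales.append((attn[bucket], mlp[bucket]))
    return scales
```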
LR Schedule
warmdown
parameters: {"start_progress":0.75,"progress_based":true}
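The progress-based warmdown above holds the learning rate constant until 75% of training, then decays it; a linear decay to zero is the usual shape and is assumed here:

```python
def warmdown_lr(progress: float, base_lr: float,
                start_progress: float = 0.75) -> float:
    # progress in [0, 1]: constant LR before start_progress,
    # then linear decay to zero at progress 1.0 (decay shape assumed).
    if progress < start_progress:
        return base_lr
    return base_lr * (1.0 - progress) / (1.0 - start_progress)
```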
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Regularization
weight decay
parameters: {"value":0.01}

Novel Contributions

  • XSA on the last 5 layers with only the final XSA layer gated
  • RReLU2 MLP activation
  • int6 AWQ quantization with lzma compression
  • val_tail calibration for quantization
  • late-start EMA with post-train candidate selection
  • depth-aware constant initialization for attn_scale and mlp_scale
  • progress-based warmdown starting around 75% of training