PR #1447

open

Add non-record 16MB submission: FlashMuon LinearScaleInit XSA5LastGated RReLU2 Int6AWQ

by shram86
val_bpb
1.1834
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,361,671 bytes

Training Techniques

Architecture
XSA
XSA enabled on the last 5 layers, with only the final XSA layer gated
parameters: {"layers":5,"gated_last_layer":true}
ReLU²
RReLU2 MLP activation
parameters: null
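A minimal sketch, assuming RReLU2 denotes the squared-ReLU activation relu(x)² in the MLP hidden layer; dimensions are illustrative:

```python
import torch
import torch.nn as nn

class ReLU2MLP(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden)
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Squared ReLU: zero for x <= 0, x**2 otherwise.
        return self.fc2(torch.relu(self.fc1(x)).square())

mlp = ReLU2MLP(dim=256, hidden=1024)
y = mlp(torch.randn(2, 8, 256))
```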
Initialization
linear scale init
linear-by-depth initialization for attn_scale and mlp_scale
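A minimal sketch of the initialization; the ramp endpoints (1.0 → 0.1) are assumptions, as the card only states that the scales are linear in depth:

```python
import torch
import torch.nn as nn

def linear_scale_init(n_layers: int, start: float = 1.0, end: float = 0.1):
    # One learned scalar per layer for each branch, initialized on a linear ramp.
    ramp = torch.linspace(start, end, n_layers)
    attn_scale = nn.ParameterList(nn.Parameter(ramp[i].clone()) for i in range(n_layers))
    mlp_scale = nn.ParameterList(nn.Parameter(ramp[i].clone()) for i in range(n_layers))
    return attn_scale, mlp_scale

attn_scale, mlp_scale = linear_scale_init(n_layers=10)
# In block i: x = x + attn_scale[i] * attn(x); x = x + mlp_scale[i] * mlp(x)
```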
Quantization
int6
bits: 6
scope: AWQ
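A minimal AWQ-flavored int6 sketch; alpha = 0.5 and per-output-channel weight scales are assumptions, and the calibration activations would presumably come from the validation tail (see "val-tail calibration" under Novel Contributions):

```python
import torch

def awq_int6(weight: torch.Tensor, calib_acts: torch.Tensor, alpha: float = 0.5):
    # weight: (out, in); calib_acts: (n_samples, in) from a calibration pass.
    act_mag = calib_acts.abs().mean(dim=0).clamp(min=1e-8)  # per input channel
    s = act_mag.pow(alpha)                                  # activation-aware channel scales
    w = weight * s                                          # fold saliency into the weights
    qmax = 2 ** (6 - 1) - 1                                 # symmetric int6: [-31, 31]
    wscale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / wscale), -qmax, qmax).to(torch.int8)
    return q, wscale, s  # runtime: y = (x / s) @ (q.float() * wscale).T
```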
Compression
lzma
level: null
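A minimal sketch of the export step using Python's built-in lzma; the preset is an assumption, since the card lists level: null:

```python
import io
import lzma
import torch

def export_lzma(state_dict: dict, path: str, preset: int = 9):
    # Serialize the (quantized) state dict, then xz-compress the bytes.
    buf = io.BytesIO()
    torch.save(state_dict, buf)
    with open(path, "wb") as f:
        f.write(lzma.compress(buf.getvalue(), preset=preset | lzma.PRESET_EXTREME))

def load_lzma(path: str) -> dict:
    with open(path, "rb") as f:
        return torch.load(io.BytesIO(lzma.decompress(f.read())))
```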
Weight Averaging
EMA
parameters: {"start":"late"}
Evaluation
sliding window eval
parameters: {"stride":64}
Sequence Length
sequence_length
train_length: 1024
eval_length: null

Novel Contributions

  • 10-layer SP-1024 model
  • XSA enabled on the last 5 layers with only the final layer gated
  • linear-by-depth initialization for attn_scale and mlp_scale
  • RReLU2 MLP
  • int6 AWQ quantization with lzma export
  • val-tail calibration
  • late EMA with post-train candidate selection
  • stride-64 sliding-window evaluation