PR #1447 (open)
Add non-record 16MB submission: FlashMuon LinearScaleInit XSA5LastGated RReLU2 Int6AWQ
by shram86
val_bpb
1.1834
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,361,671 bytes
Training Techniques
Architecture
XSA
XSA enabled on the last 5 layers, with only the final XSA layer gated
parameters: {"layers":5,"gated_last_layer":true}
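XSA itself is not defined in this card, so the sketch below only illustrates the gating part: mixing a branch's output into the residual stream through a learnable scalar gate. The gate form (`tanh` of a scalar) and all names are assumptions, not the submission's implementation.

```python
import numpy as np

def gated_residual(x, branch_out, gate):
    """Hypothetical residual gating: the gated branch (here standing in
    for the final XSA layer) is scaled by tanh(gate) before being added
    back to the residual stream. At gate=0 the branch contributes nothing,
    so the layer starts as an identity and can be learned open."""
    return x + np.tanh(gate) * branch_out

x = np.ones(4)
branch = np.full(4, 2.0)
print(gated_residual(x, branch, gate=0.0))  # gate 0: output equals x
```

Initializing the gate at zero is a common reason to gate only a newly added layer; whether this submission does so is not stated.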
ReLU²
RReLU2 MLP activation
parameters: null
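ReLU² squares the output of a standard ReLU. A minimal NumPy sketch (the exact "RReLU2" variant used in the MLP is not spelled out in this card):

```python
import numpy as np

def relu2(x):
    """ReLU-squared activation: max(x, 0) ** 2.
    Zero for negative inputs, quadratic growth for positive ones."""
    return np.maximum(x, 0.0) ** 2

# Negative inputs map to 0; positive inputs are squared.
print(relu2(np.array([-2.0, -0.5, 0.0, 1.0, 3.0])))  # [0. 0. 0. 1. 9.]
```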
Initialization
linear scale init
linear-by-depth initialization for attn_scale and mlp_scale
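A linear-by-depth schedule assigns each layer's scale by linear interpolation over layer index. The endpoints below (0 at the first layer, 1 at the last) are illustrative; the submission does not state its endpoints.

```python
def linear_depth_scales(num_layers, lo=0.0, hi=1.0):
    """Hypothetical linear-by-depth initialization: layer l gets a scale
    interpolated linearly from lo (first layer) to hi (last layer).
    Applied here to both attn_scale and mlp_scale."""
    if num_layers == 1:
        return [hi]
    return [lo + (hi - lo) * l / (num_layers - 1) for l in range(num_layers)]

attn_scale = linear_depth_scales(10)  # 10-layer model, as in this submission
mlp_scale = linear_depth_scales(10)
print(attn_scale)
```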
Quantization
int6
bits: 6
scope: AWQ
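The AWQ idea is to rescale weight channels by activation magnitude before quantizing, so channels that see large activations lose less precision. A minimal per-channel sketch with symmetric 6-bit rounding; the `alpha` exponent, grouping, and calibration here are illustrative, not the submission's recipe.

```python
import numpy as np

def awq_int6(weight, act_scale, alpha=0.5):
    """Sketch of activation-aware int6 quantization.
    weight: (out, in); act_scale: per-input-channel activation magnitude.
    Salient channels are up-scaled before rounding (the AWQ idea)."""
    s = act_scale ** alpha                        # per-channel smoothing factor
    w = weight * s                                # fold scale into the weights
    qmax = 2 ** (6 - 1) - 1                       # symmetric int6 range: [-32, 31]
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale, s

def dequant(q, scale, s):
    """Approximate float reconstruction: undo the row scale and channel scale."""
    return q * scale / s

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))
a = np.abs(rng.normal(size=8)) + 0.1              # stand-in activation stats
q, scale, s = awq_int6(W, a)
W_hat = dequant(q, scale, s)
print(np.abs(W - W_hat).max())                    # small reconstruction error
```

The card's "val-tail calibration" contribution suggests the activation statistics come from the tail of the validation stream; the stand-in `a` above is random.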
Compression
lzma
level: null
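The exported artifact is lzma-compressed. A round-trip sketch using Python's stdlib `lzma`; the compression level is unspecified in the card (null), so the default preset is used here.

```python
import lzma
import numpy as np

# Illustrative artifact export: serialize int6-quantized weights (stored in
# int8 containers here) and compress the byte stream with lzma.
q = np.arange(-32, 32, dtype=np.int8)      # full int6 value range, as int8
blob = lzma.compress(q.tobytes())          # default preset
restored = np.frombuffer(lzma.decompress(blob), dtype=np.int8)
print(len(blob))
```

A real exporter would likely bit-pack four int6 values into three bytes before compressing; that packing is omitted here.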
Weight Averaging
EMA
parameters: {"start":"late"}
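"Late" EMA means the exponential moving average of the weights only starts accumulating near the end of training. A plain-Python sketch; the decay value and start point are illustrative, and the optimizer step is a stand-in.

```python
def ema_update(ema, params, decay=0.999):
    """One EMA step per tensor: ema <- decay * ema + (1 - decay) * params."""
    return [decay * e + (1 - decay) * p for e, p in zip(ema, params)]

total_steps, ema_start = 1000, 800   # illustrative late-start point
params = [0.0]
ema = None
for step in range(total_steps):
    params = [p + 0.01 for p in params]  # stand-in for an optimizer update
    if step >= ema_start:                # "start: late" — average only the tail
        ema = list(params) if ema is None else ema_update(ema, params)
print(params[0], ema[0])                 # EMA lags slightly behind raw params
```

The card's "post-train candidate selection" suggests the raw and EMA weights are both evaluated afterwards and the better candidate is kept; that selection loop is not shown.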
Evaluation
sliding window eval
parameters: {"stride":64}
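Stride-64 sliding-window evaluation advances the context window 64 tokens at a time and scores only the newest 64 tokens of each window, so every token is scored once with (nearly) a full window of left context. The window size below assumes eval uses the 1024-token train length, since `eval_length` is null in the card.

```python
def sliding_window_spans(n_tokens, window=1024, stride=64):
    """Spans for sliding-window eval: each step advances by `stride`;
    only tokens in [begin, end) are scored, with context from ctx_start."""
    spans = []
    for begin in range(0, n_tokens, stride):
        end = min(begin + stride, n_tokens)
        ctx_start = max(0, end - window)
        spans.append((ctx_start, begin, end))  # (context start, first scored, end)
    return spans

spans = sliding_window_spans(2048)
print(len(spans), spans[0], spans[-1])
```

Each span would be one forward pass, with the loss masked to the scored range; summing masked losses over all spans gives the val_bpb numerator.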
Sequence Length
sequence_length
train_length: 1024
eval_length: null
Novel Contributions
- 10-layer SP-1024 model
- XSA enabled on the last 5 layers with only the final layer gated
- linear-by-depth initialization for attn_scale and mlp_scale
- RReLU2 MLP
- int6 AWQ quantization with lzma export
- val-tail calibration
- late EMA with post-train candidate selection
- stride-64 sliding-window evaluation