PR #1474

open

Add non-record 16MB submission: Vocabulary1792 FlashMuon LinearScaleInit XSA5LastGated RReLU2 Int6AWQ MixedBits

by shram86
val_bpb
1.1434
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,859,310 bytes

Training Techniques

Architecture
XSA
Enabled XSA on the last 5 layers, with only the final XSA layer gated.
parameters: {"layers":5,"gated_layers":1}
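XSA itself is not defined in this PR, so the helper below is only a hypothetical sketch of how the `{"layers":5,"gated_layers":1}` layout might be computed: the last 5 blocks enable XSA, and only the final one of those carries a learned output gate.

```python
def xsa_layout(n_layers, xsa_layers=5, gated_layers=1):
    # Hypothetical helper: mark which decoder blocks enable XSA
    # (the last `xsa_layers`) and which of those carry a learned
    # output gate (only the last `gated_layers`, per the PR).
    enable = [i >= n_layers - xsa_layers for i in range(n_layers)]
    gated = [enable[i] and i >= n_layers - gated_layers for i in range(n_layers)]
    return enable, gated
```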
ReLU²
Used RReLU2 as the MLP activation.
parameters: null
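Assuming RReLU2 denotes the squared ReLU (ReLU(x)²) often used as a transformer MLP activation, a one-line sketch:

```python
def rrelu2(x):
    # Squared ReLU: zero for negative inputs, x**2 otherwise
    # (assumption: RReLU2 here means ReLU(x)**2).
    return max(x, 0.0) ** 2
```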
Optimizer
Muon
weight_decay: 0.01
momentum: null
other_params: null
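Muon's core step orthogonalizes the momentum matrix via a quintic Newton-Schulz iteration. A NumPy sketch with the coefficients from the public Muon implementation (momentum accumulation and weight decay omitted):

```python
import numpy as np

def newton_schulz_orth(G, steps=5, eps=1e-7):
    # Approximately orthogonalize G: the quintic polynomial drives all
    # singular values toward 1 (coefficients from the public Muon code).
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)  # Frobenius-normalize first
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X
```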
Quantization
mixed int6/int8
bits: 6
scope: most tensors at int6, with selected sensitive tensors promoted to int8
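A pure-Python sketch of symmetric per-tensor quantization with a per-tensor bit width; the names in `SENSITIVE` are hypothetical stand-ins for whichever tensors were promoted to int8:

```python
def quantize_symmetric(w, bits):
    # Symmetric per-tensor quantization to signed `bits`-bit integers.
    qmax = 2 ** (bits - 1) - 1
    m = max((abs(x) for x in w), default=0.0)
    scale = m / qmax if m > 0 else 1.0
    q = [max(-qmax - 1, min(qmax, round(x / scale))) for x in w]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

SENSITIVE = {"tok_emb", "lm_head"}  # hypothetical names of int8-promoted tensors

def bits_for(name):
    return 8 if name in SENSITIVE else 6
```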
Compression
lzma
level: null
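The serialized weight blob is LZMA-compressed into the final artifact; a stdlib sketch (the payload below is a placeholder, and the preset is an assumption since the level is unspecified):

```python
import lzma

blob = bytes(100_000)  # placeholder for the packed int6/int8 weight bytes
packed = lzma.compress(blob, preset=9 | lzma.PRESET_EXTREME)  # assumed preset
restored = lzma.decompress(packed)
```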
Weight Averaging
EMA
parameters: {"late_start":true}
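A sketch of late-start EMA; `decay` and `start_frac` are illustrative, since the PR records only `{"late_start": true}`:

```python
def ema_update(ema, params, decay):
    # Standard exponential moving average of parameters.
    return [decay * e + (1 - decay) * p for e, p in zip(ema, params)]

def maybe_ema(step, total_steps, ema, params, decay=0.999, start_frac=0.8):
    # Late start: before `start_frac` of training, just mirror the raw
    # params; only afterwards begin accumulating the EMA.
    if step < int(start_frac * total_steps):
        return list(params)
    return ema_update(ema, params, decay)
```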
Evaluation
sliding window eval
parameters: {"stride":64}
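Sliding-window evaluation scores each block of `stride` tokens while conditioning on up to a full window of left context; a sketch of the span bookkeeping (window length assumed equal to the 2048 training length):

```python
def sliding_eval_spans(n_tokens, window=2048, stride=64):
    # Returns (ctx_start, score_start, score_end) triples: each step scores
    # `stride` new tokens, conditioning on up to `window` tokens of context.
    spans = []
    score_start = 0
    while score_start < n_tokens:
        score_end = min(score_start + stride, n_tokens)
        ctx_start = max(0, score_end - window)
        spans.append((ctx_start, score_start, score_end))
        score_start = score_end
    return spans
```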
Initialization
linear scale init
Depth-aware constant initialization for attn_scale and mlp_scale, with stronger scales in later layers.
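A sketch of depth-aware linear interpolation for the per-layer constants; the endpoints `lo`/`hi` are hypothetical, as the PR states only that later layers get stronger scales:

```python
def linear_scale_init(n_layers, lo=0.5, hi=1.5):
    # Linearly interpolate a constant per-layer scale (for attn_scale or
    # mlp_scale) from `lo` at the first layer to `hi` at the last
    # (endpoint values are illustrative, not from the PR).
    if n_layers == 1:
        return [hi]
    return [lo + (hi - lo) * i / (n_layers - 1) for i in range(n_layers)]
```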
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"start_progress":0.75}
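A sketch of the warmdown schedule: constant LR until start_progress = 0.75, then (assuming the typical warmdown shape) a linear decay to zero at the end of training:

```python
def warmdown_lr(progress, base_lr=1.0, start=0.75):
    # Constant LR for the first `start` fraction of training, then a
    # linear ramp down to zero at progress = 1.0 (linear tail assumed).
    if progress < start:
        return base_lr
    return base_lr * (1.0 - progress) / (1.0 - start)
```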
Regularization
weight decay
parameters: {"value":0.01}
Other
other
Post-train candidate selection among final checkpoint, EMA checkpoint, selected late checkpoints, and their average.
parameters: null
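The selection step above can be sketched as evaluating each candidate (final checkpoint, EMA checkpoint, late checkpoints, and their parameter average) and keeping the best; `eval_fn` is a stand-in for the real validation-bpb evaluation:

```python
def average_params(checkpoints):
    # Element-wise mean of several parameter lists.
    return [sum(ps) / len(ps) for ps in zip(*checkpoints)]

def select_candidate(candidates, eval_fn):
    # candidates: {name: params}; lower eval_fn (e.g. val_bpb) wins.
    best = min(candidates, key=lambda name: eval_fn(candidates[name]))
    return best, candidates[best]
```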

Novel Contributions

  • XSA enabled on the last 5 layers with only the final XSA layer gated
  • RReLU2 MLP activation
  • Depth-aware linear scale initialization for attn_scale and mlp_scale
  • Late EMA with post-train best-checkpoint selection
  • Mixed-bit int6 AWQ export with selected tensors promoted to int8
  • Validation-tail calibration
  • Larger vocabulary (1792) improved the quality/size tradeoff