PR #1495
Status: open
Add non-record submission: 12L 24min Vocab1792 FlashMuon LinearScaleInit XSA5LastGated RReLU2 Int6AWQ MixedBits
by shram86
val_bpb: 1.1077
Architecture: Transformer
Optimizer: Muon
Artifact Size: 18,629,446 bytes
Training Techniques
Architecture
XSA
Enabled XSA on the last 5 layers, with only the final XSA layer gated.
parameters: {"layers":5,"last_gated":1}
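The card does not define XSA itself, so as an illustration only: restricting a branch to the last 5 layers, with a learned sigmoid gate on the final layer's output, might be wired up like this (the function name, gate parameterization, and residual form are assumptions, not the submission's actual code):

```python
import math

def apply_xsa_branch(x, branch_out, layer_idx, n_layers, xsa_layers=5, gate=0.0):
    """Hypothetical sketch: XSA runs only on the last `xsa_layers` layers,
    and only the final layer's contribution passes through a sigmoid gate."""
    first_xsa = n_layers - xsa_layers
    if layer_idx < first_xsa:
        return x                          # XSA disabled on early layers
    if layer_idx == n_layers - 1:         # only the final XSA layer is gated
        g = 1.0 / (1.0 + math.exp(-gate)) # learned scalar gate, sigmoid-squashed
        return x + g * branch_out
    return x + branch_out                 # remaining XSA layers: ungated residual
```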
ReLU²
Used the RReLU2 MLP activation.
parameters: null
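The heading reads ReLU², so the common squared-ReLU reading is assumed here; a minimal sketch of that activation:

```python
def relu2(x):
    """Squared ReLU: max(x, 0) ** 2. Assumed interpretation of the
    card's RReLU2 / ReLU-squared MLP activation."""
    return max(x, 0.0) ** 2
```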
Optimizer
Muon
weight_decay: 0.01
momentum: null
other_params: null
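Muon applies momentum SGD whose update matrix is approximately orthogonalized by a Newton-Schulz iteration. A sketch under stated assumptions: the quintic coefficients are the ones published with Muon, but the learning rate and momentum here are illustrative (the card lists momentum as null), and only weight_decay 0.01 comes from the card:

```python
import numpy as np

def newton_schulz(G, steps=5):
    """Approximately orthogonalize G (push its singular values toward 1),
    the core of the Muon update. Quintic coefficients from the Muon release."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)    # normalize so singular values <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                           # keep the Gram matrix small
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

def muon_step(param, grad, buf, lr=0.02, momentum=0.95, wd=0.01):
    """One illustrative Muon step with decoupled weight decay (wd 0.01 per the card)."""
    buf = momentum * buf + grad           # momentum buffer
    update = newton_schulz(buf)           # orthogonalized update direction
    param = param * (1 - lr * wd) - lr * update
    return param, buf
```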
Quantization
mixed int6/int8
bits: 6
scope: most tensors at int6, with selected sensitive tensors promoted to int8
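A minimal sketch of mixed-bit symmetric quantization, assuming per-tensor scales and a caller-supplied set of sensitive tensor names (the AWQ-style sensitivity selection itself is not reconstructed here):

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Symmetric per-tensor quantization to signed `bits`-bit integers
    (stored in int8 containers; int6 uses the range [-32, 31])."""
    qmax = 2 ** (bits - 1) - 1            # 31 for int6, 127 for int8
    absmax = np.abs(w).max()
    scale = absmax / qmax if absmax > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

def export_mixed(tensors, sensitive):
    """Quantize most tensors to int6, promoting names in `sensitive` to int8.
    The sensitivity criterion is an assumption, not the submission's code."""
    return {name: quantize_symmetric(w, 8 if name in sensitive else 6)
            for name, w in tensors.items()}
```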
Compression
lzma
level: null
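The card leaves the LZMA level unspecified; a sketch of the final compression stage using Python's standard `lzma` module, with preset 9 plus the extreme flag as an illustrative (assumed) choice:

```python
import lzma

def compress_artifact(raw: bytes) -> bytes:
    """Compress packed weight bytes with LZMA. The preset is an
    assumption; the card lists the level as null."""
    return lzma.compress(raw, preset=9 | lzma.PRESET_EXTREME)

def decompress_artifact(blob: bytes) -> bytes:
    return lzma.decompress(blob)
```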
Evaluation
sliding window eval
parameters: {"stride":64}
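Only the stride (64) is given by the card; the rest of this sliding-window evaluation sketch is an illustrative reconstruction. Each chunk of 64 tokens is scored once, conditioned on up to a full context window of preceding tokens:

```python
def sliding_eval_spans(n_tokens, window=2048, stride=64):
    """Return (ctx_lo, score_lo, score_hi) spans: tokens in
    [score_lo, score_hi) are scored with left context from ctx_lo.
    Every token is scored exactly once; window=2048 is an assumption
    matching the card's train_length."""
    spans = []
    score_lo = 0
    while score_lo < n_tokens:
        score_hi = min(score_lo + stride, n_tokens)
        ctx_lo = max(0, score_hi - window)    # keep context within the window
        spans.append((ctx_lo, score_lo, score_hi))
        score_lo = score_hi
    return spans
```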
Weight Averaging
EMA
parameters: {"start":"late"}
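The card says only that the EMA starts "late"; the start fraction and decay below are illustrative. A sketch of a weight EMA that begins tracking partway through training:

```python
class LateEMA:
    """EMA of model weights that begins only late in training.
    decay and start_frac are assumptions; the card gives start: "late"."""
    def __init__(self, decay=0.999, start_frac=0.9):
        self.decay, self.start_frac = decay, start_frac
        self.shadow = None

    def update(self, weights, progress):
        """progress is the training fraction completed, in [0, 1]."""
        if progress < self.start_frac:
            return                        # EMA not started yet
        if self.shadow is None:
            self.shadow = list(weights)   # initialize at first late step
        else:
            self.shadow = [self.decay * s + (1 - self.decay) * w
                           for s, w in zip(self.shadow, weights)]
```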
Initialization
linear scale init
Depth-aware constant initialization for attn_scale and mlp_scale, with stronger scales in later layers.
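A sketch of the depth-aware constant initialization described above, assuming a linear ramp over layer index; the endpoint values are illustrative, only "stronger scales in later layers" comes from the card:

```python
def linear_scale_init(layer_idx, n_layers, lo=0.5, hi=1.5):
    """Depth-aware constant init for attn_scale / mlp_scale:
    linearly interpolate from lo (first layer) to hi (last layer).
    The endpoints 0.5 and 1.5 are assumptions."""
    t = layer_idx / max(1, n_layers - 1)
    return lo + (hi - lo) * t
```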
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"start_progress":0.75}
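A warmdown schedule holds the learning rate constant, then decays it linearly to zero from a given fraction of training; start_progress 0.75 comes from the card, the base LR and linear shape are assumptions:

```python
def warmdown_lr(progress, base_lr=1.0, start_progress=0.75):
    """Constant LR until start_progress, then linear decay to 0 at
    progress = 1.0. start_progress 0.75 per the card."""
    if progress < start_progress:
        return base_lr
    frac = (progress - start_progress) / (1.0 - start_progress)
    return base_lr * (1.0 - frac)
```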
Regularization
weight decay
parameters: {"adamw_decay":0.01}
Novel Contributions
- XSA enabled only on the last 5 layers with only the final XSA layer gated
- RReLU2 MLP activation for this branch
- Depth-aware linear scale initialization for attn_scale and mlp_scale
- Late-start EMA with post-train candidate selection
- Mixed-bit int6 AWQ export with selected tensors promoted to int8
- Validation-tail calibration and sliding-window evaluation