PR #1474

open

Add non-record 16MB submission: Vocabulary1792 FlashMuon LinearScaleInit XSA5LastGated RReLU2 Int6AWQ MixedBits

by shram86
val_bpb
1.1434
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,859,310 bytes

Training Techniques

Architecture
XSA
Enabled XSA on the last 5 layers, with only the final XSA layer gated.
parameters: {"layers":5,"gated_layers":1}
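XSA itself is not defined in this PR, so the helper below is only a hypothetical sketch of how the `{"layers":5,"gated_layers":1}` layout might be computed: the last 5 blocks enable XSA, and only the final one of those carries a learned output gate.

```python
def xsa_layout(n_layers, xsa_layers=5, gated_layers=1):
    # Hypothetical helper: mark which decoder blocks enable XSA
    # (the last `xsa_layers`) and which of those carry a learned
    # output gate (only the last `gated_layers`, per the PR).
    enable = [i >= n_layers - xsa_layers for i in range(n_layers)]
    gated = [enable[i] and i >= n_layers - gated_layers for i in range(n_layers)]
    return enable, gated
```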
ReLU²
Used RReLU2 as the MLP activation.
parameters: null
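Assuming RReLU2 denotes the squared ReLU (ReLU(x)²) often used as a transformer MLP activation, a one-line sketch:

```python
def rrelu2(x):
    # Squared ReLU: zero for negative inputs, x**2 otherwise
    # (assumption: RReLU2 here means ReLU(x)**2).
    return max(x, 0.0) ** 2
```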
Optimizer
Muon
weight_decay: 0.01
momentum: null
other_params: null
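Muon's core step orthogonalizes the momentum matrix via a quintic Newton-Schulz iteration. A NumPy sketch with the coefficients from the public Muon implementation (momentum accumulation and weight decay omitted):

```python
import numpy as np

def newton_schulz_orth(G, steps=5, eps=1e-7):
    # Approximately orthogonalize G: the quintic polynomial drives all
    # singular values toward 1 (coefficients from the public Muon code).
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)  # Frobenius-normalize first
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X
```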
Quantization
mixed int6/int8
bits: 6
scope: most tensors at int6, with selected sensitive tensors promoted to int8
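A pure-Python sketch of symmetric per-tensor quantization with a per-tensor bit width; the names in `SENSITIVE` are hypothetical stand-ins for whichever tensors were promoted to int8:

```python
def quantize_symmetric(w, bits):
    # Symmetric per-tensor quantization to signed `bits`-bit integers.
    qmax = 2 ** (bits - 1) - 1
    m = max((abs(x) for x in w), default=0.0)
    scale = m / qmax if m > 0 else 1.0
    q = [max(-qmax - 1, min(qmax, round(x / scale))) for x in w]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

SENSITIVE = {"tok_emb", "lm_head"}  # hypothetical names of int8-promoted tensors

def bits_for(name):
    return 8 if name in SENSITIVE else 6
```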
Compression
lzma
level: null
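The serialized weight blob is LZMA-compressed into the final artifact; a stdlib sketch (the payload below is a placeholder, and the preset is an assumption since the level is unspecified):

```python
import lzma

blob = bytes(100_000)  # placeholder for the packed int6/int8 weight bytes
packed = lzma.compress(blob, preset=9 | lzma.PRESET_EXTREME)  # assumed preset
restored = lzma.decompress(packed)
```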
Weight Averaging
EMA
parameters: {"late_start":true}
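A sketch of late-start EMA; `decay` and `start_frac` are illustrative, since the PR records only `{"late_start": true}`:

```python
def ema_update(ema, params, decay):
    # Standard exponential moving average of parameters.
    return [decay * e + (1 - decay) * p for e, p in zip(ema, params)]

def maybe_ema(step, total_steps, ema, params, decay=0.999, start_frac=0.8):
    # Late start: before `start_frac` of training, just mirror the raw
    # params; only afterwards begin accumulating the EMA.
    if step < int(start_frac * total_steps):
        return list(params)
    return ema_update(ema, params, decay)
```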
Evaluation
sliding window eval
parameters: {"stride":64}
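Sliding-window evaluation scores each block of `stride` tokens while conditioning on up to a full window of left context; a sketch of the span bookkeeping (window length assumed equal to the 2048 training length):

```python
def sliding_eval_spans(n_tokens, window=2048, stride=64):
    # Returns (ctx_start, score_start, score_end) triples: each step scores
    # `stride` new tokens, conditioning on up to `window` tokens of context.
    spans = []
    score_start = 0
    while score_start < n_tokens:
        score_end = min(score_start + stride, n_tokens)
        ctx_start = max(0, score_end - window)
        spans.append((ctx_start, score_start, score_end))
        score_start = score_end
    return spans
```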
Initialization
linear scale init
Depth-aware constant initialization for attn_scale and mlp_scale, with stronger scales in later layers.
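A sketch of depth-aware linear interpolation for the per-layer constants; the endpoints `lo`/`hi` are hypothetical, as the PR states only that later layers get stronger scales:

```python
def linear_scale_init(n_layers, lo=0.5, hi=1.5):
    # Linearly interpolate a constant per-layer scale (for attn_scale or
    # mlp_scale) from `lo` at the first layer to `hi` at the last
    # (endpoint values are illustrative, not from the PR).
    if n_layers == 1:
        return [hi]
    return [lo + (hi - lo) * i / (n_layers - 1) for i in range(n_layers)]
```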
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"start_progress":0.75}
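A sketch of the warmdown schedule: constant LR until start_progress = 0.75, then (assuming the typical warmdown shape) a linear decay to zero at the end of training:

```python
def warmdown_lr(progress, base_lr=1.0, start=0.75):
    # Constant LR for the first `start` fraction of training, then a
    # linear ramp down to zero at progress = 1.0 (linear tail assumed).
    if progress < start:
        return base_lr
    return base_lr * (1.0 - progress) / (1.0 - start)
```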
Regularization
weight decay
parameters: {"value":0.01}
Other
other
Post-train candidate selection among final checkpoint, EMA checkpoint, selected late checkpoints, and their average.
parameters: null
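The selection step above can be sketched as evaluating each candidate (final checkpoint, EMA checkpoint, late checkpoints, and their parameter average) and keeping the best; `eval_fn` is a stand-in for the real validation-bpb evaluation:

```python
def average_params(checkpoints):
    # Element-wise mean of several parameter lists.
    return [sum(ps) / len(ps) for ps in zip(*checkpoints)]

def select_candidate(candidates, eval_fn):
    # candidates: {name: params}; lower eval_fn (e.g. val_bpb) wins.
    best = min(candidates, key=lambda name: eval_fn(candidates[name]))
    return best, candidates[best]
```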

Novel Contributions

  • XSA enabled on the last 5 layers with only the final XSA layer gated
  • RReLU2 MLP activation
  • Depth-aware linear scale initialization for attn_scale and mlp_scale
  • Late EMA with post-train best-checkpoint selection
  • Mixed-bit int6 AWQ export with selected tensors promoted to int8
  • Validation-tail calibration
  • Larger vocabulary (1792) improved the quality/size tradeoff