PR #1447

open

Add non-record 16MB submission: FlashMuon LinearScaleInit XSA5LastGated RReLU2 Int6AWQ

by shram86
val_bpb
1.1834
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,361,671 bytes

Training Techniques

Architecture
XSA
XSA enabled on the last 5 layers, with only the final XSA layer gated
parameters: {"layers":5,"gated_last_layer":true}
ReLU²
RReLU2 MLP activation
parameters: null
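A minimal sketch, assuming RReLU2 denotes the squared-ReLU activation relu(x)² in the MLP hidden layer; dimensions are illustrative:

```python
import torch
import torch.nn as nn

class ReLU2MLP(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden)
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Squared ReLU: zero for x <= 0, x**2 otherwise.
        return self.fc2(torch.relu(self.fc1(x)).square())

mlp = ReLU2MLP(dim=256, hidden=1024)
y = mlp(torch.randn(2, 8, 256))
```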
Initialization
linear scale init
linear-by-depth initialization for attn_scale and mlp_scale
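A minimal sketch of the initialization; the ramp endpoints (1.0 → 0.1) are assumptions, as the card only states that the scales are linear in depth:

```python
import torch
import torch.nn as nn

def linear_scale_init(n_layers: int, start: float = 1.0, end: float = 0.1):
    # One learned scalar per layer for each branch, initialized on a linear ramp.
    ramp = torch.linspace(start, end, n_layers)
    attn_scale = nn.ParameterList(nn.Parameter(ramp[i].clone()) for i in range(n_layers))
    mlp_scale = nn.ParameterList(nn.Parameter(ramp[i].clone()) for i in range(n_layers))
    return attn_scale, mlp_scale

attn_scale, mlp_scale = linear_scale_init(n_layers=10)
# In block i: x = x + attn_scale[i] * attn(x); x = x + mlp_scale[i] * mlp(x)
```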
Quantization
int6
bits: 6
scope: AWQ
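A minimal AWQ-flavored int6 sketch; alpha = 0.5 and per-output-channel weight scales are assumptions, and the calibration activations would presumably come from the validation tail (see "val-tail calibration" under Novel Contributions):

```python
import torch

def awq_int6(weight: torch.Tensor, calib_acts: torch.Tensor, alpha: float = 0.5):
    # weight: (out, in); calib_acts: (n_samples, in) from a calibration pass.
    act_mag = calib_acts.abs().mean(dim=0).clamp(min=1e-8)  # per input channel
    s = act_mag.pow(alpha)                                  # activation-aware channel scales
    w = weight * s                                          # fold saliency into the weights
    qmax = 2 ** (6 - 1) - 1                                 # symmetric int6: [-31, 31]
    wscale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / wscale), -qmax, qmax).to(torch.int8)
    return q, wscale, s  # runtime: y = (x / s) @ (q.float() * wscale).T
```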
Compression
lzma
level: null
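A minimal sketch of the export step using Python's built-in lzma; the preset is an assumption, since the card lists level: null:

```python
import io
import lzma
import torch

def export_lzma(state_dict: dict, path: str, preset: int = 9):
    # Serialize the (quantized) state dict, then xz-compress the bytes.
    buf = io.BytesIO()
    torch.save(state_dict, buf)
    with open(path, "wb") as f:
        f.write(lzma.compress(buf.getvalue(), preset=preset | lzma.PRESET_EXTREME))

def load_lzma(path: str) -> dict:
    with open(path, "rb") as f:
        return torch.load(io.BytesIO(lzma.decompress(f.read())))
```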
Weight Averaging
EMA
parameters: {"start":"late"}
Evaluation
sliding window eval
parameters: {"stride":64}
Sequence Length
sequence_length
train_length: 1024
eval_length: null

Novel Contributions

  • 10-layer SP-1024 model
  • XSA enabled on the last 5 layers with only the final layer gated
  • linear-by-depth initialization for attn_scale and mlp_scale
  • RReLU2 MLP
  • int6 AWQ quantization with lzma export
  • val-tail calibration
  • late EMA with post-train candidate selection
  • stride-64 sliding-window evaluation