PR #1496

open

Restore non-record submission: 2026-04-08 Vocab1792 FlashMuon LinearScaleInit XSA5LastGated RReLU2 Int6AWQ

by shram86
val_bpb
1.1920
Architecture
Transformer
Optimizer
Muon
Artifact Size
under 16 MB

Training Techniques

Architecture
XSA
Enabled XSA on the last 5 layers, with only the final XSA layer gated.
parameters: {"layers":5,"last_gated":true}
ReLU²
Used RReLU2 as the MLP activation.
parameters: null
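The RReLU2 activation named above is most plausibly squared ReLU (matching the ReLU² heading); a minimal NumPy sketch under that reading:

```python
import numpy as np

def relu2(x: np.ndarray) -> np.ndarray:
    # Squared ReLU: max(x, 0) ** 2, used in place of GELU in the MLP.
    return np.square(np.maximum(x, 0.0))
```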
Optimizer
Muon
weight_decay: 0.01
momentum: null
other_params: null
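At Muon's core is an approximate orthogonalization of each 2D gradient via a Newton-Schulz iteration; a NumPy sketch using the widely circulated quintic coefficients (the submission only pins weight_decay=0.01, so everything else here is illustrative):

```python
import numpy as np

def newton_schulz_orthogonalize(g: np.ndarray, steps: int = 5) -> np.ndarray:
    # Quintic Newton-Schulz iteration that drives singular values toward 1,
    # as used inside the Muon optimizer update.
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)  # normalize so the iteration converges
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T
    for _ in range(steps):
        A = x @ x.T
        B = b * A + c * (A @ A)
        x = a * x + B @ x
    return x.T if transposed else x
```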
Quantization
int6_awq
bits: 6
scope: weights
Compression
lzma
level: null
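The int6 quantization plus lzma pipeline can be sketched as below. This shows plain symmetric per-row 6-bit quantization; AWQ proper additionally rescales weights using activation statistics (the val_tail calibration mentioned later), which is omitted here. lzma uses the default preset since the submission reports level: null.

```python
import lzma
import numpy as np

def quantize_int6(w: np.ndarray):
    # Symmetric per-row quantization into the 6-bit range [-31, 31].
    # AWQ's activation-aware scale search is omitted in this sketch.
    qmax = 2 ** 5 - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0.0, 1.0, scale)
    q = np.clip(np.rint(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale.astype(np.float32)

def pack_artifact(q: np.ndarray, scale: np.ndarray) -> bytes:
    # int6 values stored one per byte; lzma squeezes out the slack bits.
    return lzma.compress(q.tobytes() + scale.tobytes())
```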
Evaluation
sliding window eval
parameters: {"stride":64}
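Sliding-window eval with stride 64 scores each token with near-full left context by advancing the window in small steps and only scoring the newly exposed tokens. A sketch of the span bookkeeping (the window size here is assumed to match the 2048 training length):

```python
def sliding_window_spans(n_tokens: int, window: int = 2048, stride: int = 64):
    # Each span is (ctx_start, score_start, end): tokens [score_start, end)
    # are scored using left context from ctx_start; windows advance by stride.
    spans = []
    prev_end = 0
    for ctx_start in range(0, n_tokens, stride):
        end = min(ctx_start + window, n_tokens)
        spans.append((ctx_start, prev_end, end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

Every token is scored exactly once, so per-token losses sum directly into val_bpb.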
Weight Averaging
EMA
parameters: {"start":"late"}
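A sketch of the late-start EMA: the shadow weights only begin accumulating past a chosen fraction of training. The submission gives only start: "late", so the decay and start fraction below are illustrative placeholders:

```python
import numpy as np

class LateStartEMA:
    # EMA of parameters that activates only late in training.
    # decay and start_frac are assumed knobs, not values from the submission.
    def __init__(self, decay: float = 0.999, start_frac: float = 0.9,
                 total_steps: int = 1000):
        self.decay = decay
        self.start_step = int(start_frac * total_steps)
        self.shadow = None

    def update(self, params, step: int):
        if step < self.start_step:
            return  # EMA not yet started
        if self.shadow is None:
            self.shadow = [p.copy() for p in params]
        else:
            for s, p in zip(self.shadow, params):
                s *= self.decay
                s += (1.0 - self.decay) * p
```

Post-training, the EMA shadow and the raw weights can both be evaluated and the better candidate kept, matching the "post-train candidate selection" noted below.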
Initialization
linear phase initialization
Used simple linear phase initialization for residual / phase setup.
depth-aware constant scale init
Initialized later layers with stronger constant attn_scale and mlp_scale values.
parameters: {"attn_scale":{"early":1,"mid":1.75,"late":2.5},"mlp_scale":{"early":1,"mid":1.15,"late":1.3}}
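The early/mid/late constants above can be mapped to per-layer scales as sketched below; the submission does not state the bucket boundaries, so splitting the stack into thirds is an assumption:

```python
def depth_scales(n_layers: int):
    # Constants taken from the submission's parameters; bucket boundaries
    # (equal thirds of the stack) are an assumed convention.
    attn = {"early": 1.0, "mid": 1.75, "late": 2.5}
    mlp = {"early": 1.0, "mid": 1.15, "late": 1.3}
    scales = []
    for i in range(n_layers):
        if i < n_layers / 3:
            bucket = "early"
        elif i < 2 * n_layers / 3:
            bucket = "mid"
        else:
            bucket = "late"
        scales.append((attn[bucket], mlp[bucket]))
    return scales
```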
LR Schedule
warmdown
parameters: {"start_progress":0.75,"progress_based":true}
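The progress-based warmdown above holds the learning rate constant until 75% of training, then decays it; a linear decay to zero is the usual shape and is assumed here:

```python
def warmdown_lr(progress: float, base_lr: float,
                start_progress: float = 0.75) -> float:
    # progress in [0, 1]: constant LR before start_progress,
    # then linear decay to zero at progress 1.0 (decay shape assumed).
    if progress < start_progress:
        return base_lr
    return base_lr * (1.0 - progress) / (1.0 - start_progress)
```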
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Regularization
weight decay
parameters: {"value":0.01}

Novel Contributions

  • XSA on the last 5 layers with only the final XSA layer gated
  • RReLU2 MLP activation
  • int6 AWQ quantization with lzma compression
  • val_tail calibration for quantization
  • late-start EMA with post-train candidate selection
  • depth-aware constant initialization for attn_scale and mlp_scale
  • progress-based warmdown starting around 75% of training