PR #1125 (open)
Non-record: XSA-All + QK Gain 4.0 + LN Scale — 45 Experiments on 1×RTX 5090

by jainpranjal97
val_bpb
1.1946
Architecture
Transformer
Optimizer
Muon
Artifact Size
18.1 MB

Training Techniques

Architecture
XSA
Applied XSA to all layers instead of only the last few layers.
parameters: {"layers":"all"}
LeakyReLU
Used LeakyReLU squared activation.
parameters: {"alpha":0.5,"squared":true}
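The card does not spell out how the square interacts with the negative branch; a minimal sketch, assuming a sign-preserving square (a plain square would fold the negative branch onto the positive side):

```python
import numpy as np

def leaky_relu_squared(x, alpha=0.5):
    """Squared LeakyReLU with alpha=0.5.
    The sign-preserving square is an assumption, not confirmed by the card."""
    y = np.where(x >= 0, x, alpha * x)
    return np.sign(y) * y ** 2

# 2 -> 2^2 = 4;  -2 -> -(0.5 * 2)^2 = -1
print(leaky_relu_squared(np.array([2.0, -2.0])))
```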
Partial RoPE
Used partial rotary position embeddings with some dimensions left position-free.
parameters: {"dimensions":"16/64"}
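Reading "16/64" as 16 of the 64 head dimensions receiving rotary embeddings while the remaining 48 stay position-free, a numpy sketch (the rotary base of 10000 and the choice of which dims rotate are assumptions):

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """Apply RoPE to the first `rot_dims` dims of x (seq_len, head_dim);
    leave the remaining dims position-free."""
    half = rot_dims // 2
    x_rot, x_pass = x[:, :rot_dims], x[:, rot_dims:]
    freqs = base ** (-np.arange(half) / half)           # per-pair frequencies
    ang = np.arange(x.shape[0])[:, None] * freqs[None, :]
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x_rot[:, :half], x_rot[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos,
                           x_pass], axis=1)
```

At position 0 the rotation is the identity, and because the transform is a rotation, the norm of the rotary block is preserved at every position.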
MLP3x
Increased MLP multiplier from baseline to 3x.
parameters: {"multiplier":3}
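The 3x multiplier just widens the MLP hidden layer to 3·d_model instead of the common 4·d_model; a sketch with placeholder random weights and ReLU (the run itself uses the squared LeakyReLU listed above in the card):

```python
import numpy as np

def mlp3x(x, d_model=8):
    """Two-layer MLP with hidden width 3 * d_model (the '3x' multiplier).
    Weights and activation are illustrative placeholders."""
    rng = np.random.default_rng(0)
    W1 = rng.standard_normal((d_model, 3 * d_model))
    W2 = rng.standard_normal((3 * d_model, d_model))
    h = np.maximum(x @ W1, 0.0)   # placeholder activation
    return h @ W2
```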
GQA
Used grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
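With 8 query heads over 4 KV heads, each KV head serves a group of two query heads; a common implementation simply repeats K and V along the head axis before standard attention:

```python
import numpy as np

def gqa_attention(q, k, v):
    """q: (B, 8, T, d); k, v: (B, 4, T, d). Repeat each KV head across
    its query group, then do standard scaled dot-product attention."""
    groups = q.shape[1] // k.shape[1]                  # 8 // 4 = 2
    k = np.repeat(k, groups, axis=1)
    v = np.repeat(v, groups, axis=1)
    scores = np.einsum('bhtd,bhsd->bhts', q, k) / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)       # numerically stable softmax
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return np.einsum('bhts,bhsd->bhtd', w, v)
```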
qk_gain
Scaled QK dot products with a higher initialization gain.
parameters: {"init":4}
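The exact placement of the gain is not given; one plain reading is a learnable scalar on the pre-softmax QK logits, initialized at 4.0 on top of the usual 1/sqrt(d) scaling:

```python
import numpy as np

def qk_logits(q, k, gain=4.0):
    """Attention logits with a gain held at its init value of 4.0.
    Whether the gain is scalar or per-head is an assumption."""
    return gain * (q @ k.T) / np.sqrt(q.shape[-1])
```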
Regularization
LN scale
Scaled each layer's norm by a depth-dependent factor.
parameters: {"scale":"1/sqrt(layer+1)"}
logit softcap
Applied soft-capping to bound the logits.
parameters: {"value":20}
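Both regularizers are one-liners; a sketch assuming the LN scale multiplies each layer's norm output and the softcap takes the common tanh form:

```python
import numpy as np

def ln_scale(layer_idx):
    """Depth-dependent layer-norm scale: 1/sqrt(layer+1)."""
    return 1.0 / np.sqrt(layer_idx + 1)

def logit_softcap(logits, cap=20.0):
    """Smoothly bound logits to (-cap, cap); tanh form is an assumption."""
    return cap * np.tanh(logits / cap)
```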
Optimizer
Muon
weight_decay: 0.06
momentum: 0.95
other_params: {"matrix_lr":0.04,"momentum_warmup_start":0.85,"momentum_warmup_steps":200,"grad_clip_norm":0.3}
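Muon's defining step is orthogonalizing the momentum buffer with a Newton-Schulz iteration before applying it. Production implementations use a tuned quintic iteration; the plain cubic below is a sketch of the idea (lr and momentum come from the card; weight decay 0.06 and grad clipping 0.3 are omitted):

```python
import numpy as np

def newton_schulz_orth(G, steps=30):
    """Approximate U @ V.T from G's SVD. Frobenius normalization keeps
    singular values <= 1, where the cubic iteration drives them to 1."""
    X = G / (np.linalg.norm(G) + 1e-7)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

def muon_step(W, grad, buf, lr=0.04, momentum=0.95):
    """One simplified Muon update: momentum, then orthogonalized step."""
    buf = momentum * buf + grad
    return W - lr * newton_schulz_orth(buf), buf
```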
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_iters":200}
linear warmup
parameters: {"warmup_steps":20}
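Together the two pieces give a trapezoidal schedule: 20 steps of linear warmup, a constant plateau, then a 200-step linear warmdown to zero (the total step count below is an illustrative assumption):

```python
def lr_at(step, total_steps, base_lr=0.04, warmup=20, warmdown=200):
    """Trapezoidal LR schedule: linear warmup, flat, linear warmdown."""
    if step < warmup:
        return base_lr * (step + 1) / warmup
    if step >= total_steps - warmdown:
        return base_lr * (total_steps - step) / warmdown
    return base_lr
```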
Quantization
int8
bits: 8
scope: all
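The usual reading of bits 8 with scope all is symmetric per-tensor int8 over every weight matrix (per-channel scales are also common, but nothing here indicates them):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8: map max |w| to 127."""
    scale = np.abs(w).max() / 127.0 + 1e-12   # eps guards all-zero tensors
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    return q.astype(np.float32) * scale
```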
Compression
zlib
level: null
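With level null, presumably zlib's default level is used; the artifact would then be the serialized (quantized) weights run through zlib. A sketch with stand-in bytes, since the real artifact layout is unknown:

```python
import zlib
import numpy as np

# Stand-in for a quantized int8 weight tensor.
weights = np.arange(-64, 64, dtype=np.int8).tobytes()
blob = zlib.compress(weights)   # level omitted -> zlib default
assert zlib.decompress(blob) == weights
```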
Weight Averaging
EMA
parameters: {"start_step":0}
SWA
parameters: {"scale_threshold":0.5,"interval":50}
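EMA from step 0 and interval-50 SWA are both running averages over weights; a sketch (the EMA decay is not given in the card and is an assumed value):

```python
def ema_update(ema, w, decay=0.999):
    """Exponential moving average of weights; decay=0.999 is assumed."""
    return decay * ema + (1.0 - decay) * w

def swa_average(checkpoints):
    """Uniform average of checkpoints sampled every `interval` steps."""
    return sum(checkpoints) / len(checkpoints)
```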

Novel Contributions

  • XSA applied to all layers instead of only the deepest layers
  • qk_gain_init sweep showing an initialization of 4.0 outperforms the default of 1.5
  • Warmdown calibration for wallclock-capped training
  • Observation that pre-quantization improvements can degrade post-quantization performance
  • Systematic 45-experiment exploration on a single RTX 5090