PR #1125 (open)
Non-record: XSA-All + QK Gain 4.0 + LN Scale — 45 Experiments on 1×RTX 5090

by jainpranjal97
val_bpb
1.1946
Architecture
Transformer
Optimizer
Muon
Artifact Size
18.1 MB

Training Techniques

Architecture
XSA
Applied XSA to all layers instead of only the last few layers.
parameters: {"layers":"all"}
LeakyReLU
Used LeakyReLU squared activation.
parameters: {"alpha":0.5,"squared":true}
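The card does not spell out how the square interacts with the negative branch; a minimal sketch, assuming a sign-preserving square (a plain square would fold the negative branch onto the positive side):

```python
import numpy as np

def leaky_relu_squared(x, alpha=0.5):
    """Squared LeakyReLU with alpha=0.5.
    The sign-preserving square is an assumption, not confirmed by the card."""
    y = np.where(x >= 0, x, alpha * x)
    return np.sign(y) * y ** 2

# 2 -> 2^2 = 4;  -2 -> -(0.5 * 2)^2 = -1
print(leaky_relu_squared(np.array([2.0, -2.0])))
```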
Partial RoPE
Used partial rotary position embeddings with some dimensions left position-free.
parameters: {"dimensions":"16/64"}
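Reading "16/64" as 16 of the 64 head dimensions receiving rotary embeddings while the remaining 48 stay position-free, a numpy sketch (the rotary base of 10000 and the choice of which dims rotate are assumptions):

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """Apply RoPE to the first `rot_dims` dims of x (seq_len, head_dim);
    leave the remaining dims position-free."""
    half = rot_dims // 2
    x_rot, x_pass = x[:, :rot_dims], x[:, rot_dims:]
    freqs = base ** (-np.arange(half) / half)           # per-pair frequencies
    ang = np.arange(x.shape[0])[:, None] * freqs[None, :]
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x_rot[:, :half], x_rot[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos,
                           x_pass], axis=1)
```

At position 0 the rotation is the identity, and because the transform is a rotation, the norm of the rotary block is preserved at every position.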
MLP3x
Increased MLP multiplier from baseline to 3x.
parameters: {"multiplier":3}
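The 3x multiplier just widens the MLP hidden layer to 3·d_model instead of the common 4·d_model; a sketch with placeholder random weights and ReLU (the run itself uses the squared LeakyReLU listed above in the card):

```python
import numpy as np

def mlp3x(x, d_model=8):
    """Two-layer MLP with hidden width 3 * d_model (the '3x' multiplier).
    Weights and activation are illustrative placeholders."""
    rng = np.random.default_rng(0)
    W1 = rng.standard_normal((d_model, 3 * d_model))
    W2 = rng.standard_normal((3 * d_model, d_model))
    h = np.maximum(x @ W1, 0.0)   # placeholder activation
    return h @ W2
```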
GQA
Used grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
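With 8 query heads over 4 KV heads, each KV head serves a group of two query heads; a common implementation simply repeats K and V along the head axis before standard attention:

```python
import numpy as np

def gqa_attention(q, k, v):
    """q: (B, 8, T, d); k, v: (B, 4, T, d). Repeat each KV head across
    its query group, then do standard scaled dot-product attention."""
    groups = q.shape[1] // k.shape[1]                  # 8 // 4 = 2
    k = np.repeat(k, groups, axis=1)
    v = np.repeat(v, groups, axis=1)
    scores = np.einsum('bhtd,bhsd->bhts', q, k) / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)       # numerically stable softmax
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return np.einsum('bhts,bhsd->bhtd', w, v)
```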
qk_gain
Scaled QK dot products with a higher initialization gain.
parameters: {"init":4}
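The exact placement of the gain is not given; one plain reading is a learnable scalar on the pre-softmax QK logits, initialized at 4.0 on top of the usual 1/sqrt(d) scaling:

```python
import numpy as np

def qk_logits(q, k, gain=4.0):
    """Attention logits with a gain held at its init value of 4.0.
    Whether the gain is scalar or per-head is an assumption."""
    return gain * (q @ k.T) / np.sqrt(q.shape[-1])
```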
Regularization
LN scale
Scaled each layer's norm by a depth-dependent factor.
parameters: {"scale":"1/sqrt(layer+1)"}
logit softcap
Applied soft-capping to bound the logits.
parameters: {"value":20}
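Both regularizers are one-liners; a sketch assuming the LN scale multiplies each layer's norm output and the softcap takes the common tanh form:

```python
import numpy as np

def ln_scale(layer_idx):
    """Depth-dependent layer-norm scale: 1/sqrt(layer+1)."""
    return 1.0 / np.sqrt(layer_idx + 1)

def logit_softcap(logits, cap=20.0):
    """Smoothly bound logits to (-cap, cap); tanh form is an assumption."""
    return cap * np.tanh(logits / cap)
```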
Optimizer
Muon
weight_decay: 0.06
momentum: 0.95
other_params: {"matrix_lr":0.04,"momentum_warmup_start":0.85,"momentum_warmup_steps":200,"grad_clip_norm":0.3}
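Muon's defining step is orthogonalizing the momentum buffer with a Newton-Schulz iteration before applying it. Production implementations use a tuned quintic iteration; the plain cubic below is a sketch of the idea (lr and momentum come from the card; weight decay 0.06 and grad clipping 0.3 are omitted):

```python
import numpy as np

def newton_schulz_orth(G, steps=30):
    """Approximate U @ V.T from G's SVD. Frobenius normalization keeps
    singular values <= 1, where the cubic iteration drives them to 1."""
    X = G / (np.linalg.norm(G) + 1e-7)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

def muon_step(W, grad, buf, lr=0.04, momentum=0.95):
    """One simplified Muon update: momentum, then orthogonalized step."""
    buf = momentum * buf + grad
    return W - lr * newton_schulz_orth(buf), buf
```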
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_iters":200}
linear warmup
parameters: {"warmup_steps":20}
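Together the two pieces give a trapezoidal schedule: 20 steps of linear warmup, a constant plateau, then a 200-step linear warmdown to zero (the total step count below is an illustrative assumption):

```python
def lr_at(step, total_steps, base_lr=0.04, warmup=20, warmdown=200):
    """Trapezoidal LR schedule: linear warmup, flat, linear warmdown."""
    if step < warmup:
        return base_lr * (step + 1) / warmup
    if step >= total_steps - warmdown:
        return base_lr * (total_steps - step) / warmdown
    return base_lr
```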
Quantization
int8
bits: 8
scope: all
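The usual reading of bits 8 with scope all is symmetric per-tensor int8 over every weight matrix (per-channel scales are also common, but nothing here indicates them):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8: map max |w| to 127."""
    scale = np.abs(w).max() / 127.0 + 1e-12   # eps guards all-zero tensors
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    return q.astype(np.float32) * scale
```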
Compression
zlib
level: null
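With level null, presumably zlib's default level is used; the artifact would then be the serialized (quantized) weights run through zlib. A sketch with stand-in bytes, since the real artifact layout is unknown:

```python
import zlib
import numpy as np

# Stand-in for a quantized int8 weight tensor.
weights = np.arange(-64, 64, dtype=np.int8).tobytes()
blob = zlib.compress(weights)   # level omitted -> zlib default
assert zlib.decompress(blob) == weights
```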
Weight Averaging
EMA
parameters: {"start_step":0}
SWA
parameters: {"scale_threshold":0.5,"interval":50}
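EMA from step 0 and interval-50 SWA are both running averages over weights; a sketch (the EMA decay is not given in the card and is an assumed value):

```python
def ema_update(ema, w, decay=0.999):
    """Exponential moving average of weights; decay=0.999 is assumed."""
    return decay * ema + (1.0 - decay) * w

def swa_average(checkpoints):
    """Uniform average of checkpoints sampled every `interval` steps."""
    return sum(checkpoints) / len(checkpoints)
```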

Novel Contributions

  • XSA applied to all layers instead of only the deepest layers
  • qk_gain_init sweep showing an initialization of 4.0 outperforms the default of 1.5
  • Warmdown calibration for wallclock-capped training
  • Observation that pre-quantization improvements can degrade post-quantization performance
  • Systematic 45-experiment exploration on a single RTX 5090