PR #2084 (open)

Non-record: k-XSA (k=2) rank-2 subspace XSA with energy cap

by ShadowNinja10
val_bpb: 1.1093
Architecture: Transformer
Optimizer: Muon
Artifact Size: 16.68 MB

Training Techniques

Architecture
XSA (rank-2 anchored subspace)
  Rank-2 exclusive self-attention using an anchored subspace basis [v, W x] with orthogonal projection and per-head scaling.
  parameters: {"rank": 2, "basis": "v+x", "target": "attn"}
XSA (energy cap)
  Energy-cap variant that soft-thresholds each projection coefficient with learned per-head, per-rank thresholds.
  parameters: {"energy_cap": 1, "threshold_init": 0}
Optimizer
AdamW (XSA scalars)
  weight_decay: 0
  momentum: null
  other_params: {"used_for": "xsa alpha/threshold parameters", "lr": 0.02}
Muon (matrices)
  weight_decay: null
  momentum: null
  other_params: {"used_for": "matrix parameters"}
AdamW (embeddings)
  weight_decay: null
  momentum: null
  other_params: {"used_for": "embeddings"}
Weight Averaging
EMA
  parameters: null
Quantization
GPTQ
  bits: 6
  scope: matrices
int8
  bits: 8
  scope: embeddings
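GPTQ itself is too involved for a short sketch; the int8 embedding scope, though, is plausibly plain symmetric per-row quantization. A minimal illustration of that path only (not the PR's code):

```python
import torch

def quantize_embeddings_int8(weight: torch.Tensor):
    """Symmetric per-row int8 quantization of an embedding matrix.

    Illustrates only the int8 embedding scope above; the 6-bit matrix
    scope uses GPTQ, which is not reproduced here.
    """
    scale = weight.abs().amax(dim=1, keepdim=True).clamp_min(1e-12) / 127.0
    q = torch.round(weight / scale).clamp(-127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover float weights for use at eval time."""
    return q.to(torch.float32) * scale
```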
Compression
brotli
  level: null
Evaluation
sliding window eval
  parameters: {"stride": 64}
Sequence Length
  train_length: null
  eval_length: null

Novel Contributions

  • Generalization of paper-XSA from rank-1 to rank-2 orthogonal projection in a learned subspace.
  • Anchored basis design using [v, W x] to preserve the original XSA direction while adding a learned direction.
  • Per-direction energy capping via learned soft-thresholding of projection coefficients.
  • Identification and fix of the hidden weight-decay confound on learnable XSA alpha parameters.
  • Observation that energy capping makes the added basis projection matrix essentially free to quantize to 6 bits.
  • Reported improvement over paper-XSA in both pre-quantized and quantized validation bpb.