PR #2084 (open)

Non-record: k-XSA (k=2) rank-2 subspace XSA with energy cap

by ShadowNinja10
val_bpb: 1.1093
Architecture: Transformer
Optimizer: Muon
Artifact Size: 16.68 MB

Training Techniques

Architecture
XSA (rank-2 anchored subspace)
  Rank-2 exclusive self-attention using an anchored subspace basis [v, W x] with orthogonal projection and per-head scaling.
  parameters: {"rank": 2, "basis": "v+x", "target": "attn"}
XSA (energy cap)
  Energy-cap variant that soft-thresholds each projection coefficient with learned per-head, per-rank thresholds.
  parameters: {"energy_cap": 1, "threshold_init": 0}
Optimizer
AdamW (XSA scalars)
  weight_decay: 0
  momentum: null
  other_params: {"used_for": "xsa alpha/threshold parameters", "lr": 0.02}
Muon (matrices)
  weight_decay: null
  momentum: null
  other_params: {"used_for": "matrix parameters"}
AdamW (embeddings)
  weight_decay: null
  momentum: null
  other_params: {"used_for": "embeddings"}
Weight Averaging
EMA
  parameters: null
Quantization
GPTQ
  bits: 6
  scope: matrices
int8
  bits: 8
  scope: embeddings
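GPTQ itself is too involved for a short sketch; the int8 embedding scope, though, is plausibly plain symmetric per-row quantization. A minimal illustration of that path only (not the PR's code):

```python
import torch

def quantize_embeddings_int8(weight: torch.Tensor):
    """Symmetric per-row int8 quantization of an embedding matrix.

    Illustrates only the int8 embedding scope above; the 6-bit matrix
    scope uses GPTQ, which is not reproduced here.
    """
    scale = weight.abs().amax(dim=1, keepdim=True).clamp_min(1e-12) / 127.0
    q = torch.round(weight / scale).clamp(-127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover float weights for use at eval time."""
    return q.to(torch.float32) * scale
```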
Compression
brotli
  level: null
Evaluation
sliding window eval
  parameters: {"stride": 64}
Sequence Length
  train_length: null
  eval_length: null

Novel Contributions

  • Generalization of paper-XSA from rank-1 to rank-2 orthogonal projection in a learned subspace.
  • Anchored basis design using [v, W x] to preserve the original XSA direction while adding a learned direction.
  • Per-direction energy capping via learned soft-thresholding of projection coefficients.
  • Identification and fix of the hidden weight-decay confound on learnable XSA alpha parameters.
  • Observation that energy capping makes the added basis projection matrix essentially free to quantize to 6 bits.
  • Reported improvement over paper-XSA in both pre-quantized and quantized validation bpb.