PR #1214

open

Non-record: Emergent weight symmetry in QO projections + learnable SymMix

val_bpb: 1.1688
Architecture: Transformer
Optimizer: Muon
Artifact Size: 14,799 KB

Training Techniques

Optimizer
Muon
weight_decay: null
momentum: null
other_params: null
Architecture
U-Net skip connections
Transformer variant with encoder-decoder skip connections; symmetry emerges in middle decoder QO layers.
parameters: {"layers":11}
Quantization
int6
bits: 6
scope: artifact weights
Compression
lzma
level: null
Other
other
Learnable SymMix applied to QO matrices: W_eff = W + tanh(beta) * W^T with one scalar beta per matrix.
parameters: {"num_parameters":22}
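
The SymMix formula can be illustrated with a minimal sketch. It uses plain Python lists to stay dependency-free; the helper names (`transpose`, `symmix`) are hypothetical and not taken from the PR, and the actual training code presumably applies the same formula to framework tensors with beta as a trainable scalar. The count of 22 parameters is consistent with one beta for each Q and each O matrix across 11 layers, though the PR does not spell this out.

```python
import math


def transpose(w):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*w)]


def symmix(w, beta):
    """Return W_eff = W + tanh(beta) * W^T for a square matrix W.

    `beta` is the single scalar attached to this matrix. Here it is a
    plain float; in training it would be a learnable parameter.
    """
    n = len(w)
    if any(len(row) != n for row in w):
        raise ValueError("SymMix needs a square matrix")
    gate = math.tanh(beta)  # bounded in (-1, 1)
    wt = transpose(w)
    return [[w[i][j] + gate * wt[i][j] for j in range(n)] for i in range(n)]


w = [[1.0, 2.0], [0.0, 3.0]]
w_eff_zero = symmix(w, 0.0)    # tanh(0) = 0, so W_eff equals W
w_eff_large = symmix(w, 20.0)  # tanh(20) is ~1, so W_eff is ~ W + W^T
```

Because tanh is bounded, the mixing coefficient stays in (-1, 1): positive beta pushes W_eff toward the symmetric combination W + W^T, negative beta toward the antisymmetric combination W - W^T, and beta = 0 leaves the projection unchanged.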
other
Post-training force-symmetrization of selected QO projections using (W + W^T) / 2 before quantization.
parameters: {"layers":[6,7,8]}
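
A minimal sketch of the force-symmetrization step, assuming the weights are available as a mapping from layer index to a square matrix. The names `symmetrize`, `force_symmetrize`, `upper_triangle`, and the `qo_weights` layout are hypothetical; only the formula (W + W^T) / 2 and the layer list [6, 7, 8] come from the PR. How the PR turns symmetry into a smaller artifact is not stated; storing only the upper triangle is shown here as one possible route, not as the author's method.

```python
SYM_LAYERS = (6, 7, 8)  # layers listed in the PR


def symmetrize(w):
    """Return (W + W^T) / 2 for a square matrix stored as a list of rows."""
    n = len(w)
    if any(len(row) != n for row in w):
        raise ValueError("symmetrization needs a square matrix")
    return [[(w[i][j] + w[j][i]) / 2 for j in range(n)] for i in range(n)]


def force_symmetrize(qo_weights, layers=SYM_LAYERS):
    """Symmetrize the matrices of the selected layers, leave the rest as-is.

    `qo_weights` is an assumed layout: {layer_index: square matrix}.
    """
    return {
        layer: symmetrize(w) if layer in layers else w
        for layer, w in qo_weights.items()
    }


def upper_triangle(w):
    """Flatten the upper triangle (diagonal included) of a symmetric matrix.

    A symmetric n x n matrix is fully determined by these n * (n + 1) / 2
    values, which is one way the redundancy could be exploited.
    """
    n = len(w)
    return [w[i][j] for i in range(n) for j in range(i, n)]


weights = {5: [[1.0, 2.0], [3.0, 4.0]], 6: [[1.0, 2.0], [4.0, 3.0]]}
out = force_symmetrize(weights)
packed = upper_triangle(out[6])
```

In this toy example layer 5 is passed through untouched, while layer 6 becomes exactly symmetric, so its off-diagonal entries are duplicated and only three of its four values are independent.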

Novel Contributions

  • Discovery of emergent near-perfect symmetry in the O projections of layers 6-8 during full training
  • Identification of a sharp bimodal symmetry pattern across QO matrices
  • Introduction of learnable SymMix for QO projections
  • Post-training force-symmetrization of near-symmetric layers to reduce artifact size
  • Structural analysis showing no other exploitable symmetry or block structure in the model
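
The PR summary does not say how symmetry was measured. One plausible metric, shown here purely as an assumption, is the cosine similarity between W and its transpose under the Frobenius inner product: it equals 1 for a symmetric matrix, -1 for an antisymmetric one, and falls in between otherwise. The function name `symmetry_score` is hypothetical.

```python
def symmetry_score(w):
    """Cosine similarity between W and W^T (Frobenius inner product).

    Assumed metric, not taken from the PR:
        score = sum_ij W[i][j] * W[j][i] / sum_ij W[i][j] ** 2
    """
    n = len(w)
    if any(len(row) != n for row in w):
        raise ValueError("symmetry score needs a square matrix")
    norm_sq = sum(w[i][j] ** 2 for i in range(n) for j in range(n))
    if norm_sq == 0:
        raise ValueError("symmetry score is undefined for the zero matrix")
    inner = sum(w[i][j] * w[j][i] for i in range(n) for j in range(n))
    return inner / norm_sq


symmetric = [[1.0, 2.0], [2.0, 3.0]]
antisymmetric = [[0.0, 1.0], [-1.0, 0.0]]
mixed = [[1.0, 2.0], [0.0, 3.0]]

scores = [symmetry_score(m) for m in (symmetric, antisymmetric, mixed)]
```

Computing such a score for every Q and O matrix and looking at its distribution is one way a bimodal pattern (scores clustered near 1 for some layers and far from 1 for the rest) would become visible.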