PR #2149

open

Non-record: SP8192 + RandProj384 tied embeddings + Pairwise-QK Muon -- Single-seed negative result

by YaseenHQ
val_bpb: 1.1269
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,438,770 bytes
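
For context on the headline metric: val_bpb is conventionally read as validation bits per byte, i.e. the summed negative log-likelihood converted from nats to bits and divided by the number of raw bytes scored. The sketch below shows that conversion only. The function name and the sample totals are hypothetical, and the PR does not state how its evaluation script computes the figure.

```python
import math

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed negative log-likelihood (in nats) to bits per byte."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    # Dividing by ln(2) turns nats into bits; dividing by bytes normalizes
    # by the raw text length rather than by the tokenizer's token count.
    return total_nll_nats / (math.log(2) * total_bytes)

# Hypothetical totals, for illustration only (not taken from the PR).
example_bpb = bits_per_byte(total_nll_nats=781_000.0, total_bytes=1_000_000)
print(f"{example_bpb:.4f}")
```

Normalizing by bytes rather than tokens is what makes the metric comparable across submissions with different tokenizers.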

Training Techniques

Architecture
weight tying
Embeddings are tied through a lower-dimensional random projection (RandProj384).
parameters: {"projection_dim":384}
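
One possible reading of RandProj384, sketched in NumPy: a single trainable table of width 384 is shared by the input embedding and the output head, and a fixed random matrix maps between that width and the model width. Only projection_dim = 384 comes from the PR. The vocabulary size (read off "SP8192" in the title), the model width, the initialization, and the choice to freeze the projection are all assumptions made for illustration.

```python
import numpy as np

VOCAB = 8192      # assumption: "SP8192" taken to mean an 8192-entry vocabulary
PROJ_DIM = 384    # projection_dim from the PR metadata
D_MODEL = 512     # hypothetical model width, not stated in the PR

rng = np.random.default_rng(0)

# Shared low-dimensional table: the only vocabulary-sized parameter (assumed trainable).
table = rng.normal(0.0, 0.02, size=(VOCAB, PROJ_DIM)).astype(np.float32)

# Random projection between the table width and the model width (assumed fixed).
proj = (rng.normal(size=(PROJ_DIM, D_MODEL)) / np.sqrt(PROJ_DIM)).astype(np.float32)

def embed(token_ids: np.ndarray) -> np.ndarray:
    """Input side: look up the 384-d row, then project up to D_MODEL."""
    return table[token_ids] @ proj

def logits(hidden: np.ndarray) -> np.ndarray:
    """Output side: project hidden states down to 384-d, then score against the same table."""
    return (hidden @ proj.T) @ table.T

tokens = np.array([1, 42, 8191])
x = embed(tokens)    # shape (3, D_MODEL)
y = logits(x)        # shape (3, VOCAB)
```

Under these assumed sizes the shared table holds 8192 x 384 parameters instead of 8192 x 512, which is where any artifact-size saving would come from.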
attention modifications
Pairwise-head Muon orthogonalization applied to Q/K projections (PairMuonQK).
parameters: null
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"pairwise_head_qk_orthogonalization":true}
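
The PR metadata does not spell out how PairMuonQK works. As a hedged sketch of one plausible interpretation, the code below splits a Q (or K) update matrix into row blocks covering two adjacent heads each and orthogonalizes every block separately, rather than orthogonalizing the whole matrix at once. The Newton-Schulz coefficients are those of the public reference Muon implementation. The pairing scheme, the function names, and all dimensions are assumptions, not the author's code.

```python
import numpy as np

def newton_schulz(g: np.ndarray, steps: int = 5, eps: float = 1e-7) -> np.ndarray:
    """Approximately orthogonalize g with a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315   # coefficients from the reference Muon code
    x = g / (np.linalg.norm(g) + eps)   # Frobenius normalization bounds the spectrum by 1
    transposed = x.shape[0] > x.shape[1]
    if transposed:                      # iterate on the wide orientation
        x = x.T
    for _ in range(steps):
        gram = x @ x.T
        x = a * x + (b * gram + c * gram @ gram) @ x
    return x.T if transposed else x

def pairwise_head_orthogonalize(update: np.ndarray, n_heads: int) -> np.ndarray:
    """Orthogonalize each block of two adjacent heads independently (assumed scheme)."""
    rows, d_model = update.shape
    assert n_heads % 2 == 0 and rows % n_heads == 0
    head_dim = rows // n_heads
    blocks = update.reshape(n_heads // 2, 2 * head_dim, d_model)
    out = np.stack([newton_schulz(block) for block in blocks])
    return out.reshape(rows, d_model)

# Hypothetical sizes; the PR does not list head count or widths.
N_HEADS, HEAD_DIM, D_MODEL = 8, 64, 512
rng = np.random.default_rng(0)
q_update = rng.normal(size=(N_HEADS * HEAD_DIM, D_MODEL))
q_ortho = pairwise_head_orthogonalize(q_update, N_HEADS)
```

The iteration only pushes singular values toward a band around 1 rather than making each block exactly orthogonal, which is the usual Muon trade-off for speed.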
Weight Averaging
EMA
parameters: null
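
A minimal sketch of EMA weight averaging, assuming the standard update rule in which a shadow copy of the weights tracks the live weights after each optimizer step. The PR lists no EMA parameters, so the decay value and the plain-dict representation of the weights here are hypothetical.

```python
def ema_update(ema: dict, params: dict, decay: float) -> dict:
    """Return the shadow weights after one exponential-moving-average step."""
    return {name: decay * ema[name] + (1.0 - decay) * params[name] for name in params}

# Toy scalar "weights" to show the mechanics; real use would hold arrays or tensors.
DECAY = 0.9                      # hypothetical; not given in the PR
shadow = {"w": 0.0}
for live_value in (1.0, 1.0):    # two steps with the live weight held at 1.0
    shadow = ema_update(shadow, {"w": live_value}, DECAY)
```

The shadow weights, not the live ones, would typically be the ones evaluated and exported.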

Novel Contributions

  • Random-projection tied embeddings (RandProj384)
  • Pairwise-head Muon orthogonalization for Q/K (PairMuonQK)
  • Negative-result submission: in a single-seed run, these ideas regressed validation quality
  • Legal sub-16MB artifact produced within the training cap
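
The size claim in the last bullet can be checked with simple arithmetic. The PR does not say whether the 16MB cap is decimal or binary, so this sketch evaluates the reported artifact size against both readings. The limit definitions are assumptions; only the byte count comes from the PR.

```python
ARTIFACT_BYTES = 15_438_770  # reported artifact size

# Two common readings of "16MB"; which one the cap uses is an assumption.
LIMITS = {
    "16 MB (decimal)": 16_000_000,
    "16 MiB (binary)": 16 * 1024 * 1024,
}

headroom = {name: limit - ARTIFACT_BYTES for name, limit in LIMITS.items()}
for name, spare in headroom.items():
    print(f"{name}: {spare:,} bytes of headroom")
```

The artifact fits under either reading, with the decimal limit being the tighter of the two.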