PR #1025

Status: open

non-record: MASA low-rank shared attention + SwiGLU, 1.3579 BPB

by Zagot-byteView on GitHub

val_bpb: 1.3579
Architecture: Transformer
Optimizer: (not specified)
Artifact Size: 20.98 MB

Training Techniques

Architecture
shared attention
All 11 layers share a set of low-rank base matrices instead of unique Q/K/V/O weights per layer; each layer learns mix coefficients.
parameters: {"layers":11,"bases":10,"rank":128}
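The shared-attention scheme above can be sketched in NumPy. Sizes here are toy (the PR reports layers=11, bases=10, rank=128), and the exact factorization and coefficient parameterization are assumptions, not the PR's actual code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes for illustration; the PR uses d-model-sized matrices with rank=128.
d_model, rank, n_bases, n_layers = 64, 4, 10, 11

# Shared low-rank "atoms": base i is the rank-`rank` matrix U[i] @ V[i].
U = rng.normal(size=(n_bases, d_model, rank)) / np.sqrt(d_model)
V = rng.normal(size=(n_bases, rank, d_model)) / np.sqrt(rank)

# Per-layer mixing coefficients, one set per projection (Q, K, V, O).
# These are the only attention parameters unique to a layer.
coef = rng.normal(size=(n_layers, 4, n_bases))

def layer_projections(layer_idx):
    """Assemble a layer's Q/K/V/O matrices as coefficient-weighted
    sums of the shared low-rank atoms."""
    atoms = np.einsum('ndr,nrm->ndm', U, V)                   # (n_bases, d_model, d_model)
    return np.einsum('pn,ndm->pdm', coef[layer_idx], atoms)   # (4, d_model, d_model)

Wq, Wk, Wv, Wo = layer_projections(0)

# Parameter comparison: shared bases + per-layer coefficients vs. unique Q/K/V/O per layer.
shared_params = n_bases * 2 * d_model * rank + n_layers * 4 * n_bases
baseline_params = n_layers * 4 * d_model * d_model
```

Each assembled projection has rank at most `n_bases * rank`, which is where the parameter savings over per-layer dense Q/K/V/O weights comes from.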
MLP3x
Replaces the baseline's squared-ReLU MLP with a SwiGLU MLP at 3x expansion.
parameters: {"multiplier":3}
SwiGLU
SwiGLU activation in the MLP.
parameters: {"hidden":341}
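A minimal sketch of the SwiGLU MLP described above, in NumPy. The hidden width 341 is taken from the PR's parameters; `d_model = 128` and the weight scale are illustrative assumptions:

```python
import numpy as np

def silu(z):
    """SiLU (swish) activation: z * sigmoid(z)."""
    return z / (1.0 + np.exp(-z))

def swiglu_mlp(x, w_gate, w_up, w_down):
    """SwiGLU MLP: gated hidden state silu(x @ W_gate) * (x @ W_up),
    projected back down to d_model."""
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

rng = np.random.default_rng(0)
d_model, d_hidden = 128, 341      # hidden=341 from the PR; d_model is a toy assumption
w_gate = rng.normal(size=(d_model, d_hidden)) * 0.02
w_up   = rng.normal(size=(d_model, d_hidden)) * 0.02
w_down = rng.normal(size=(d_hidden, d_model)) * 0.02

x = rng.normal(size=(4, d_model))  # a batch of 4 token vectors
y = swiglu_mlp(x, w_gate, w_up, w_down)
```

Note that SwiGLU carries three weight matrices (gate, up, down) rather than two, which is why gated MLPs usually shrink the hidden width relative to an ungated MLP of the same parameter budget.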
KV head count
Uses 8 attention heads and 8 KV heads (one KV head per query head, i.e. standard multi-head attention rather than grouped-query attention).
parameters: {"heads":8,"kv_heads":8}
LR Schedule
warmdown
parameters: {"warmdown_start":16000,"iterations":20000}
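The warmdown schedule above can be sketched as follows. The PR only specifies the two step counts (warmdown_start=16000, iterations=20000); the linear decay shape and `base_lr` value are assumptions:

```python
def warmdown_lr(step, base_lr=1e-3, warmdown_start=16000, iterations=20000):
    """Constant LR until `warmdown_start`, then linear decay to zero
    at `iterations`. Linear shape is an assumption; the PR only gives
    the two step counts."""
    if step < warmdown_start:
        return base_lr
    remaining = max(iterations - step, 0)
    return base_lr * remaining / (iterations - warmdown_start)
```

For example, at step 18000 (halfway through the warmdown window) the learning rate is half of `base_lr`, and it reaches zero exactly at step 20000.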
Sequence Length
train_length: 512
eval_length: null

Novel Contributions

  • MASA (Matrix Atom Sharing Attention): low-rank base matrices shared across all 11 layers, improving parameter efficiency
  • Per-layer mixing coefficients in place of separate Q/K/V/O weights
  • SwiGLU MLP in place of the baseline MLP
  • Warmdown fix for learning rate decay