PR #1264
open [non-record track] Hierarchical Shared Attention (HSA): multi-level sharing across attention heads
by andrewmouldon
val_bpb: 1.2225
Architecture: Transformer
Optimizer: —
Artifact Size: 15887509 bytes
Training Techniques
Architecture: attention projections
Hierarchical Shared Attention decomposes features into shared, group-shared, and head-specific components, enabling multi-level sharing across attention heads.
parameters: {"q_levels":[[2,8],[4,16],[8,40]],"kv_levels":[[1,16],[2,16],[4,32]]}
KV head count
Uses hierarchical sharing in key/value projections to reduce redundancy and KV-cache duplication while preserving specialization.
parameters: {"kv_heads":4,"head_dim":64}
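The q_levels/kv_levels lists above imply a parameter budget that can be sketched directly. A minimal back-of-envelope, assuming each [g, d] entry means "g independent projection blocks of width d", with heads in a group sharing a block; d_model is a hypothetical 512 and is not stated in the PR:

```python
# Hedged sketch: parameter counts implied by the PR's level lists.
# Assumption: each [g, d] entry is "g independent projections of width d",
# heads in a group share a projection. d_model is hypothetical.
d_model = 512                            # assumed model width, not from the PR
q_levels = [[2, 8], [4, 16], [8, 40]]    # shared -> group-shared -> head-specific
kv_levels = [[1, 16], [2, 16], [4, 32]]

def proj_params(levels, d_model):
    # one (d_model x d) weight block per group at each level
    return d_model * sum(g * d for g, d in levels)

# a head's width is the concatenation of one block from each level
head_dim = sum(d for _, d in q_levels)   # 8 + 16 + 40 = 64, matching head_dim
baseline_q = d_model * 8 * head_dim      # 8 standard query heads (assumed)
baseline_kv = d_model * 4 * head_dim     # kv_heads = 4, from the PR
print(proj_params(q_levels, d_model), baseline_q)    # 204800 vs 262144
print(proj_params(kv_levels, d_model), baseline_kv)  # 90112 vs 131072
```

Under these assumptions the query projection drops from 262144 to 204800 parameters and each of K/V from 131072 to 90112, which is the "reduce parameter count in QKV projections" claim made concrete.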
Regularization: weight decay
parameters: null
Novel Contributions
- Introduces Hierarchical Shared Attention (HSA) for multi-level sharing across attention heads.
- Combines MQA-style, GQA-style, and head-specific features within a single attention projection hierarchy.
- Uses learned per-head scaling to specialize shared representations with minimal cost.
- Reduces parameter count in QKV projections and KV-cache size while preserving expressivity.
- Shows improved val_bpb over the baseline under fixed 10k-step training at a matched parameter budget.
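The "learned per-head scaling" contribution can be sketched as each head reading one shared block per level and rescaling it with its own learned scalar before concatenation. All names and the head-to-group assignment rule below are assumptions for illustration, not the PR's code:

```python
# Hedged sketch of per-head scaling over a sharing hierarchy (hypothetical).
def head_feature(head, levels, blocks, scales):
    """Assemble one head's feature vector from the hierarchy.

    levels: [(groups, dim)] per level, e.g. [(2, 8), (4, 16), (8, 40)]
    blocks[lv][g]: the dim-wide shared output of group g at level lv
    scales[head][lv]: learned per-head scalar for that level (assumed form)
    """
    n_heads = levels[-1][0]  # assume the last level is fully head-specific
    feat = []
    for lv, (groups, dim) in enumerate(levels):
        g = head * groups // n_heads  # which shared block this head reads
        feat.extend(scales[head][lv] * x for x in blocks[lv][g])
    return feat

levels = [(2, 8), (4, 16), (8, 40)]  # the PR's q_levels
blocks = [[[1.0] * d for _ in range(g)] for g, d in levels]
scales = [[1.0, 1.0, 1.0] for _ in range(8)]
print(len(head_feature(0, levels, blocks, scales)))  # 64, i.e. head_dim
```

The per-head scalars are the cheap specialization knob: heads sharing a block still produce distinct features at the cost of one learned scalar per head per level.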