PR #1264

open

[non-record track] Hierarchical Shared Attention (HSA): multi-level sharing across attention heads

by andrewmouldon
val_bpb: 1.2225
Architecture: Transformer
Optimizer:
Artifact Size: 15887509 bytes

Training Techniques

Architecture
attention projections
Implements Hierarchical Shared Attention (HSA): multi-level sharing across attention heads that decomposes each head's features into shared, group-shared, and head-specific components.
parameters: {"q_levels":[[2,8],[4,16],[8,40]],"kv_levels":[[1,16],[2,16],[4,32]]}
KV head count
Uses hierarchical sharing in key/value projections to reduce redundancy and KV-cache duplication while preserving specialization.
parameters: {"kv_heads":4,"head_dim":64}
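A minimal sketch of the hierarchical projection described above, under stated assumptions: each level is a `(groups, dim)` pair in which heads split evenly into `groups` groups and share that group's projection, and the per-level dims concatenate to `head_dim` (for the PR's `kv_levels`, 16 + 16 + 32 = 64). The function name `hsa_projection` and the weight initialization are hypothetical, not from the PR; the learned per-head scaling is fixed to 1 here.

```python
import numpy as np

def hsa_projection(x, levels, n_heads, rng):
    """Hierarchical shared projection (sketch, not the PR's code).

    levels: list of (n_groups, dim) pairs. At each level, heads are
    split evenly into n_groups groups and all heads in a group share
    one d_model x dim projection. Concatenating the level outputs
    gives each head a head_dim = sum of level dims.
    """
    d_model = x.shape[-1]
    outs = []
    for n_groups, dim in levels:
        # one weight matrix per group at this level (random init for the sketch)
        W = rng.standard_normal((n_groups, d_model, dim)) * d_model ** -0.5
        heads_per_group = n_heads // n_groups
        # each head selects its group's shared projection
        level_out = np.stack(
            [x @ W[h // heads_per_group] for h in range(n_heads)], axis=-2
        )  # (..., n_heads, dim)
        outs.append(level_out)
    # a learned per-head scale would multiply here; fixed to 1 in the sketch
    return np.concatenate(outs, axis=-1)  # (..., n_heads, head_dim)

# PR's kv_levels: fully shared (1 group), group-shared (2), head-specific (4)
x = np.ones((3, 128))  # (seq, d_model); sizes are hypothetical
kv = hsa_projection(x, [(1, 16), (2, 16), (4, 32)], n_heads=4,
                    rng=np.random.default_rng(0))
print(kv.shape)  # (3, 4, 64)
```

The first 16 output dims come from the 1-group level, so they are identical across all 4 KV heads; the last 32 dims are head-specific, which is the MQA-to-head-specific spectrum the entry describes.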
Regularization
weight decay
parameters: null

Novel Contributions

  • Introduces Hierarchical Shared Attention (HSA) for multi-level sharing across attention heads.
  • Combines MQA-style, GQA-style, and head-specific features within a single attention projection hierarchy.
  • Uses learned per-head scaling to specialize shared representations with minimal cost.
  • Reduces parameter count in QKV projections and KV-cache size while preserving expressivity.
  • Shows improved BPB over baseline under 10k fixed-step training with a consistent parameter budget.
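The parameter-count claim can be illustrated with a back-of-the-envelope comparison. The level specs and `kv_heads`/`head_dim` come from the PR's parameters; `d_model = 768` and 8 query heads are assumptions for illustration (8 matches the head-specific group count in `q_levels`).

```python
# Hypothetical d_model; the PR does not state it. 8 query heads is
# inferred from q_levels' last entry; kv_heads and head_dim are from
# the PR's parameters.
d_model, head_dim, q_heads, kv_heads = 768, 64, 8, 4

q_levels = [(2, 8), (4, 16), (8, 40)]    # (groups, dim) per level
kv_levels = [(1, 16), (2, 16), (4, 32)]

def level_params(levels):
    # each level stores one d_model x dim matrix per group
    return d_model * sum(g * d for g, d in levels)

baseline_q = d_model * q_heads * head_dim          # standard multi-head Q
baseline_kv = 2 * d_model * kv_heads * head_dim    # GQA-style K and V
hsa_q = level_params(q_levels)
hsa_kv = 2 * level_params(kv_levels)

print(baseline_q, hsa_q)    # 393216 307200
print(baseline_kv, hsa_kv)  # 393216 270336
```

Under these assumed sizes, HSA's QKV projections use fewer parameters than the baseline at the same head_dim, consistent with the bullet above; the actual savings depend on the model's real dimensions.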