PR #1264

open

[non-record track] Hierarchical Shared Attention (HSA): multi-level sharing across attention heads

by andrewmouldon
val_bpb: 1.2225
Architecture: Transformer
Optimizer:
Artifact Size: 15887509 bytes

Training Techniques

Architecture
attention projections
Implements Hierarchical Shared Attention (HSA): multi-level sharing across attention heads that decomposes each head's features into shared, group-shared, and head-specific components.
parameters: {"q_levels":[[2,8],[4,16],[8,40]],"kv_levels":[[1,16],[2,16],[4,32]]}
KV head count
Uses hierarchical sharing in key/value projections to reduce redundancy and KV-cache duplication while preserving specialization.
parameters: {"kv_heads":4,"head_dim":64}
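A minimal sketch of the hierarchical projection described above, under stated assumptions: each level is a `(groups, dim)` pair in which heads split evenly into `groups` groups and share that group's projection, and the per-level dims concatenate to `head_dim` (for the PR's `kv_levels`, 16 + 16 + 32 = 64). The function name `hsa_projection` and the weight initialization are hypothetical, not from the PR; the learned per-head scaling is fixed to 1 here.

```python
import numpy as np

def hsa_projection(x, levels, n_heads, rng):
    """Hierarchical shared projection (sketch, not the PR's code).

    levels: list of (n_groups, dim) pairs. At each level, heads are
    split evenly into n_groups groups and all heads in a group share
    one d_model x dim projection. Concatenating the level outputs
    gives each head a head_dim = sum of level dims.
    """
    d_model = x.shape[-1]
    outs = []
    for n_groups, dim in levels:
        # one weight matrix per group at this level (random init for the sketch)
        W = rng.standard_normal((n_groups, d_model, dim)) * d_model ** -0.5
        heads_per_group = n_heads // n_groups
        # each head selects its group's shared projection
        level_out = np.stack(
            [x @ W[h // heads_per_group] for h in range(n_heads)], axis=-2
        )  # (..., n_heads, dim)
        outs.append(level_out)
    # a learned per-head scale would multiply here; fixed to 1 in the sketch
    return np.concatenate(outs, axis=-1)  # (..., n_heads, head_dim)

# PR's kv_levels: fully shared (1 group), group-shared (2), head-specific (4)
x = np.ones((3, 128))  # (seq, d_model); sizes are hypothetical
kv = hsa_projection(x, [(1, 16), (2, 16), (4, 32)], n_heads=4,
                    rng=np.random.default_rng(0))
print(kv.shape)  # (3, 4, 64)
```

The first 16 output dims come from the 1-group level, so they are identical across all 4 KV heads; the last 32 dims are head-specific, which is the MQA-to-head-specific spectrum the entry describes.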
Regularization
weight decay
parameters: null

Novel Contributions

  • Introduces Hierarchical Shared Attention (HSA) for multi-level sharing across attention heads.
  • Combines MQA-style, GQA-style, and head-specific features within a single attention projection hierarchy.
  • Uses learned per-head scaling to specialize shared representations with minimal cost.
  • Reduces parameter count in QKV projections and KV-cache size while preserving expressivity.
  • Shows improved BPB over baseline under 10k fixed-step training with a consistent parameter budget.
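The parameter-count claim can be illustrated with a back-of-the-envelope comparison. The level specs and `kv_heads`/`head_dim` come from the PR's parameters; `d_model = 768` and 8 query heads are assumptions for illustration (8 matches the head-specific group count in `q_levels`).

```python
# Hypothetical d_model; the PR does not state it. 8 query heads is
# inferred from q_levels' last entry; kv_heads and head_dim are from
# the PR's parameters.
d_model, head_dim, q_heads, kv_heads = 768, 64, 8, 4

q_levels = [(2, 8), (4, 16), (8, 40)]    # (groups, dim) per level
kv_levels = [(1, 16), (2, 16), (4, 32)]

def level_params(levels):
    # each level stores one d_model x dim matrix per group
    return d_model * sum(g * d for g, d in levels)

baseline_q = d_model * q_heads * head_dim          # standard multi-head Q
baseline_kv = 2 * d_model * kv_heads * head_dim    # GQA-style K and V
hsa_q = level_params(q_levels)
hsa_kv = 2 * level_params(kv_levels)

print(baseline_q, hsa_q)    # 393216 307200
print(baseline_kv, hsa_kv)  # 393216 270336
```

Under these assumed sizes, HSA's QKV projections use fewer parameters than the baseline at the same head_dim, consistent with the bullet above; the actual savings depend on the model's real dimensions.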