PR #1774

open

Record: 12L Shared-Specific Attention (d=16) + MLP 4.5x (3-seed mean val_bpb 1.0981)

by aruniyer
val_bpb: 1.0981
Architecture: Transformer
Optimizer:
Artifact Size: ~15.99 MB

Training Techniques

Architecture
Shared-Specific Attention
Splits Q/K projections into shared and specific dimensions, averaging the shared portion across heads to reduce artifact size with minimal BPB cost.
parameters: {"shared_head_dim":16,"specific_dim":48}
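A minimal numpy sketch of the shared-specific Q/K projection described above. The shapes follow the PR parameters (shared_head_dim=16, specific_dim=48); after averaging, the shared portion is identical for every head, so it is stored once and broadcast. All function and variable names here are hypothetical, not taken from the PR's code.

```python
import numpy as np

def shared_specific_qk(x, w_shared, w_specific):
    """Build per-head Q (or K) from one shared projection plus per-head
    specific projections (a sketch of shared-specific attention).

    x:          (seq, d_model)
    w_shared:   (d_model, shared_head_dim)        stored once for all heads
    w_specific: (n_heads, d_model, specific_dim)  stored per head
    returns:    (n_heads, seq, shared_head_dim + specific_dim)
    """
    n_heads = w_specific.shape[0]
    shared = x @ w_shared                              # computed once, (seq, shared_head_dim)
    specific = np.einsum("sd,hdk->hsk", x, w_specific)  # per-head, (n_heads, seq, specific_dim)
    shared_b = np.broadcast_to(shared, (n_heads,) + shared.shape)
    return np.concatenate([shared_b, specific], axis=-1)
```

The artifact saving comes from storing `d_model * (shared_head_dim + n_heads * specific_dim)` Q/K weights instead of `d_model * n_heads * head_dim`; with 16 shared and 48 specific dims per 64-dim head, each head contributes only 48 unique projection columns.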
depth
Increased model depth by adding one physical layer.
parameters: {"layers":12}
MLP 4.5x
Widened the MLP to use more of the artifact budget.
parameters: {"multiplier":4.5}
RoPE
RoPE positional encoding applied only to the specific dimensions of attention.
parameters: {"dimensions":16}
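A hedged sketch of applying rotary embeddings to only part of each head, as described above. The PR states that 16 dimensions get RoPE; which slice of the specific dimensions is rotated is an assumption here, and the half-split rotation layout is one common RoPE convention, not necessarily the PR's.

```python
import numpy as np

def rope_specific(q, rope_dims=16, base=10000.0):
    """Apply RoPE to only the last `rope_dims` dims of each head
    (assumed slice); the remaining dims pass through unchanged.

    q: (n_heads, seq, head_dim) -> same shape
    """
    n_heads, seq, head_dim = q.shape
    keep, rot = q[..., :-rope_dims], q[..., -rope_dims:]
    half = rope_dims // 2
    pos = np.arange(seq)[:, None]                    # (seq, 1)
    inv_freq = base ** (-np.arange(half) / half)     # (half,)
    ang = pos * inv_freq                             # (seq, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = rot[..., :half], rot[..., half:]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([keep, rotated], axis=-1)
```

Since rotation preserves the norm of each (x1, x2) pair, the per-head vector norms are unchanged, which makes the partial application easy to sanity-check.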
Evaluation
Sliding-window evaluation
parameters: null
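The PR does not specify its sliding-window parameters, so the following is only a generic sketch of how sliding-window BPB evaluation is commonly done: score long sequences with overlapping windows, counting loss only for tokens past the overlap so each scored token has substantial left context. `nll_fn`, `window`, and `stride` are all hypothetical.

```python
import numpy as np

def sliding_window_bpb(nll_fn, tokens, window=1024, stride=512):
    """Mean bits per token over `tokens` using overlapping windows.

    nll_fn(ctx) -> per-token negative log-likelihood in nats (hypothetical
    model call). Each window re-scores the overlap for context, but only
    the final (end - start) - (window - stride) tokens are counted.
    """
    total_nll, total_count = 0.0, 0
    start = 0
    while start < len(tokens):
        end = min(start + window, len(tokens))
        nll = nll_fn(tokens[start:end])
        new = (end - start) if start == 0 else (end - start) - (window - stride)
        total_nll += float(np.sum(nll[-new:]))
        total_count += new
        if end == len(tokens):
            break
        start += stride
    return total_nll / total_count / np.log(2.0)   # nats -> bits
```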

Novel Contributions

  • Introduced shared-specific attention to compress Q/K projections by averaging part of each head across heads.
  • Used the saved artifact budget to enable both an extra layer and a wider MLP within the 16 MB limit.
  • Demonstrated a 12-layer no-TTT model with strong sliding-window validation performance.
  • Showed that shared-specific attention can reduce artifact size with near-zero BPB cost.