PR #1774
Record: 12L Shared-Specific Attention (d=16) + MLP 4.5x (3-seed mean val_bpb 1.0981)
by aruniyer
val_bpb: 1.0981
Architecture: Transformer
Optimizer: —
Artifact Size: ~15.99 MB
Training Techniques
Architecture
Shared-Specific Attention
Splits Q/K projections into shared and specific dimensions (16 shared + 48 specific = 64 dims per head), averaging the shared portion across heads to reduce artifact size with minimal BPB cost (see the sketch below).
parameters: {"shared_head_dim":16,"specific_dim":48}
depth
Increased model depth by adding one physical layer.
parameters: {"layers":12}
MLP 4.5x
Widened the MLP hidden layer to 4.5x the model width (vs. the conventional 4x) to use more of the artifact budget (see the sketch below).
parameters: {"multiplier":4.5}
RoPE
RoPE positional encoding applied only to the specific dimensions of attention (see the sketch below).
parameters: {"dimensions":16}
Evaluation
sliding-window eval (see the sketch below)
parameters: null
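A sketch of one common sliding-window evaluation scheme, assuming a model that maps token ids to per-token logits: windows of `block_size` tokens advance by `stride`, and each window scores only the targets no earlier window covered, so every prediction gets substantial left context. The function signature, window sizes, and byte normalization are assumptions; the record does not specify the harness.

```python
import math
import torch

@torch.no_grad()
def sliding_window_bpb(model, token_ids, n_bytes, block_size=1024, stride=512):
    """Bits per byte over a long token stream via overlapping windows."""
    device = next(model.parameters()).device
    ids = torch.tensor(token_ids, device=device)
    nll, scored_to = 0.0, 0
    for begin in range(0, len(ids) - 1, stride):
        end = min(begin + block_size, len(ids) - 1)
        x = ids[begin:end][None]          # (1, T) inputs
        y = ids[begin + 1:end + 1][None]  # (1, T) next-token targets
        logits = model(x)                 # (1, T, vocab) assumed
        logp = torch.log_softmax(logits.float(), dim=-1)
        token_logp = logp.gather(-1, y[..., None]).squeeze(-1)  # (1, T)
        fresh = end - max(scored_to, begin)  # targets not scored by a prior window
        nll -= token_logp[0, -fresh:].sum().item()
        scored_to = end
        if end == len(ids) - 1:
            break
    return nll / math.log(2) / n_bytes  # nats -> bits, then per raw byte of val text
```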
Novel Contributions
- Introduced shared-specific attention to compress Q/K projections by averaging part of each head across heads.
- Used the saved artifact budget to enable both an extra layer and a wider MLP within the 16 MB limit.
- Demonstrated a 12-layer no-TTT model with strong sliding-window validation performance.
- Showed that shared-specific attention can reduce artifact size with near-zero BPB cost.