PR #1774

open

Record: 12L Shared-Specific Attention (d=16) + MLP 4.5x (3-seed mean val_bpb 1.0981)

by aruniyer
val_bpb: 1.0981
Architecture: Transformer
Optimizer:
Artifact Size: ~15.99 MB

Training Techniques

Architecture
Shared-Specific Attention
Splits Q/K projections into shared and specific dimensions, averaging the shared portion across heads to reduce artifact size with minimal BPB cost.
parameters: {"shared_head_dim":16,"specific_dim":48}
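A minimal numpy sketch of the shared-specific Q/K projection described above. The shapes follow the PR parameters (shared_head_dim=16, specific_dim=48); after averaging, the shared portion is identical for every head, so it is stored once and broadcast. All function and variable names here are hypothetical, not taken from the PR's code.

```python
import numpy as np

def shared_specific_qk(x, w_shared, w_specific):
    """Build per-head Q (or K) from one shared projection plus per-head
    specific projections (a sketch of shared-specific attention).

    x:          (seq, d_model)
    w_shared:   (d_model, shared_head_dim)        stored once for all heads
    w_specific: (n_heads, d_model, specific_dim)  stored per head
    returns:    (n_heads, seq, shared_head_dim + specific_dim)
    """
    n_heads = w_specific.shape[0]
    shared = x @ w_shared                              # computed once, (seq, shared_head_dim)
    specific = np.einsum("sd,hdk->hsk", x, w_specific)  # per-head, (n_heads, seq, specific_dim)
    shared_b = np.broadcast_to(shared, (n_heads,) + shared.shape)
    return np.concatenate([shared_b, specific], axis=-1)
```

The artifact saving comes from storing `d_model * (shared_head_dim + n_heads * specific_dim)` Q/K weights instead of `d_model * n_heads * head_dim`; with 16 shared and 48 specific dims per 64-dim head, each head contributes only 48 unique projection columns.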
depth
Increased model depth by adding one physical layer.
parameters: {"layers":12}
MLP 4.5x
Widened the MLP to use more of the artifact budget.
parameters: {"multiplier":4.5}
RoPE
RoPE positional encoding applied only to the specific dimensions of attention.
parameters: {"dimensions":16}
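A hedged sketch of applying rotary embeddings to only part of each head, as described above. The PR states that 16 dimensions get RoPE; which slice of the specific dimensions is rotated is an assumption here, and the half-split rotation layout is one common RoPE convention, not necessarily the PR's.

```python
import numpy as np

def rope_specific(q, rope_dims=16, base=10000.0):
    """Apply RoPE to only the last `rope_dims` dims of each head
    (assumed slice); the remaining dims pass through unchanged.

    q: (n_heads, seq, head_dim) -> same shape
    """
    n_heads, seq, head_dim = q.shape
    keep, rot = q[..., :-rope_dims], q[..., -rope_dims:]
    half = rope_dims // 2
    pos = np.arange(seq)[:, None]                    # (seq, 1)
    inv_freq = base ** (-np.arange(half) / half)     # (half,)
    ang = pos * inv_freq                             # (seq, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = rot[..., :half], rot[..., half:]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([keep, rotated], axis=-1)
```

Since rotation preserves the norm of each (x1, x2) pair, the per-head vector norms are unchanged, which makes the partial application easy to sanity-check.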
Evaluation
Sliding-window evaluation
parameters: null
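The PR does not specify its sliding-window parameters, so the following is only a generic sketch of how sliding-window BPB evaluation is commonly done: score long sequences with overlapping windows, counting loss only for tokens past the overlap so each scored token has substantial left context. `nll_fn`, `window`, and `stride` are all hypothetical.

```python
import numpy as np

def sliding_window_bpb(nll_fn, tokens, window=1024, stride=512):
    """Mean bits per token over `tokens` using overlapping windows.

    nll_fn(ctx) -> per-token negative log-likelihood in nats (hypothetical
    model call). Each window re-scores the overlap for context, but only
    the final (end - start) - (window - stride) tokens are counted.
    """
    total_nll, total_count = 0.0, 0
    start = 0
    while start < len(tokens):
        end = min(start + window, len(tokens))
        nll = nll_fn(tokens[start:end])
        new = (end - start) if start == 0 else (end - start) - (window - stride)
        total_nll += float(np.sum(nll[-new:]))
        total_count += new
        if end == len(tokens):
            break
        start += stride
    return total_nll / total_count / np.log(2.0)   # nats -> bits
```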

Novel Contributions

  • Introduced shared-specific attention to compress Q/K projections by averaging part of each head across heads.
  • Used the saved artifact budget to enable both an extra layer and a wider MLP within the 16 MB limit.
  • Demonstrated a 12-layer no-TTT model with strong sliding-window validation performance.
  • Showed that shared-specific attention can reduce artifact size with near-zero BPB cost.