PR #1025

Status: open

non-record: MASA low-rank shared attention + SwiGLU, 1.3579 BPB

by Zagot-byteView on GitHub

val_bpb: 1.3579
Architecture: Transformer
Optimizer: (not specified)
Artifact Size: 20.98 MB

Training Techniques

Architecture
shared attention
All 11 layers share a set of low-rank base matrices instead of unique Q/K/V/O weights per layer; each layer learns mix coefficients.
parameters: {"layers":11,"bases":10,"rank":128}
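The shared-attention scheme above can be sketched in NumPy. Sizes here are toy (the PR reports layers=11, bases=10, rank=128), and the exact factorization and coefficient parameterization are assumptions, not the PR's actual code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes for illustration; the PR uses d-model-sized matrices with rank=128.
d_model, rank, n_bases, n_layers = 64, 4, 10, 11

# Shared low-rank "atoms": base i is the rank-`rank` matrix U[i] @ V[i].
U = rng.normal(size=(n_bases, d_model, rank)) / np.sqrt(d_model)
V = rng.normal(size=(n_bases, rank, d_model)) / np.sqrt(rank)

# Per-layer mixing coefficients, one set per projection (Q, K, V, O).
# These are the only attention parameters unique to a layer.
coef = rng.normal(size=(n_layers, 4, n_bases))

def layer_projections(layer_idx):
    """Assemble a layer's Q/K/V/O matrices as coefficient-weighted
    sums of the shared low-rank atoms."""
    atoms = np.einsum('ndr,nrm->ndm', U, V)                   # (n_bases, d_model, d_model)
    return np.einsum('pn,ndm->pdm', coef[layer_idx], atoms)   # (4, d_model, d_model)

Wq, Wk, Wv, Wo = layer_projections(0)

# Parameter comparison: shared bases + per-layer coefficients vs. unique Q/K/V/O per layer.
shared_params = n_bases * 2 * d_model * rank + n_layers * 4 * n_bases
baseline_params = n_layers * 4 * d_model * d_model
```

Each assembled projection has rank at most `n_bases * rank`, which is where the parameter savings over per-layer dense Q/K/V/O weights comes from.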
MLP3x
Replaces the baseline's squared-ReLU MLP with a SwiGLU MLP at 3x expansion.
parameters: {"multiplier":3}
SwiGLU
SwiGLU activation in the MLP.
parameters: {"hidden":341}
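A minimal sketch of the SwiGLU MLP described above, in NumPy. The hidden width 341 is taken from the PR's parameters; `d_model = 128` and the weight scale are illustrative assumptions:

```python
import numpy as np

def silu(z):
    """SiLU (swish) activation: z * sigmoid(z)."""
    return z / (1.0 + np.exp(-z))

def swiglu_mlp(x, w_gate, w_up, w_down):
    """SwiGLU MLP: gated hidden state silu(x @ W_gate) * (x @ W_up),
    projected back down to d_model."""
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

rng = np.random.default_rng(0)
d_model, d_hidden = 128, 341      # hidden=341 from the PR; d_model is a toy assumption
w_gate = rng.normal(size=(d_model, d_hidden)) * 0.02
w_up   = rng.normal(size=(d_model, d_hidden)) * 0.02
w_down = rng.normal(size=(d_hidden, d_model)) * 0.02

x = rng.normal(size=(4, d_model))  # a batch of 4 token vectors
y = swiglu_mlp(x, w_gate, w_up, w_down)
```

Note that SwiGLU carries three weight matrices (gate, up, down) rather than two, which is why gated MLPs usually shrink the hidden width relative to an ungated MLP of the same parameter budget.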
KV head count
Uses 8 attention heads and 8 KV heads (one KV head per query head, i.e. standard multi-head attention rather than grouped-query attention).
parameters: {"heads":8,"kv_heads":8}
LR Schedule
warmdown
parameters: {"warmdown_start":16000,"iterations":20000}
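The warmdown schedule above can be sketched as follows. The PR only specifies the two step counts (warmdown_start=16000, iterations=20000); the linear decay shape and `base_lr` value are assumptions:

```python
def warmdown_lr(step, base_lr=1e-3, warmdown_start=16000, iterations=20000):
    """Constant LR until `warmdown_start`, then linear decay to zero
    at `iterations`. Linear shape is an assumption; the PR only gives
    the two step counts."""
    if step < warmdown_start:
        return base_lr
    remaining = max(iterations - step, 0)
    return base_lr * remaining / (iterations - warmdown_start)
```

For example, at step 18000 (halfway through the warmdown window) the learning rate is half of `base_lr`, and it reaches zero exactly at step 20000.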
Sequence Length
train_length: 512
eval_length: null

Novel Contributions

  • MASA (Matrix Atom Sharing Attention): low-rank base matrices shared across all 11 layers, improving parameter efficiency
  • Per-layer mixing coefficients in place of separate Q/K/V/O weights
  • SwiGLU MLP in place of the baseline MLP
  • Warmdown fix for learning rate decay