PR #607 (open)
11L AttnRes + Gated Attention + Looped Blocks + EMA + Cosine + QAT
by NeopolitaView on GitHub
val_bpb
1.4750
Architecture
Transformer
Optimizer
—
Artifact Size
13.7MB
Training Techniques
Architecture
Block Attention Residuals
Replaces the fixed skip_weights vector with learned depth routing: softmax attention over all encoder-block outputs, using a learned pseudo-query per decoder layer
parameters: null
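A minimal sketch of what "learned depth routing" could look like, assuming a U-net-style stack where each decoder layer mixes the stored encoder-block outputs; the class and attribute names (`BlockAttnResidual`, `pseudo_query`, `key_proj`) are illustrative, not taken from the PR:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BlockAttnResidual(nn.Module):
    """One decoder layer's learned routing over all encoder-block outputs."""
    def __init__(self, d_model: int):
        super().__init__()
        # learned pseudo-query for this decoder layer (replaces one fixed skip weight row)
        self.pseudo_query = nn.Parameter(torch.randn(d_model) * d_model**-0.5)
        self.key_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, enc_outputs: list) -> torch.Tensor:
        stacked = torch.stack(enc_outputs)              # (L, B, T, D)
        keys = self.key_proj(stacked.mean(dim=(1, 2)))  # (L, D): one key per encoder block
        w = F.softmax(keys @ self.pseudo_query, dim=0)  # (L,): softmax depth-routing weights
        return torch.einsum("l,lbtd->btd", w, stacked)  # weighted residual input
```

The softmax over depth means the routing weights stay normalized, unlike a free-form learned skip_weights vector.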
Per-head gated attention
Learnable sigmoid gate per attention head to prevent attention-sink pathology
parameters: null
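The per-head gate admits a very small sketch, assuming the gate multiplies each head's attention output (the `HeadGate` name is illustrative):

```python
import torch
import torch.nn as nn

class HeadGate(nn.Module):
    """Learnable sigmoid gate per attention head."""
    def __init__(self, n_heads: int):
        super().__init__()
        self.gate = nn.Parameter(torch.zeros(n_heads))  # sigmoid(0) = 0.5 at init

    def forward(self, attn_out: torch.Tensor) -> torch.Tensor:
        # attn_out: (B, H, T, Dh); a gate near 0 lets a head switch itself off,
        # so it need not park attention mass on a "sink" token to stay quiet
        return attn_out * torch.sigmoid(self.gate).view(1, -1, 1, 1)
```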
Looped middle blocks
Layers 4-7 run twice per forward pass, adding compute depth without increasing parameters
parameters: {"layers":"4-7","repeat":2}
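The looping itself is a few lines in the forward pass; a sketch with generic callable blocks (the function name and signature are illustrative):

```python
def forward_with_loops(x, blocks, loop_layers=range(4, 8), repeat=2):
    # layers 4-7 (inclusive) are applied `repeat` times per forward pass,
    # adding effective depth while reusing the same weights
    for i, block in enumerate(blocks):
        for _ in range(repeat if i in loop_layers else 1):
            x = block(x)
    return x
```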
Weight Averaging
EMA
parameters: {"decay":0.995}
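The EMA update is the standard one; shown here on plain floats for clarity (a real implementation would iterate over `model.state_dict()` tensors):

```python
def ema_update(shadow, params, decay=0.995):
    # shadow <- decay * shadow + (1 - decay) * params, elementwise;
    # with decay 0.995, each step moves the average 0.5% toward current weights
    return [decay * s + (1 - decay) * p for s, p in zip(shadow, params)]
```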
LR Schedule
cosine decay
parameters: null
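Cosine decay in place of a linear warmdown is one formula; a sketch, with `min_lr` as an assumed optional floor since the PR does not state one:

```python
import math

def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    # cosine decay from base_lr at step 0 to min_lr at total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * step / total_steps))
```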
Quantization
QAT
bits: 8
scope: per-row
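A minimal sketch of the fake-quantization step such a QAT scheme would apply to weight matrices, assuming symmetric int8 with one scale per row (the function name is illustrative; in training, `round` would use a straight-through estimator so gradients pass unchanged):

```python
import numpy as np

def fake_quant_per_row(w: np.ndarray) -> np.ndarray:
    # one symmetric int8 scale per weight row, quantize then dequantize
    scale = np.maximum(np.abs(w).max(axis=1, keepdims=True), 1e-8) / 127.0
    q = np.clip(np.round(w / scale), -127, 127)
    return q * scale
```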
Novel Contributions
- Block Attention Residuals replacing fixed skip_weights with learned depth routing
- Per-head gated attention to prevent attention-sink pathology
- Looped middle blocks (layers 4-7 run twice) for zero-param compute depth
- EMA weight averaging with decay 0.995
- Cosine learning rate decay replacing linear warmdown
- Quantization Aware Training (QAT) simulating int8 per-row quantization in last 15% of training