PR #2029

open

[Non-record] Tokenformer (Pattention): 40% smaller artifact at matched params — first

by alexdwu13
val_bpb: 1.3771
Architecture: Transformer
Optimizer: Muon
Artifact Size: 9,579,528 bytes

Training Techniques

Architecture
Pattention
Replaces the dense MLP linear layers with cross-attention over learnable parameter tokens (K/V tables), while the attention layers remain dense (see the sketch after this section).
parameters: {"p_ratio":1,"p_tokens":341,"scope":"MLP fc/proj only"}
ReLU²
Uses relu^2 activation between the two Pattention layers.
parameters: null
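
A minimal PyTorch sketch of the block described above, assuming a Tokenformer-style Pattention layer (input activations act as queries against learnable K/V parameter-token tables) with p_tokens=341 and ReLU² between the fc and proj replacements. The scoring/normalization function, initialization, and class names are illustrative assumptions, not taken from the PR.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Pattention(nn.Module):
    """Drop-in replacement for nn.Linear(d_in, d_out): cross-attention where the
    input activations are queries and the 'weights' are learnable parameter tokens."""
    def __init__(self, d_in: int, d_out: int, p_tokens: int = 341):
        super().__init__()
        # Learnable K/V parameter-token tables replacing a (d_in x d_out) weight matrix.
        self.key = nn.Parameter(torch.randn(p_tokens, d_in) * d_in ** -0.5)
        self.value = nn.Parameter(torch.randn(p_tokens, d_out) * d_out ** -0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., d_in) -> scores over parameter tokens: (..., p_tokens)
        scores = (x @ self.key.t()) * x.size(-1) ** -0.5
        # Tokenformer uses a modified (GeLU-based) normalization; plain softmax
        # is used here purely for illustration.
        attn = F.softmax(scores, dim=-1)
        return attn @ self.value  # (..., d_out)

class PattentionMLP(nn.Module):
    """MLP block with the fc/proj linears swapped for Pattention and ReLU^2 in between."""
    def __init__(self, d_model: int, d_hidden: int, p_tokens: int = 341):
        super().__init__()
        self.fc = Pattention(d_model, d_hidden, p_tokens)
        self.proj = Pattention(d_hidden, d_model, p_tokens)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(F.relu(self.fc(x)) ** 2)  # ReLU^2 activation
```

Per the record, the swap is done at matched parameter count (p_ratio = 1) and applies to the MLP fc/proj layers only.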
Quantization
int8
bits: 8
scope: model weights
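
A sketch of the int8 step under a common per-tensor symmetric scheme (scale = max|w| / 127); the record only states 8-bit quantization of model weights, so the exact scheme and rounding are assumptions.

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Symmetric per-tensor int8 quantization: returns int8 codes and a float scale."""
    scale = w.abs().max().clamp(min=1e-12) / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale.item()

def dequantize_int8(q: torch.Tensor, scale: float) -> torch.Tensor:
    """Recover approximate float weights for evaluation."""
    return q.to(torch.float32) * scale
```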
Compression
zlib
level: null
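
A sketch of how the quantized weights could be packed into the compressed artifact with zlib; the serialization container and compression level are assumptions (the record leaves the level unspecified).

```python
import io
import zlib
import torch

def pack_artifact(int8_state: dict, level: int = 9) -> bytes:
    """Serialize the int8 tensors (plus scales) and zlib-compress the bytes."""
    buf = io.BytesIO()
    torch.save(int8_state, buf)
    return zlib.compress(buf.getvalue(), level)

# artifact = pack_artifact(quantized_state_dict)
# print(f"compressed artifact size: {len(artifact):,} bytes")
```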
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"adam_split":true,"auto_routes_2d_params_to_muon":true}
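
A sketch of the parameter split implied by other_params: 2-D weight matrices are routed to Muon and the remaining parameters to an Adam-style optimizer. The Muon constructor signature, the learning rates, and whether embedding/head matrices are excluded from Muon are assumptions.

```python
import torch

def build_optimizers(model: torch.nn.Module, muon_cls,
                     muon_lr: float = 0.02, adam_lr: float = 3e-4):
    """Split parameters: 2-D weight matrices -> Muon, everything else -> AdamW."""
    muon_params, adam_params = [], []
    for p in model.parameters():
        if not p.requires_grad:
            continue
        # Muon operates on 2-D matrices; scalars and vectors (norms, biases) go to AdamW.
        # Whether embedding/head matrices are also excluded from Muon is an assumption.
        (muon_params if p.ndim == 2 else adam_params).append(p)
    return [muon_cls(muon_params, lr=muon_lr),
            torch.optim.AdamW(adam_params, lr=adam_lr)]
```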
LR Schedule
warmdown
parameters: null
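
A sketch of a warmdown schedule (constant learning rate followed by a linear decay over the final steps); the warmdown fraction and final multiplier are assumptions, since the record lists no parameters.

```python
def warmdown_lr_mult(step: int, total_steps: int,
                     warmdown_frac: float = 0.2, final_mult: float = 0.0) -> float:
    """Constant LR, then linear decay over the final warmdown_frac of training."""
    warmdown_steps = max(1, int(total_steps * warmdown_frac))
    start = total_steps - warmdown_steps
    if step < start:
        return 1.0
    progress = min(1.0, (step - start) / warmdown_steps)
    return 1.0 + (final_mult - 1.0) * progress
```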

Novel Contributions

  • First-known Tokenformer/Pattention attempt in the challenge
  • Matched-parameter Pattention swap reduces compressed artifact size by about 40%
  • Artifact budget savings come primarily from improved zlib compressibility of Pattention K/V tables
  • Demonstrates that the same int8+zlib pipeline can yield much smaller artifacts without changing parameter count