PR #2029

open

[Non-record] Tokenformer (Pattention): 40% smaller artifact at matched params — first

by alexdwu13
val_bpb: 1.3771
Architecture: Transformer
Optimizer: Muon
Artifact Size: 9,579,528 bytes

Training Techniques

Architecture
Pattention
Replaces the dense MLP linear layers with cross-attention over learnable parameter tokens (K/V tables), while the attention layers remain dense (see the sketch after this section).
parameters: {"p_ratio":1,"p_tokens":341,"scope":"MLP fc/proj only"}
ReLU²
Uses relu^2 activation between the two Pattention layers.
parameters: null
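
A minimal PyTorch sketch of the block described above, assuming a Tokenformer-style Pattention layer (input activations act as queries against learnable K/V parameter-token tables) with p_tokens=341 and ReLU² between the fc and proj replacements. The scoring/normalization function, initialization, and class names are illustrative assumptions, not taken from the PR.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Pattention(nn.Module):
    """Drop-in replacement for nn.Linear(d_in, d_out): cross-attention where the
    input activations are queries and the 'weights' are learnable parameter tokens."""
    def __init__(self, d_in: int, d_out: int, p_tokens: int = 341):
        super().__init__()
        # Learnable K/V parameter-token tables replacing a (d_in x d_out) weight matrix.
        self.key = nn.Parameter(torch.randn(p_tokens, d_in) * d_in ** -0.5)
        self.value = nn.Parameter(torch.randn(p_tokens, d_out) * d_out ** -0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., d_in) -> scores over parameter tokens: (..., p_tokens)
        scores = (x @ self.key.t()) * x.size(-1) ** -0.5
        # Tokenformer uses a modified (GeLU-based) normalization; plain softmax
        # is used here purely for illustration.
        attn = F.softmax(scores, dim=-1)
        return attn @ self.value  # (..., d_out)

class PattentionMLP(nn.Module):
    """MLP block with the fc/proj linears swapped for Pattention and ReLU^2 in between."""
    def __init__(self, d_model: int, d_hidden: int, p_tokens: int = 341):
        super().__init__()
        self.fc = Pattention(d_model, d_hidden, p_tokens)
        self.proj = Pattention(d_hidden, d_model, p_tokens)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(F.relu(self.fc(x)) ** 2)  # ReLU^2 activation
```

Per the record, the swap is done at matched parameter count (p_ratio = 1) and applies to the MLP fc/proj layers only.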
Quantization
int8
bits: 8
scope: model weights
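
A sketch of the int8 step under a common per-tensor symmetric scheme (scale = max|w| / 127); the record only states 8-bit quantization of model weights, so the exact scheme and rounding are assumptions.

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Symmetric per-tensor int8 quantization: returns int8 codes and a float scale."""
    scale = w.abs().max().clamp(min=1e-12) / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale.item()

def dequantize_int8(q: torch.Tensor, scale: float) -> torch.Tensor:
    """Recover approximate float weights for evaluation."""
    return q.to(torch.float32) * scale
```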
Compression
zlib
level: null
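
A sketch of how the quantized weights could be packed into the compressed artifact with zlib; the serialization container and compression level are assumptions (the record leaves the level unspecified).

```python
import io
import zlib
import torch

def pack_artifact(int8_state: dict, level: int = 9) -> bytes:
    """Serialize the int8 tensors (plus scales) and zlib-compress the bytes."""
    buf = io.BytesIO()
    torch.save(int8_state, buf)
    return zlib.compress(buf.getvalue(), level)

# artifact = pack_artifact(quantized_state_dict)
# print(f"compressed artifact size: {len(artifact):,} bytes")
```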
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"adam_split":true,"auto_routes_2d_params_to_muon":true}
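
A sketch of the parameter split implied by other_params: 2-D weight matrices are routed to Muon and the remaining parameters to an Adam-style optimizer. The Muon constructor signature, the learning rates, and whether embedding/head matrices are excluded from Muon are assumptions.

```python
import torch

def build_optimizers(model: torch.nn.Module, muon_cls,
                     muon_lr: float = 0.02, adam_lr: float = 3e-4):
    """Split parameters: 2-D weight matrices -> Muon, everything else -> AdamW."""
    muon_params, adam_params = [], []
    for p in model.parameters():
        if not p.requires_grad:
            continue
        # Muon operates on 2-D matrices; scalars and vectors (norms, biases) go to AdamW.
        # Whether embedding/head matrices are also excluded from Muon is an assumption.
        (muon_params if p.ndim == 2 else adam_params).append(p)
    return [muon_cls(muon_params, lr=muon_lr),
            torch.optim.AdamW(adam_params, lr=adam_lr)]
```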
LR Schedule
warmdown
parameters: null
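
A sketch of a warmdown schedule (constant learning rate followed by a linear decay over the final steps); the warmdown fraction and final multiplier are assumptions, since the record lists no parameters.

```python
def warmdown_lr_mult(step: int, total_steps: int,
                     warmdown_frac: float = 0.2, final_mult: float = 0.0) -> float:
    """Constant LR, then linear decay over the final warmdown_frac of training."""
    warmdown_steps = max(1, int(total_steps * warmdown_frac))
    start = total_steps - warmdown_steps
    if step < start:
        return 1.0
    progress = min(1.0, (step - start) / warmdown_steps)
    return 1.0 + (final_mult - 1.0) * progress
```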

Novel Contributions

  • First-known Tokenformer/Pattention attempt in the challenge
  • Matched-parameter Pattention swap reduces compressed artifact size by about 40%
  • Artifact budget savings come primarily from improved zlib compressibility of Pattention K/V tables
  • Demonstrates that the same int8+zlib pipeline can yield much smaller artifacts without changing parameter count