PR #932

open

Non-record: CoDA-GQA Differential Attention — First Differential Attention Submission (val_bpb=1.1580)

by anthony-maio
val_bpb
1.1580
Architecture
Transformer
Optimizer
Muon
Artifact Size

Training Techniques

Architecture
GQA
Grouped query attention base architecture with differential attention added via CoDA-GQA.
parameters: {"layers":11,"d_model":512,"heads":8,"kv_heads":4}
differential attention
Subtracts a gated inhibitory noise attention stream from the signal attention stream using an orthogonally rotated noise query.
parameters: null
Partial RoPE
Uses partial rotary positional embeddings.
parameters: {"train_length":64,"eval_length":16}
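A minimal sketch of partial RoPE as commonly implemented: rotary embeddings are applied to only a leading slice of each head's channels, and the rest pass through unrotated. The listed parameters don't say how many channels are rotated, so `rotary_dims` below is an illustrative assumption, not the PR's setting.

```python
import numpy as np

def partial_rope(x, rotary_dims, base=10000.0):
    """Apply rotary position embeddings to only the first `rotary_dims`
    channels; the remaining channels pass through position-independent.
    x: (seq_len, head_dim) queries or keys for one head."""
    seq_len, head_dim = x.shape
    assert rotary_dims % 2 == 0 and rotary_dims <= head_dim
    half = rotary_dims // 2
    # Standard RoPE frequency ladder for the rotated slice only.
    inv_freq = 1.0 / (base ** (np.arange(half) / half))
    angles = np.arange(seq_len)[:, None] * inv_freq[None, :]  # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rotary_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    # Unrotated tail carries content unaffected by position.
    return np.concatenate([rotated, x[:, rotary_dims:]], axis=-1)
```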
BigramHash
Bigram hash embedding component.
parameters: {"dimensions":2048}
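A hedged sketch of a bigram hash embedding: each (previous, current) token pair is hashed into a fixed-size table and the looked-up vector is added to the token embedding. Reading `"dimensions": 2048` as the number of hash buckets is an interpretation, not confirmed by the PR, and the multiplier hash below is illustrative rather than the submission's actual hash.

```python
import numpy as np

TABLE_SIZE = 2048  # assumed bucket count from "dimensions": 2048
D_MODEL = 512      # d_model from the architecture parameters

def bigram_hash_embed(token_ids, table):
    """Look up an extra embedding keyed by a hash of each (prev, cur)
    token bigram; position 0 is padded with token id 0."""
    prev = np.concatenate([[0], token_ids[:-1]])
    bucket = (prev * 1000003 + token_ids) % TABLE_SIZE  # illustrative hash
    return table[bucket]  # (seq_len, D_MODEL), added to token embeddings

rng = np.random.default_rng(0)
table = rng.normal(scale=0.02, size=(TABLE_SIZE, D_MODEL))
emb = bigram_hash_embed(np.array([3, 7, 7, 42]), table)
```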
XSA
XSA attention-related architectural component.
parameters: null
VE128
Value residual / value embedding enhancement with VE128.
parameters: {"size":128}
SmearGate
SmearGate architectural component.
parameters: null
U-Net skip connections
U-Net style skip connections in the model.
parameters: null
LeakyReLU
LeakyReLU squared MLP activation.
parameters: {"slope":0.5}
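The activation above can be sketched as a plain square of the LeakyReLU output with the listed slope of 0.5. Whether the PR preserves the sign on the negative branch after squaring isn't stated, so the unsigned form below is an assumption.

```python
import numpy as np

def leaky_relu_squared(x, slope=0.5):
    """Squared LeakyReLU MLP activation: LeakyReLU(x, slope) ** 2.
    slope=0.5 matches the listing; sign handling is assumed."""
    lrelu = np.where(x >= 0, x, slope * x)
    return lrelu * lrelu
```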
Weight Averaging
EMA
parameters: {"decay":0.997}
Tight SWA
parameters: null
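The EMA entry above corresponds to the standard exponential moving average of the weights; a minimal per-step update with the listed decay of 0.997 looks like this (parameter names are illustrative):

```python
def ema_update(ema_params, params, decay=0.997):
    """One EMA step over a dict of weights:
    ema <- decay * ema + (1 - decay) * current.
    decay=0.997 matches the listing."""
    return {k: decay * ema_params[k] + (1.0 - decay) * params[k]
            for k in params}
```

At evaluation time the EMA weights replace the raw weights, which is typically what gets quantized and submitted.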
Quantization
GPTQ-lite
bits: 6
scope: model
late QAT
bits: null
scope: model
Compression
lzma
level: null
Regularization
LN scale
parameters: null
weight decay
parameters: {"value":0.04}

Novel Contributions

  • First differential attention submission to Parameter Golf
  • CoDA-GQA differential attention with orthogonally rotated noise query
  • No second W_q matrix needed for the noise stream
  • Gated subtraction of inhibitory noise attention from signal attention
  • Controlled ablation showing stable training but worse val_bpb under the 600-second budget
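The mechanism described above can be sketched as follows: the noise stream's query is an orthogonal rotation of the signal query (so no second W_q is learned), and a gated noise attention map is subtracted from the signal map before aggregating values. This is a hedged reading of the bullet points; CoDA-GQA's exact gating, normalization, and GQA head grouping may differ.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def coda_differential_attention(q, k, v, rot, gate):
    """Differential attention sketch per the description: the noise
    query is q @ rot with rot orthogonal (no second W_q), and the
    gated inhibitory noise map is subtracted from the signal map.
    q, k, v: (seq, d); rot: (d, d) orthogonal; gate: learned scalar."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    a_signal = softmax(q @ k.T * scale)
    a_noise = softmax((q @ rot) @ k.T * scale)  # rotated noise query
    return (a_signal - gate * a_noise) @ v
```

With `gate = 0` this reduces to plain softmax attention, which is a convenient ablation baseline.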