PR #932

open

Non-record: CoDA-GQA Differential Attention — First Differential Attention Submission (val_bpb=1.1580)

by anthony-maio
val_bpb
1.1580
Architecture
Transformer
Optimizer
Muon
Artifact Size

Training Techniques

Architecture
GQA
Grouped query attention base architecture with differential attention added via CoDA-GQA.
parameters: {"layers":11,"d_model":512,"heads":8,"kv_heads":4}
differential attention
Subtracts a gated inhibitory noise attention stream from the signal attention stream using an orthogonally rotated noise query.
parameters: null
Partial RoPE
Uses partial rotary positional embeddings.
parameters: {"train_length":64,"eval_length":16}
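A minimal sketch of partial RoPE as commonly implemented: rotary embeddings are applied to only a leading slice of each head's channels, and the rest pass through unrotated. The listed parameters don't say how many channels are rotated, so `rotary_dims` below is an illustrative assumption, not the PR's setting.

```python
import numpy as np

def partial_rope(x, rotary_dims, base=10000.0):
    """Apply rotary position embeddings to only the first `rotary_dims`
    channels; the remaining channels pass through position-independent.
    x: (seq_len, head_dim) queries or keys for one head."""
    seq_len, head_dim = x.shape
    assert rotary_dims % 2 == 0 and rotary_dims <= head_dim
    half = rotary_dims // 2
    # Standard RoPE frequency ladder for the rotated slice only.
    inv_freq = 1.0 / (base ** (np.arange(half) / half))
    angles = np.arange(seq_len)[:, None] * inv_freq[None, :]  # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rotary_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    # Unrotated tail carries content unaffected by position.
    return np.concatenate([rotated, x[:, rotary_dims:]], axis=-1)
```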
BigramHash
Bigram hash embedding component.
parameters: {"dimensions":2048}
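A hedged sketch of a bigram hash embedding: each (previous, current) token pair is hashed into a fixed-size table and the looked-up vector is added to the token embedding. Reading `"dimensions": 2048` as the number of hash buckets is an interpretation, not confirmed by the PR, and the multiplier hash below is illustrative rather than the submission's actual hash.

```python
import numpy as np

TABLE_SIZE = 2048  # assumed bucket count from "dimensions": 2048
D_MODEL = 512      # d_model from the architecture parameters

def bigram_hash_embed(token_ids, table):
    """Look up an extra embedding keyed by a hash of each (prev, cur)
    token bigram; position 0 is padded with token id 0."""
    prev = np.concatenate([[0], token_ids[:-1]])
    bucket = (prev * 1000003 + token_ids) % TABLE_SIZE  # illustrative hash
    return table[bucket]  # (seq_len, D_MODEL), added to token embeddings

rng = np.random.default_rng(0)
table = rng.normal(scale=0.02, size=(TABLE_SIZE, D_MODEL))
emb = bigram_hash_embed(np.array([3, 7, 7, 42]), table)
```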
XSA
XSA attention-related architectural component.
parameters: null
VE128
Value residual / value embedding enhancement with VE128.
parameters: {"size":128}
SmearGate
SmearGate architectural component.
parameters: null
U-Net skip connections
U-Net style skip connections in the model.
parameters: null
LeakyReLU
LeakyReLU squared MLP activation.
parameters: {"slope":0.5}
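The activation above can be sketched as a plain square of the LeakyReLU output with the listed slope of 0.5. Whether the PR preserves the sign on the negative branch after squaring isn't stated, so the unsigned form below is an assumption.

```python
import numpy as np

def leaky_relu_squared(x, slope=0.5):
    """Squared LeakyReLU MLP activation: LeakyReLU(x, slope) ** 2.
    slope=0.5 matches the listing; sign handling is assumed."""
    lrelu = np.where(x >= 0, x, slope * x)
    return lrelu * lrelu
```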
Weight Averaging
EMA
parameters: {"decay":0.997}
Tight SWA
parameters: null
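The EMA entry above corresponds to the standard exponential moving average of the weights; a minimal per-step update with the listed decay of 0.997 looks like this (parameter names are illustrative):

```python
def ema_update(ema_params, params, decay=0.997):
    """One EMA step over a dict of weights:
    ema <- decay * ema + (1 - decay) * current.
    decay=0.997 matches the listing."""
    return {k: decay * ema_params[k] + (1.0 - decay) * params[k]
            for k in params}
```

At evaluation time the EMA weights replace the raw weights, which is typically what gets quantized and submitted.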
Quantization
GPTQ-lite
bits: 6
scope: model
late QAT
bits: null
scope: model
Compression
lzma
level: null
Regularization
LN scale
parameters: null
weight decay
parameters: {"value":0.04}

Novel Contributions

  • First differential attention submission to Parameter Golf
  • CoDA-GQA differential attention with orthogonally rotated noise query
  • No second W_q matrix needed for the noise stream
  • Gated subtraction of inhibitory noise attention from signal attention
  • Controlled ablation showing stable training but worse val_bpb under the 600-second budget
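The mechanism described above can be sketched as follows: the noise stream's query is an orthogonal rotation of the signal query (so no second W_q is learned), and a gated noise attention map is subtracted from the signal map before aggregating values. This is a hedged reading of the bullet points; CoDA-GQA's exact gating, normalization, and GQA head grouping may differ.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def coda_differential_attention(q, k, v, rot, gate):
    """Differential attention sketch per the description: the noise
    query is q @ rot with rot orthogonal (no second W_q), and the
    gated inhibitory noise map is subtracted from the signal map.
    q, k, v: (seq, d); rot: (d, d) orthogonal; gate: learned scalar."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    a_signal = softmax(q @ k.T * scale)
    a_noise = softmax((q @ rot) @ k.T * scale)  # rotated noise query
    return (a_signal - gate * a_noise) @ v
```

With `gate = 0` this reduces to plain softmax attention, which is a convenient ablation baseline.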