PR #542

open

Non-Record: DG Attention, Differential-Gated Attention with Depth-Scheduled Novelty Encoding (val_bpb=1.1898)

by ddavidgao
val_bpb: 1.1898
Architecture: Transformer
Optimizer:
Artifact Size: 16.6MB

Training Techniques

Architecture
DG Attention
Novel attention mechanism in which deep layers transmit the novelty (difference) of token values relative to a causal baseline instead of raw content, with a hardcoded depth schedule for the β gate.
parameters: null
Flash Attention
Use of Flash Attention (F.scaled_dot_product_attention) for efficient scaled dot-product attention over the asymmetric Designator projections.
parameters: null
Other
Depth-scheduled gating of the payload between raw content and the differential signal, using a learned or hardcoded β per layer to encode novelty in deep layers.
parameters: null
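The mechanism described above can be sketched as a single PyTorch module. This is a hypothetical reconstruction from the PR description, not the author's code: the class name, the running-mean choice of causal baseline, and the fixed per-layer β are all assumptions; the PR's actual baseline and β schedule were discovered empirically.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DGAttention(nn.Module):
    """Sketch of Differential-Gated attention (names and baseline assumed).

    The value payload is gated between raw content and its *novelty*,
    i.e. its difference from a causal running baseline; β=0 transmits
    raw content, β=1 transmits pure differential signal.
    """
    def __init__(self, dim: int, n_heads: int, beta: float):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = dim // n_heads
        # Asymmetric "Designator" projections in place of standard Q/K.
        self.d_q = nn.Linear(dim, dim, bias=False)
        self.d_k = nn.Linear(dim, dim, bias=False)
        self.v = nn.Linear(dim, dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)
        # Depth-scheduled gate, hardcoded per layer in the PR.
        self.register_buffer("beta", torch.tensor(float(beta)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        v = self.v(x)
        # Causal running baseline: mean of values up to each position
        # (one possible choice; the PR does not pin this down here).
        counts = torch.arange(1, T + 1, device=x.device).view(1, T, 1)
        baseline = v.cumsum(dim=1) / counts
        # Hybrid payload: mix raw content with the differential signal.
        payload = (1.0 - self.beta) * v + self.beta * (v - baseline)
        q = self.d_q(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.d_k(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        p = payload.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        # Flash Attention path via PyTorch's fused SDPA kernel.
        y = F.scaled_dot_product_attention(q, k, p, is_causal=True)
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.out(y)
```

With β=0 this reduces to ordinary attention over raw values, which is why a per-layer schedule can smoothly move shallow layers toward content and deep layers toward novelty.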

Novel Contributions

  • Introduction of Differential-Gated (DG) Attention where deep layers transmit novelty (difference) of token values relative to a causal running baseline instead of raw content.
  • Asymmetric Designator (D_q/D_k) projections for matching tokens, distinct from standard QKV attention.
  • Empirically discovered, hardcoded depth schedule for the β gate, controlling the per-layer mixture of raw content and differential signal and preventing gate collapse.
  • Use of Flash Attention for efficient scaled dot-product attention in the DG mechanism.
  • Demonstration that differential payload encoding leads to a durable advantage in bits-per-byte (BPB) after mid-training despite initial slower convergence.
  • Distinction from Microsoft's Differential Transformer by differencing value payloads rather than attention score maps.
  • Hybrid payload formulation combining raw content and differential signal with learned β gating per layer.
  • Empirical analysis of β trajectories under different batch sizes and training conditions, motivating architectural hardcoding of depth schedule.
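Since the PR reports only that the β schedule is hardcoded per layer after empirical analysis, the exact values are unknown. A minimal sketch of the idea, assuming a linear ramp from raw content in shallow layers to mostly-differential payload in deep layers (the ramp shape and `max_beta` are illustrative assumptions):

```python
def beta_schedule(layer_idx: int, n_layers: int, max_beta: float = 0.9) -> float:
    """Hypothetical depth schedule for the DG Attention beta gate.

    Shallow layers (beta ~ 0) pass raw content; deep layers approach
    max_beta and transmit mostly the differential (novelty) signal.
    The linear ramp is an assumption; the PR hardcodes an empirically
    discovered schedule whose values are not given here.
    """
    return max_beta * layer_idx / max(n_layers - 1, 1)

# Example: beta per layer for a 12-layer model.
betas = [beta_schedule(i, 12) for i in range(12)]
```

Hardcoding such a schedule, rather than learning β, is what the PR credits with preventing gate collapse under varying batch sizes.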