PR #542

open

Non-Record: DG Attention, Differential-Gated Attention with Depth-Scheduled Novelty Encoding (val_bpb=1.1898)

by ddavidgao
val_bpb: 1.1898
Architecture: Transformer
Optimizer:
Artifact Size: 16.6MB

Training Techniques

Architecture
DG Attention
Novel attention mechanism in which deep layers transmit the novelty (difference) of token values relative to a causal baseline instead of raw content, with a hardcoded depth schedule for the β gate.
parameters: null
Flash Attention
Use of Flash Attention (F.scaled_dot_product_attention) for efficient scaled dot-product attention over the asymmetric Designator projections.
parameters: null
Other
Depth-scheduled gating of the payload between raw content and the differential signal, using a learned or hardcoded β per layer to encode novelty in deep layers.
parameters: null
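The mechanism described above can be sketched as a single PyTorch module. This is a hypothetical reconstruction from the PR description, not the author's code: the class name, the running-mean choice of causal baseline, and the fixed per-layer β are all assumptions; the PR's actual baseline and β schedule were discovered empirically.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DGAttention(nn.Module):
    """Sketch of Differential-Gated attention (names and baseline assumed).

    The value payload is gated between raw content and its *novelty*,
    i.e. its difference from a causal running baseline; β=0 transmits
    raw content, β=1 transmits pure differential signal.
    """
    def __init__(self, dim: int, n_heads: int, beta: float):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = dim // n_heads
        # Asymmetric "Designator" projections in place of standard Q/K.
        self.d_q = nn.Linear(dim, dim, bias=False)
        self.d_k = nn.Linear(dim, dim, bias=False)
        self.v = nn.Linear(dim, dim, bias=False)
        self.out = nn.Linear(dim, dim, bias=False)
        # Depth-scheduled gate, hardcoded per layer in the PR.
        self.register_buffer("beta", torch.tensor(float(beta)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        v = self.v(x)
        # Causal running baseline: mean of values up to each position
        # (one possible choice; the PR does not pin this down here).
        counts = torch.arange(1, T + 1, device=x.device).view(1, T, 1)
        baseline = v.cumsum(dim=1) / counts
        # Hybrid payload: mix raw content with the differential signal.
        payload = (1.0 - self.beta) * v + self.beta * (v - baseline)
        q = self.d_q(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.d_k(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        p = payload.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        # Flash Attention path via PyTorch's fused SDPA kernel.
        y = F.scaled_dot_product_attention(q, k, p, is_causal=True)
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.out(y)
```

With β=0 this reduces to ordinary attention over raw values, which is why a per-layer schedule can smoothly move shallow layers toward content and deep layers toward novelty.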

Novel Contributions

  • Introduction of Differential-Gated (DG) Attention where deep layers transmit novelty (difference) of token values relative to a causal running baseline instead of raw content.
  • Asymmetric Designator (D_q/D_k) projections for matching tokens, distinct from standard QKV attention.
  • Empirically discovered, hardcoded depth schedule for the β gate, controlling the per-layer mixture of raw content and differential signal and preventing gate collapse.
  • Use of Flash Attention for efficient scaled dot-product attention in the DG mechanism.
  • Demonstration that differential payload encoding leads to a durable advantage in bits-per-byte (BPB) after mid-training despite initial slower convergence.
  • Distinction from Microsoft's Differential Transformer by differencing value payloads rather than attention score maps.
  • Hybrid payload formulation combining raw content and differential signal with learned β gating per layer.
  • Empirical analysis of β trajectories under different batch sizes and training conditions, motivating architectural hardcoding of depth schedule.
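Since the PR reports only that the β schedule is hardcoded per layer after empirical analysis, the exact values are unknown. A minimal sketch of the idea, assuming a linear ramp from raw content in shallow layers to mostly-differential payload in deep layers (the ramp shape and `max_beta` are illustrative assumptions):

```python
def beta_schedule(layer_idx: int, n_layers: int, max_beta: float = 0.9) -> float:
    """Hypothetical depth schedule for the DG Attention beta gate.

    Shallow layers (beta ~ 0) pass raw content; deep layers approach
    max_beta and transmit mostly the differential (novelty) signal.
    The linear ramp is an assumption; the PR hardcodes an empirically
    discovered schedule whose values are not given here.
    """
    return max_beta * layer_idx / max(n_layers - 1, 1)

# Example: beta per layer for a 12-layer model.
betas = [beta_schedule(i, 12) for i in range(12)]
```

Hardcoding such a schedule, rather than learning β, is what the PR credits with preventing gate collapse under varying batch sizes.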