PR #542
Status: Open (Non-Record)
DG Attention: Differential-Gated Attention with Depth-Scheduled Novelty Encoding (val_bpb=1.1898)
by ddavidgao
val_bpb: 1.1898
Architecture: Transformer
Optimizer: —
Artifact Size: 16.6 MB
Training Techniques
Architecture
DG Attention
Novel attention mechanism in which deep layers transmit the novelty (difference) of token values relative to a causal baseline instead of raw content, with a hardcoded per-layer depth schedule for the β gate.
parameters: null
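The differential payload can be sketched as follows. This is a minimal illustration, not the PR's implementation: the PR does not specify the exact baseline estimator, so a causal running mean (including the current token) is assumed here.

```python
import numpy as np

def novelty_payload(v):
    """Replace each token's value with its novelty relative to a
    causal running-mean baseline (the choice of running mean as the
    baseline is an assumption; the PR only says 'causal baseline')."""
    T, _ = v.shape
    # causal cumulative mean: baseline[t] = mean(v[0..t])
    baseline = np.cumsum(v, axis=0) / np.arange(1, T + 1)[:, None]
    return v - baseline

# a token repeating the running context carries zero novelty;
# a surprising token carries a large differential signal
v = np.array([[1.0], [1.0], [4.0]])
print(novelty_payload(v))  # → [[0.], [0.], [2.]]
```

Under this baseline the first token always has zero novelty, so shallow layers (which pass raw content, see the β gating below in the record) still carry its information.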
Flash Attention
Use of Flash Attention (F.scaled_dot_product_attention) for efficient scaled dot-product attention over the asymmetric Designator (D_q/D_k) projections.
parameters: null
Other
Depth-scheduled gating of the payload between raw content and the differential signal, using a learned or hardcoded β per layer to encode novelty in deep layers.
parameters: null
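A minimal sketch of the per-layer gate, assuming the straightforward convex-mixture reading of "gating of the payload between raw content and the differential signal" (the actual schedule values in the PR are not reproduced here; the linspace schedule below is illustrative only):

```python
import numpy as np

def gated_payload(v_raw, v_diff, beta):
    # beta near 1 -> mostly raw content (shallow layers),
    # beta near 0 -> mostly differential signal (deep layers)
    return beta * v_raw + (1.0 - beta) * v_diff

# hypothetical hardcoded depth schedule for a 6-layer model
betas = np.linspace(1.0, 0.1, 6)
print(betas)
```

Hardcoding β this way removes the gate as a trainable parameter, which is how the PR reports preventing gate collapse.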
Novel Contributions
- Introduction of Differential-Gated (DG) Attention where deep layers transmit novelty (difference) of token values relative to a causal running baseline instead of raw content.
- Asymmetric Designator (D_q/D_k) projections for matching tokens, distinct from standard QKV attention.
- Empirically discovered, hardcoded depth schedule for β gating that controls the per-layer mixture of raw content and differential signal, preventing gate collapse.
- Use of Flash Attention for efficient scaled dot-product attention in the DG mechanism.
- Demonstration that differential payload encoding yields a durable bits-per-byte (BPB) advantage after mid-training despite initially slower convergence.
- Distinction from Microsoft's Differential Transformer by differencing value payloads rather than attention score maps.
- Hybrid payload formulation combining raw content and differential signal with learned β gating per layer.
- Empirical analysis of β trajectories under different batch sizes and training conditions, motivating architectural hardcoding of depth schedule.
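Putting the pieces together, a minimal single-head, single-layer sketch of the mechanism described above. Everything here is an assumption beyond what the record states: the weight shapes, the running-mean baseline, the β value, and the explicit softmax (the PR uses Flash Attention via F.scaled_dot_product_attention instead).

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 4, 8
x = rng.normal(size=(T, d))

# asymmetric Designator projections (names D_q/D_k from the PR;
# square weight shapes are an assumption)
D_q = rng.normal(size=(d, d)) / np.sqrt(d)
D_k = rng.normal(size=(d, d)) / np.sqrt(d)
W_v = rng.normal(size=(d, d)) / np.sqrt(d)

q, k, v = x @ D_q, x @ D_k, x @ W_v

# differential payload: novelty vs. a causal running-mean baseline (assumed)
baseline = np.cumsum(v, axis=0) / np.arange(1, T + 1)[:, None]
beta = 0.2  # deep layer: mostly differential signal
payload = beta * v + (1.0 - beta) * (v - baseline)

# causal scaled dot-product attention over the Designator projections
scores = q @ k.T / np.sqrt(d)
scores[np.triu(np.ones((T, T), dtype=bool), k=1)] = -np.inf
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)
out = attn @ payload
print(out.shape)  # → (4, 8)
```

Note the contrast with Microsoft's Differential Transformer: here the *payload* (values) is differenced before attention is applied, while the attention score map itself stays standard.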