PR #345 (open)

Non-record: DART - Differential Attention Recurrent Transformer (Student submission, Kerala)

by anandks2006
val_bpb: 1.8522
Architecture: Differential Attention Recurrent Transformer
Optimizer: Muon
Artifact Size: 3.55 MB

Training Techniques

Architecture
depth recurrence
Shared-weight recurrent transformer block reused across multiple loops instead of stacking independent layers.
parameters: {"loops":4}
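A minimal sketch of what depth recurrence means here: one shared-weight block applied in a loop, rather than `loops` independently parameterized stacked layers. The function names are hypothetical, not from the submission's code.

```python
def recurrent_forward(block, h, loops=4):
    # Apply the SAME block (same weights) `loops` times.
    # A standard transformer would instead call `loops` distinct blocks,
    # each with its own parameters.
    for step in range(loops):
        h = block(h, step)  # `step` lets the block specialize per pass
    return h
```

With `loops=4` the parameter count is that of a single block, while the effective computation depth is four blocks.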
Differential Attention V2
Computes two attention maps and subtracts one from the other to suppress attention to irrelevant tokens.
parameters: null
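The core of differential attention, sketched on plain score vectors: two softmax maps are computed and the second is subtracted, scaled by a learned coefficient (here `lam`, a hypothetical name), so common-mode attention noise cancels.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def diff_attention_weights(scores1, scores2, lam=0.5):
    # Differential attention: subtract a second softmax map, scaled by
    # a learned lambda, so attention mass on irrelevant tokens cancels.
    a1 = softmax(scores1)
    a2 = softmax(scores2)
    return [p - lam * q for p, q in zip(a1, a2)]
```

In the real model the two score sets come from two halves of the query/key projections; this sketch only shows the subtraction step.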
low-rank Q delta
Per-loop low-rank query modifications to specialize each recurrent pass.
parameters: {"loops":4,"parameters":65536}
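A low-rank query delta can be sketched as a rank-r update added to each loop's queries, LoRA-style: per loop, two small factors A (d x r) and B (r x d) produce delta_q = (q @ A) @ B. Names and shapes here are illustrative assumptions.

```python
def lowrank_q_delta(q, A, B):
    # q: query vector of length d; A: d x r, B: r x d per-loop factors.
    # delta_q = (q @ A) @ B is a rank-r perturbation that specializes
    # the shared block's queries for this particular recurrent pass.
    d, r = len(q), len(A[0])
    qa = [sum(q[i] * A[i][j] for i in range(d)) for j in range(r)]   # q @ A
    dq = [sum(qa[j] * B[j][k] for j in range(r)) for k in range(d)]  # (q @ A) @ B
    return [q[k] + dq[k] for k in range(d)]
```

With 4 loops the listed budget of 65,536 extra parameters is split across four independent (A, B) pairs.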
resid_mix
Learned balance between current hidden state and original input to reduce drift across loops.
parameters: null
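The resid_mix idea reduces to a learned convex blend between the current hidden state and the block's original input; `alpha` stands in for a learned gate (e.g. the sigmoid of a scalar parameter), an assumption about the exact parameterization.

```python
def resid_mix(h, x0, alpha):
    # Blend current hidden state h with the original input x0.
    # alpha near 1 trusts the recurrent computation; alpha near 0
    # pulls the state back toward the input, limiting drift over loops.
    return [alpha * hi + (1.0 - alpha) * xi for hi, xi in zip(h, x0)]
```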
loop position embeddings
Adds a learned embedding indicating which recurrent pass the block is on.
parameters: null
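Mechanically this is just an additive learned vector per pass, so the shared block can condition on which loop it is running; a minimal sketch with hypothetical names:

```python
def add_loop_embedding(h, loop_embs, step):
    # loop_embs[step] is a learned vector identifying the current
    # recurrent pass; adding it lets shared weights behave differently
    # on loop 0 versus loop 3.
    return [hi + ei for hi, ei in zip(h, loop_embs[step])]
```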
U-Net skip connections
Early loop hidden states are saved and later loops receive them in reverse order.
parameters: null
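A sketch of the U-Net pattern over loops: the first half of the passes push hidden states onto a stack, and the second half pop them in reverse (last-in, first-out) as additive skip connections. Whether the submission saves states before or after the block is an assumption here.

```python
def unet_recurrence(block, h, loops=4):
    # Early loops save their hidden states; later loops receive them
    # in reverse order as additive skips, U-Net style.
    saved = []
    half = loops // 2
    for step in range(loops):
        if step < half:
            saved.append(list(h))          # encoder half: stash state
        else:
            skip = saved.pop()             # decoder half: reverse order
            h = [hi + si for hi, si in zip(h, skip)]
        h = block(h, step)
    return h
```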
memory tokens
Learned global tokens that carry information across loops like a shared notepad.
parameters: {"count":16}
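The notepad analogy maps to prepending 16 learned token vectors to the sequence before attention, then splitting them back off afterward so they persist into the next loop; helper names are illustrative.

```python
def with_memory_tokens(seq, memory):
    # Prepend shared memory tokens so attention in each loop can
    # read from and write to a global scratchpad.
    return memory + seq

def split_memory(seq_with_mem, n_mem):
    # Recover the (updated) memory tokens to carry into the next loop.
    return seq_with_mem[:n_mem], seq_with_mem[n_mem:]
```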
Quantization
QAT
bits: 8
scope: all
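Quantization-aware training typically means simulating the int8 grid in the forward pass while keeping float weights for the optimizer (gradients flow through via a straight-through estimator). A minimal fake-quantization sketch, with an assumed symmetric per-tensor scale:

```python
def fake_quant_int8(w, scale):
    # Round each weight to the nearest int8 level, clamp to [-128, 127],
    # then dequantize. Training against these rounded values keeps the
    # loss honest about the precision of the submitted artifact.
    q = [max(-128, min(127, round(x / scale))) for x in w]
    return [qi * scale for qi in q]
```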
Other
other
Deep supervision with loss computed after every recurrent loop, not only the final output.
parameters: null
Regularization
dropout
parameters: {"loop_dropout":true}
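Loop dropout can be read as randomly skipping entire recurrent passes during training, so the shared weights are not updated by every loop's gradient at once; this mechanism and the skip probability are assumptions for illustration.

```python
import random

def loop_dropout_forward(block, h, loops=4, p_skip=0.25, rng=None):
    # With probability p_skip, a pass becomes the identity for this
    # training step; at eval time p_skip would be 0.
    rng = rng or random.Random(0)
    for step in range(loops):
        if rng.random() < p_skip:
            continue  # skip this recurrent pass entirely
        h = block(h, step)
    return h
```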
Sequence Length
sequence_length
train_length: 256
eval_length: null
Compression
zlib
level: null
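The artifact size is measured after zlib compression, which the Python standard library exposes directly; int8 QAT weights tend to compress better than float32 because of their restricted value range. A minimal sketch:

```python
import zlib

def compressed_size(raw: bytes, level: int = 9) -> int:
    # Size of the zlib-compressed artifact; level 9 is an assumption,
    # the submission's actual level is not listed.
    return len(zlib.compress(raw, level))
```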

Novel Contributions

  • Shared-weight recurrent transformer design for repeated computation over the same block
  • Differential Attention V2 integration
  • Per-loop low-rank Q delta specialization
  • resid_mix mechanism to stabilize recurrent passes
  • Loop position embeddings
  • U-Net style skip connections across loops
  • Global memory tokens shared across loops
  • Deep supervision at every loop
  • Quantization-aware training matched to int8 submission quantization
  • Loop dropout discovered as a fix for shared-weight gradient conflict