PR #939

open

Non-record: GatedDeltaNet, 32K Context, Document-Boundary State Reset

by brian386
val_bpb
1.2519
Architecture
Transformer

Training Techniques

Architecture
GatedDeltaNet
Replaces softmax attention with linear recurrent attention (gated delta rule).
parameters: {"heads":4,"head_dim":128}
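The recurrence behind GatedDeltaNet can be sketched as a per-token state update. This is a minimal illustrative version, not the PR's fused kernel: the function name, the shape conventions, and the scalar `alpha`/`beta` gates are assumptions for the sketch (in practice the gates are per-token, learned, and the update runs as a chunked parallel kernel).

```python
import numpy as np

def gated_delta_net_step(S, q, k, v, alpha, beta):
    """One gated delta rule step (illustrative sketch).

    S: (d_v, d_k) recurrent state matrix; q, k: (d_k,); v: (d_v,).
    alpha in [0, 1]: decay gate; beta in [0, 1]: write strength.
    Update: S_t = alpha * S_{t-1} (I - beta k k^T) + beta v k^T.
    """
    # Decay the old state and erase the value previously bound to k,
    # then write the new key-value association as a rank-1 update.
    S = alpha * (S - beta * np.outer(S @ k, k)) + beta * np.outer(v, k)
    # Read out against the query.
    o = S @ q
    return S, o
```

Note that setting `alpha = 0` fully forgets the old state, which is the mechanism the document-boundary reset below relies on, only applied explicitly at BOS tokens rather than through the learned gate.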
Depth
Reduced model depth to fit within byte limit.
parameters: {"layers":7}
MLP
Reduced MLP expansion ratio to fit within byte limit.
parameters: {"expansion":1.875}
GatedDeltaNet
Document-boundary state reset: a variable-length chunked kernel detects BOS tokens and zeroes the recurrent state at each document boundary, so packed documents do not leak state into one another.
parameters: null
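The effect of the reset can be shown with a plain per-token loop. This is a sketch only: the PR implements the reset inside a variable-length chunked kernel, and `BOS_ID`, the function name, and the scalar gates here are hypothetical.

```python
import numpy as np

BOS_ID = 1  # hypothetical BOS token id

def scan_with_reset(token_ids, keys, values, queries, alpha, beta):
    """Gated delta rule scan that zeroes the recurrent state at BOS tokens.

    Illustrative per-token loop; the real implementation fuses this into
    a variable-length chunked kernel for throughput.
    """
    d_v, d_k = values.shape[1], keys.shape[1]
    S = np.zeros((d_v, d_k))
    outputs = []
    for t, tok in enumerate(token_ids):
        if tok == BOS_ID:
            # New document starts here: drop all accumulated state.
            S = np.zeros((d_v, d_k))
        k, v, q = keys[t], values[t], queries[t]
        S = alpha * (S - beta * np.outer(S @ k, k)) + beta * np.outer(v, k)
        outputs.append(S @ q)
    return np.stack(outputs)
```

A token at a BOS position therefore produces exactly the output it would produce as the first token of an isolated sequence, regardless of what was packed before it.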
Sequence Length
sequence_length
train_length: 32768
eval_length: 32768
sequence_length
train_length: 1024
eval_length: 1024
sequence_length
train_length: 8192
eval_length: 8192
sequence_length
train_length: 16384
eval_length: 16384
Regularization
Gradient clipping
parameters: {"norm":1}
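Global-norm clipping at norm 1, as listed above, can be sketched as follows. This is a generic numpy illustration of the technique, not the training code; the function name is made up for the sketch.

```python
import numpy as np

def clip_global_norm(grads, max_norm=1.0):
    """Rescale a list of gradient arrays so their joint L2 norm <= max_norm.

    Returns the (possibly rescaled) gradients and the pre-clip global norm,
    mirroring the usual clip-by-global-norm behavior.
    """
    total = np.sqrt(sum(float(np.sum(g * g)) for g in grads))
    # Small epsilon guards against division by zero when all grads vanish.
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads], total
```

Clipping the global norm (rather than each tensor separately) preserves the direction of the overall gradient, which matters when occasional large gradients arise from long recurrent chains.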

Novel Contributions

  • GatedDeltaNet recurrent attention for long-context training
  • Document-boundary state reset to prevent hidden-state bleed across packed documents
  • 32K-context training and evaluation with minimal per-step compute overhead
  • Gradient clipping to stabilize long recurrent chains
  • Architecture reductions to fit within the byte limit