PR #939

open

Non-record: GatedDeltaNet, 32K Context, Document-Boundary State Reset

by brian386
val_bpb
1.2519
Architecture
Transformer

Training Techniques

Architecture
GatedDeltaNet
Replaces softmax attention with linear recurrent attention (gated delta rule).
parameters: {"heads":4,"head_dim":128}
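The recurrence behind GatedDeltaNet can be sketched as a per-token state update. This is a minimal illustrative version, not the PR's fused kernel: the function name, the shape conventions, and the scalar `alpha`/`beta` gates are assumptions for the sketch (in practice the gates are per-token, learned, and the update runs as a chunked parallel kernel).

```python
import numpy as np

def gated_delta_net_step(S, q, k, v, alpha, beta):
    """One gated delta rule step (illustrative sketch).

    S: (d_v, d_k) recurrent state matrix; q, k: (d_k,); v: (d_v,).
    alpha in [0, 1]: decay gate; beta in [0, 1]: write strength.
    Update: S_t = alpha * S_{t-1} (I - beta k k^T) + beta v k^T.
    """
    # Decay the old state and erase the value previously bound to k,
    # then write the new key-value association as a rank-1 update.
    S = alpha * (S - beta * np.outer(S @ k, k)) + beta * np.outer(v, k)
    # Read out against the query.
    o = S @ q
    return S, o
```

Note that setting `alpha = 0` fully forgets the old state, which is the mechanism the document-boundary reset below relies on, only applied explicitly at BOS tokens rather than through the learned gate.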
Depth
Reduced model depth to fit within byte limit.
parameters: {"layers":7}
MLP
Reduced MLP expansion ratio to fit within byte limit.
parameters: {"expansion":1.875}
GatedDeltaNet
Document-boundary state reset: a variable-length chunked kernel detects BOS tokens and zeroes the recurrent state at each document boundary, so packed documents do not leak state into one another.
parameters: null
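The effect of the reset can be shown with a plain per-token loop. This is a sketch only: the PR implements the reset inside a variable-length chunked kernel, and `BOS_ID`, the function name, and the scalar gates here are hypothetical.

```python
import numpy as np

BOS_ID = 1  # hypothetical BOS token id

def scan_with_reset(token_ids, keys, values, queries, alpha, beta):
    """Gated delta rule scan that zeroes the recurrent state at BOS tokens.

    Illustrative per-token loop; the real implementation fuses this into
    a variable-length chunked kernel for throughput.
    """
    d_v, d_k = values.shape[1], keys.shape[1]
    S = np.zeros((d_v, d_k))
    outputs = []
    for t, tok in enumerate(token_ids):
        if tok == BOS_ID:
            # New document starts here: drop all accumulated state.
            S = np.zeros((d_v, d_k))
        k, v, q = keys[t], values[t], queries[t]
        S = alpha * (S - beta * np.outer(S @ k, k)) + beta * np.outer(v, k)
        outputs.append(S @ q)
    return np.stack(outputs)
```

A token at a BOS position therefore produces exactly the output it would produce as the first token of an isolated sequence, regardless of what was packed before it.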
Sequence Length
sequence_length
train_length: 32768
eval_length: 32768
sequence_length
train_length: 1024
eval_length: 1024
sequence_length
train_length: 8192
eval_length: 8192
sequence_length
train_length: 16384
eval_length: 16384
Regularization
Gradient clipping
parameters: {"norm":1}
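Global-norm clipping at norm 1, as listed above, can be sketched as follows. This is a generic numpy illustration of the technique, not the training code; the function name is made up for the sketch.

```python
import numpy as np

def clip_global_norm(grads, max_norm=1.0):
    """Rescale a list of gradient arrays so their joint L2 norm <= max_norm.

    Returns the (possibly rescaled) gradients and the pre-clip global norm,
    mirroring the usual clip-by-global-norm behavior.
    """
    total = np.sqrt(sum(float(np.sum(g * g)) for g in grads))
    # Small epsilon guards against division by zero when all grads vanish.
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads], total
```

Clipping the global norm (rather than each tensor separately) preserves the direction of the overall gradient, which matters when occasional large gradients arise from long recurrent chains.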

Novel Contributions

  • GatedDeltaNet recurrent attention for long-context training
  • Document-boundary state reset to prevent hidden-state bleed across packed documents
  • 32K-context training and evaluation with minimal per-step compute overhead
  • Gradient clipping to stabilize long recurrent chains
  • Architecture reductions to fit within the byte limit