PR #1671 (open)

Record: Gated Residual Scaling (Token-wise) for Attention + MLP - 1.3827 BPB

by souro26
val_bpb
1.3827
Architecture
Transformer

Architecture
Gated Attention
Applies token-wise sigmoid gates to attention residual updates using a learned linear projection dim→1.
Gated MLP
Applies token-wise sigmoid gates to MLP residual updates using a learned linear projection dim→1.
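
A minimal PyTorch sketch of the mechanism described above, assuming a standard pre-norm transformer block; module and variable names are illustrative, not the PR's actual code. Each residual update is multiplied by a per-token scalar in (0, 1), produced by a learned dim→1 linear projection followed by a sigmoid:

```python
import torch
import torch.nn as nn

class GatedResidualBlock(nn.Module):
    """Pre-norm transformer block whose attention and MLP residual
    updates are scaled token-wise by learned sigmoid gates (dim -> 1)."""

    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        # Token-wise gates: one scalar per token per sublayer.
        self.attn_gate = nn.Linear(dim, 1)
        self.mlp_gate = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        # Scale the attention residual update per token.
        x = x + torch.sigmoid(self.attn_gate(h)) * attn_out
        h = self.ln2(x)
        # Scale the MLP residual update per token.
        x = x + torch.sigmoid(self.mlp_gate(h)) * self.mlp(h)
        return x

x = torch.randn(2, 16, 64)            # (batch, tokens, dim)
y = GatedResidualBlock(64, 4)(x)
print(y.shape)                        # same shape as the input
```

The sigmoid output broadcasts from (batch, tokens, 1) against the (batch, tokens, dim) sublayer output, so each token's residual update is scaled by a single adaptive strength.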

Novel Contributions

  • Token-wise gated residual scaling for attention updates
  • Token-wise gated residual scaling for MLP updates
  • Adaptive per-token residual update strength via learned sigmoid gates
  • Improved validation BPB from 1.3902 to 1.3827