PR #1671
openRecord: Gated Residual Scaling (Token-wise) for Attention + MLP - 1.3827 BPB
by souro26
val_bpb
1.3827
Architecture
Transformer
Optimizer
—
Artifact Size
—
Training Techniques
Architecture
Gated Attention
Applies token-wise sigmoid gates to attention residual updates using a learned linear projection dim→1.
parameters: null
Gated MLP
Applies token-wise sigmoid gates to MLP residual updates using a learned linear projection dim→1.
parameters: null
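The two gating techniques above can be sketched in a single transformer block. This is a hypothetical minimal sketch in PyTorch, not the PR's actual code: `GatedResidualBlock`, the head count, and the MLP expansion factor are illustrative assumptions; only the core idea (a learned dim→1 linear projection, passed through a sigmoid, scaling each token's residual update for attention and MLP) comes from the record.

```python
import torch
import torch.nn as nn


class GatedResidualBlock(nn.Module):
    """Pre-norm transformer block with token-wise gated residual updates.

    Hypothetical sketch of the PR's technique: each residual branch is
    scaled by sigmoid(Linear(dim -> 1)) computed per token.
    """

    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.attn_gate = nn.Linear(dim, 1)  # learned dim -> 1 projection (attention gate)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.mlp_gate = nn.Linear(dim, 1)  # learned dim -> 1 projection (MLP gate)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Attention branch: gate has shape (B, T, 1), broadcast over channels,
        # so each token adaptively scales its own residual update.
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, need_weights=False)
        x = x + torch.sigmoid(self.attn_gate(h)) * a
        # MLP branch, gated the same way.
        h = self.ln2(x)
        x = x + torch.sigmoid(self.mlp_gate(h)) * self.mlp(h)
        return x


x = torch.randn(2, 16, 64)       # (batch, tokens, dim)
y = GatedResidualBlock(64, 4)(x)
print(y.shape)                   # torch.Size([2, 16, 64])
```

Each gate adds only `dim + 1` parameters per branch, so the per-token adaptivity comes at negligible parameter cost relative to the block itself.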
Novel Contributions
- Token-wise gated residual scaling for attention updates
- Token-wise gated residual scaling for MLP updates
- Adaptive per-token residual update strength via learned sigmoid gates
- Improved validation BPB from 1.3902 to 1.3827