PR #607 (open)
11L AttnRes + Gated Attention + Looped Blocks + EMA + Cosine + QAT
by NeopolitaView on GitHub
val_bpb
1.4750
Architecture
Transformer
Optimizer
—
Artifact Size
13.7MB
Training Techniques
Architecture
Block Attention Residuals
Replaces the fixed skip_weights vector with learned depth routing: softmax attention over all encoder-block outputs, using a learned pseudo-query per decoder layer
parameters: null
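A minimal sketch of what "learned depth routing" could look like, assuming a U-net-style stack where each decoder layer mixes the stored encoder-block outputs; the class and attribute names (`BlockAttnResidual`, `pseudo_query`, `key_proj`) are illustrative, not taken from the PR:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BlockAttnResidual(nn.Module):
    """One decoder layer's learned routing over all encoder-block outputs."""
    def __init__(self, d_model: int):
        super().__init__()
        # learned pseudo-query for this decoder layer (replaces one fixed skip weight row)
        self.pseudo_query = nn.Parameter(torch.randn(d_model) * d_model**-0.5)
        self.key_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, enc_outputs: list) -> torch.Tensor:
        stacked = torch.stack(enc_outputs)              # (L, B, T, D)
        keys = self.key_proj(stacked.mean(dim=(1, 2)))  # (L, D): one key per encoder block
        w = F.softmax(keys @ self.pseudo_query, dim=0)  # (L,): softmax depth-routing weights
        return torch.einsum("l,lbtd->btd", w, stacked)  # weighted residual input
```

The softmax over depth means the routing weights stay normalized, unlike a free-form learned skip_weights vector.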
Per-head gated attention
Learnable sigmoid gate per attention head to prevent attention-sink pathology
parameters: null
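The per-head gate admits a very small sketch, assuming the gate multiplies each head's attention output (the `HeadGate` name is illustrative):

```python
import torch
import torch.nn as nn

class HeadGate(nn.Module):
    """Learnable sigmoid gate per attention head."""
    def __init__(self, n_heads: int):
        super().__init__()
        self.gate = nn.Parameter(torch.zeros(n_heads))  # sigmoid(0) = 0.5 at init

    def forward(self, attn_out: torch.Tensor) -> torch.Tensor:
        # attn_out: (B, H, T, Dh); a gate near 0 lets a head switch itself off,
        # so it need not park attention mass on a "sink" token to stay quiet
        return attn_out * torch.sigmoid(self.gate).view(1, -1, 1, 1)
```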
Looped middle blocks
Layers 4-7 run twice per forward pass, adding compute depth without increasing parameters
parameters: {"layers":"4-7","repeat":2}
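The looping itself is a few lines in the forward pass; a sketch with generic callable blocks (the function name and signature are illustrative):

```python
def forward_with_loops(x, blocks, loop_layers=range(4, 8), repeat=2):
    # layers 4-7 (inclusive) are applied `repeat` times per forward pass,
    # adding effective depth while reusing the same weights
    for i, block in enumerate(blocks):
        for _ in range(repeat if i in loop_layers else 1):
            x = block(x)
    return x
```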
Weight Averaging
EMA
parameters: {"decay":0.995}
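The EMA update is the standard one; shown here on plain floats for clarity (a real implementation would iterate over `model.state_dict()` tensors):

```python
def ema_update(shadow, params, decay=0.995):
    # shadow <- decay * shadow + (1 - decay) * params, elementwise;
    # with decay 0.995, each step moves the average 0.5% toward current weights
    return [decay * s + (1 - decay) * p for s, p in zip(shadow, params)]
```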
LR Schedule
cosine decay
parameters: null
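Cosine decay in place of a linear warmdown is one formula; a sketch, with `min_lr` as an assumed optional floor since the PR does not state one:

```python
import math

def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    # cosine decay from base_lr at step 0 to min_lr at total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * step / total_steps))
```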
Quantization
QAT
bits: 8
scope: per-row
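A minimal sketch of the fake-quantization step such a QAT scheme would apply to weight matrices, assuming symmetric int8 with one scale per row (the function name is illustrative; in training, `round` would use a straight-through estimator so gradients pass unchanged):

```python
import numpy as np

def fake_quant_per_row(w: np.ndarray) -> np.ndarray:
    # one symmetric int8 scale per weight row, quantize then dequantize
    scale = np.maximum(np.abs(w).max(axis=1, keepdims=True), 1e-8) / 127.0
    q = np.clip(np.round(w / scale), -127, 127)
    return q * scale
```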
Novel Contributions
- Block Attention Residuals replacing fixed skip_weights with learned depth routing
- Per-head gated attention to prevent attention-sink pathology
- Looped middle blocks (layers 4-7 run twice) for zero-param compute depth
- EMA weight averaging with decay 0.995
- Cosine learning rate decay replacing linear warmdown
- Quantization Aware Training (QAT) simulating int8 per-row quantization in last 15% of training