PR #1249 (open)
Non-record: GQA + LZMA + SLOT eval optimization (val_bpb=1.1240)
by ibarrajo
val_bpb: 1.1240
Architecture: Transformer
Optimizer: —
Artifact Size: 14.0 MB
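For reference, bits per byte (bpb) is the validation cross-entropy expressed in bits and normalized by the raw byte count of the validation data. A minimal sketch of the standard conversion; the function name and the example numbers are illustrative, not taken from this PR:

```python
import math

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed validation NLL (in nats) to bits per byte."""
    return total_nll_nats / (math.log(2) * total_bytes)

# Illustrative only: a summed NLL of 779,000 nats over a 1,000,000-byte
# validation set gives 779000 / (0.6931 * 1e6) ≈ 1.124 bpb.
print(bits_per_byte(779_000, 1_000_000))
```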
Training Techniques
Architecture: GQA
Grouped query attention used as the base attention architecture.
parameters: null
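For readers unfamiliar with the layout, here is a minimal grouped-query attention sketch in PyTorch. Since `parameters: null` leaves this PR's configuration unspecified, the model width and head counts below are illustrative assumptions:

```python
import torch
import torch.nn.functional as F
from torch import nn

class GQA(nn.Module):
    """Minimal grouped-query attention: n_q query heads share n_kv K/V heads."""
    def __init__(self, d_model: int = 512, n_q: int = 8, n_kv: int = 2):
        super().__init__()
        assert n_q % n_kv == 0
        self.n_q, self.n_kv, self.hd = n_q, n_kv, d_model // n_q
        self.wq = nn.Linear(d_model, n_q * self.hd, bias=False)
        self.wk = nn.Linear(d_model, n_kv * self.hd, bias=False)
        self.wv = nn.Linear(d_model, n_kv * self.hd, bias=False)
        self.wo = nn.Linear(n_q * self.hd, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, _ = x.shape
        q = self.wq(x).view(B, T, self.n_q, self.hd).transpose(1, 2)
        k = self.wk(x).view(B, T, self.n_kv, self.hd).transpose(1, 2)
        v = self.wv(x).view(B, T, self.n_kv, self.hd).transpose(1, 2)
        # Each group of n_q // n_kv query heads attends to one shared K/V head.
        k = k.repeat_interleave(self.n_q // self.n_kv, dim=1)
        v = v.repeat_interleave(self.n_q // self.n_kv, dim=1)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.wo(y.transpose(1, 2).reshape(B, T, self.n_q * self.hd))
```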
Compression: lzma
level: null
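The artifact size suggests the checkpoint is packed with LZMA. A minimal sketch using Python's standard lzma module; the file names are placeholders, and `level: null` presumably means the library's default preset:

```python
import lzma

# Compress a serialized checkpoint; preset=None falls back to the
# library default (preset 6), consistent with "level: null".
with open("model.bin", "rb") as src, \
        lzma.open("model.bin.xz", "wb", preset=None) as dst:
    dst.write(src.read())

# Decompress before evaluation.
with lzma.open("model.bin.xz", "rb") as src:
    weights = src.read()
```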
Evaluation: SLOT
parameters: null
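SLOT-style evaluation-time optimization fits a small per-sample delta on the model's final hidden state by minimizing next-token loss on the sample itself, then scores with that delta in place. A hedged sketch of the idea; the hook point, step count, and learning rate are assumptions, not this PR's settings:

```python
import torch
import torch.nn.functional as F

def slot_eval(model, input_ids: torch.Tensor, steps: int = 3, lr: float = 1e-2):
    """SLOT-style eval: fit a per-sample delta on the final hidden state.

    Assumes `model(input_ids, delta=...)` adds `delta` to the last hidden
    state before the LM head; that hook and the hyperparameters are
    illustrative, not taken from this PR.
    """
    for p in model.parameters():  # base weights stay frozen
        p.requires_grad_(False)
    delta = torch.zeros(1, 1, model.config.hidden_size, requires_grad=True)
    opt = torch.optim.AdamW([delta], lr=lr)
    for _ in range(steps):  # a few optimization steps on the sample itself
        logits = model(input_ids, delta=delta)
        loss = F.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),
            input_ids[:, 1:].reshape(-1),
        )
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():  # score with the optimized delta
        return model(input_ids, delta=delta)
```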
Novel Contributions
- Uses a GQA + LZMA stack whose base model already reaches a strong BPB
- Applies SLOT delta optimization at evaluation time to improve validation BPB
- Improves validation BPB by 0.009 over the base model at evaluation time (≈1.133 → 1.1240)
- Demonstrates SLOT as an alternative to test-time training (TTT) when attention capacity is limited