PR #1249

open

Non-record: GQA + LZMA + SLOT eval optimization (val_bpb=1.1240)

by ibarrajo
val_bpb: 1.1240
Architecture: Transformer
Optimizer: (not specified)
Artifact Size: 14.0 MB

Training Techniques

Architecture: GQA
Grouped-query attention used as the base attention architecture.
parameters: null
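For reference, a minimal grouped-query attention block in PyTorch. The head counts, dimensions, and module names below are illustrative only and not taken from this PR (its GQA parameters are reported as null):

```python
import torch
import torch.nn.functional as F
from torch import nn

class GroupedQueryAttention(nn.Module):
    """Grouped-query attention: n_head query heads share n_kv_head K/V heads."""

    def __init__(self, d_model: int, n_head: int, n_kv_head: int):
        super().__init__()
        assert n_head % n_kv_head == 0 and d_model % n_head == 0
        self.n_head, self.n_kv_head = n_head, n_kv_head
        self.head_dim = d_model // n_head
        self.q_proj = nn.Linear(d_model, n_head * self.head_dim, bias=False)
        self.k_proj = nn.Linear(d_model, n_kv_head * self.head_dim, bias=False)
        self.v_proj = nn.Linear(d_model, n_kv_head * self.head_dim, bias=False)
        self.o_proj = nn.Linear(n_head * self.head_dim, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, _ = x.shape
        q = self.q_proj(x).view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(B, T, self.n_kv_head, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(B, T, self.n_kv_head, self.head_dim).transpose(1, 2)
        # Expand each K/V head across its group of query heads.
        g = self.n_head // self.n_kv_head
        k, v = k.repeat_interleave(g, dim=1), v.repeat_interleave(g, dim=1)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(y.transpose(1, 2).reshape(B, T, -1))

# e.g. 8 query heads sharing 2 K/V heads (illustrative sizes)
attn = GroupedQueryAttention(d_model=512, n_head=8, n_kv_head=2)
out = attn(torch.randn(1, 16, 512))  # (1, 16, 512)
```

The only difference from standard multi-head attention is that n_kv_head < n_head, so the K/V projections (and any KV cache) shrink by the group factor.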
Compression: lzma
level: null
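The compression level is reported as null, so the preset below is a placeholder. A minimal sketch of LZMA-compressing a serialized checkpoint with Python's standard-library lzma, not the PR's actual packaging code:

```python
import io
import lzma

import torch

def save_compressed(model: torch.nn.Module, path: str, preset: int = 9) -> None:
    """Serialize the state dict, then LZMA-compress it (preset is a guess)."""
    buf = io.BytesIO()
    torch.save(model.state_dict(), buf)
    with lzma.open(path, "wb", preset=preset) as f:
        f.write(buf.getvalue())

def load_compressed(path: str) -> dict:
    """Decompress and deserialize the state dict."""
    with lzma.open(path, "rb") as f:
        return torch.load(io.BytesIO(f.read()))
```

Higher presets trade compression time for a smaller artifact; whether the 14.0 MB artifact uses this kind of pipeline is not specified in the PR.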
Evaluation: SLOT
parameters: null
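SLOT's parameters are likewise reported as null, so the sketch below fills in one plausible reading of eval-time delta optimization: fit a small sample-specific delta on the hidden states using the sample's own prefix, then score the continuation with the delta frozen. The backbone/lm_head split, prefix length, step count, and learning rate are all hypothetical:

```python
import math

import torch
import torch.nn.functional as F

def slot_bpb(model, token_ids, n_bytes, split=256, steps=3, lr=1e-2):
    """SLOT-style eval: fit a per-sample hidden-state delta on a prompt
    prefix, then score the remaining tokens with the delta applied.
    `model.backbone` (ids -> final hidden states) and `model.lm_head`
    (hidden states -> logits) are hypothetical names."""
    model.eval()
    # Freeze model weights; only the per-sample delta is optimized.
    for p in model.parameters():
        p.requires_grad_(False)

    prefix = token_ids[:, :split]
    with torch.no_grad():
        h = model.backbone(prefix)                 # (1, split, d_model)
    delta = torch.zeros(1, 1, h.size(-1), requires_grad=True)
    opt = torch.optim.AdamW([delta], lr=lr)

    # A few steps of next-token cross-entropy on the prefix itself.
    for _ in range(steps):
        logits = model.lm_head(h + delta)
        loss = F.cross_entropy(logits[:, :-1].flatten(0, 1),
                               prefix[:, 1:].flatten())
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Score the held-out continuation with the delta frozen.
    with torch.no_grad():
        logits = model.lm_head(model.backbone(token_ids) + delta)
        nll = F.cross_entropy(logits[:, split - 1:-1].flatten(0, 1),
                              token_ids[:, split:].flatten(),
                              reduction="sum")
    # Summed nats over the scored span -> bits per byte.
    return nll.item() / (math.log(2) * n_bytes)
```

Because only the delta is trained, the per-sample cost is a few forward/backward passes, which is what would make this viable as a pure evaluation-time adjustment.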

Novel Contributions

  • Combines a GQA attention architecture with LZMA artifact compression, starting from a strong base-model BPB
  • Applies SLOT delta optimization at evaluation time to improve validation BPB
  • Achieves a 0.009 BPB improvement at evaluation time over the base model
  • Demonstrates SLOT as an alternative to test-time training (TTT) when attention capacity is limited