PR #1248

open

Non-record: GQA + LZMA + Selective Pruning (val_bpb=1.1264)

by ibarrajo
val_bpb
1.1264
Architecture
Transformer
Optimizer
Adam
Artifact Size
14.1 MB

Training Techniques

Architecture
GQA
Grouped query attention with reduced KV heads
parameters: {"kv_heads":4}
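A minimal sketch of grouped-query attention as configured here: several query heads share each KV head (4 KV heads per the parameters above; the query-head count of 8 and head dim are illustrative, not from the PR).

```python
import numpy as np

def gqa_attention(q, k, v, n_q_heads=8, n_kv_heads=4):
    """Grouped-query attention sketch.

    q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d).
    Each group of n_q_heads // n_kv_heads query heads shares one KV head,
    shrinking the KV cache relative to full multi-head attention.
    """
    group = n_q_heads // n_kv_heads
    d = q.shape[-1]
    out = np.empty_like(q)
    for h in range(n_q_heads):
        kv = h // group  # index of the KV head shared by this query group
        scores = q[h] @ k[kv].T / np.sqrt(d)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)  # softmax over keys
        out[h] = w @ v[kv]
    return out
```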
BigramHash
BigramHash embedding with 3072x112 configuration
parameters: {"dimensions":3072,"embedding_dim":112}
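A hedged sketch of what a 3072x112 BigramHash embedding could look like: each adjacent token pair is hashed into one of 3072 buckets, each mapped to a 112-dim vector. The hash mixing constant is illustrative, not from the PR.

```python
import numpy as np

N_BUCKETS, EMB_DIM = 3072, 112  # from {"dimensions":3072,"embedding_dim":112}

def bigram_hash_embed(tokens, table):
    """Look up an embedding per position keyed by the (prev, current)
    token bigram, hashed into N_BUCKETS rows of `table`.
    The multiplier 1000003 is an illustrative mixing constant."""
    out = np.zeros((len(tokens), EMB_DIM))
    prev = 0  # assumed padding token before the first position
    for i, t in enumerate(tokens):
        h = (prev * 1000003 + t) % N_BUCKETS
        out[i] = table[h]
        prev = t
    return out
```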
Compression
lzma
level: null
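A minimal sketch of LZMA artifact packing with Python's stdlib `lzma`; `level: null` above is read as leaving the preset at the library default. Serialization via `pickle` is an assumption.

```python
import lzma
import pickle

def pack_artifact(weights, path):
    """Serialize and LZMA-compress a weights dict.
    No preset is passed, matching the PR's `level: null`."""
    with lzma.open(path, "wb") as f:
        pickle.dump(weights, f)

def unpack_artifact(path):
    """Decompress and deserialize a packed artifact."""
    with lzma.open(path, "rb") as f:
        return pickle.load(f)
```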
Regularization
magnitude pruning
parameters: {"selective":true}
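A sketch of the `selective: true` behaviour described in the contributions below: magnitude pruning fires only when the raw artifact exceeds a byte budget. The budget constant and sparsity fraction are illustrative.

```python
import numpy as np

BUDGET_BYTES = 14_100_000  # illustrative; the PR's artifact is 14.1 MB

def selective_prune(weights, budget=BUDGET_BYTES, sparsity=0.1):
    """Zero the smallest-magnitude fraction of each tensor, but only
    if the uncompressed artifact exceeds `budget` bytes."""
    size = sum(w.nbytes for w in weights.values())
    if size <= budget:
        return weights  # under budget: leave weights untouched
    pruned = {}
    for name, w in weights.items():
        thresh = np.quantile(np.abs(w), sparsity)
        pruned[name] = np.where(np.abs(w) < thresh, 0.0, w)
    return pruned
```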
Optimizer
Adam
weight_decay: null
momentum: null
other_params: {"fused":true}
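For reference, one unfused Adam update written out in NumPy; the PR's `fused: true` refers to running this same update in a single fused kernel (e.g. `torch.optim.Adam(..., fused=True)`). Hyperparameter defaults here are illustrative, as the PR lists `weight_decay` and `momentum` as null.

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=3e-4, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: biased first/second moment EMAs, bias
    correction by step t (1-indexed), then the scaled step."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```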
Other
other
ramfs data staging
parameters: null
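A minimal sketch of ramfs data staging: copy the training shards into a RAM-backed filesystem so the input pipeline reads from memory rather than disk. The `/dev/shm` path and helper name are assumptions; the PR gives no parameters.

```python
import os
import shutil

def stage_to_ramfs(src, ram_dir="/dev/shm/train_data"):
    """Copy a data file into a RAM-backed directory and return the
    staged path (ram_dir is illustrative)."""
    os.makedirs(ram_dir, exist_ok=True)
    dst = os.path.join(ram_dir, os.path.basename(src))
    shutil.copy2(src, dst)
    return dst
```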
Test-Time Training
score-first TTT
parameters: null
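The PR gives no parameters for score-first TTT, so this is only one plausible reading: score the sample with the unadapted weights first, then adapt and keep whichever weights score better. Both `score_fn` and `update_fn` are left abstract.

```python
def score_first_ttt(score_fn, update_fn, w, x):
    """Hedged sketch of a 'score-first' test-time-training step
    (lower score is better, e.g. bits per byte)."""
    base = score_fn(w, x)        # score before any adaptation
    w_adapted = update_fn(w, x)  # one TTT update on the test sample
    adapted = score_fn(w_adapted, x)
    if base <= adapted:
        return w, base           # adaptation did not help: keep baseline
    return w_adapted, adapted
```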

Novel Contributions

  • GQA with 4 KV heads for faster steps
  • LZMA compression to improve artifact packing efficiency
  • Selective pruning that activates only when the artifact exceeds budget
  • BigramHash embedding at 3072 dimensions x 112 embedding dim
  • Score-first TTT, improving val_bpb to 1.1264
  • Adoption of fused Adam and ramfs data staging from merged SOTA #1019