PR #1250

open

Non-record: Full Attention + LZMA + small BigramHash (val_bpb=1.2094)

by ibarrajo
val_bpb: 1.2094
Architecture: Transformer
Optimizer:
Artifact Size: 12.3 MB

Training Techniques

Architecture (KV head count): uses full attention with 8 KV heads instead of GQA.
parameters: {"kv_heads": 8}
BigramHash: uses a smaller BigramHash embedding table.
parameters: {"dimensions": "3072x112"}
Compression: lzma (level: null)
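What LZMA compression of the trained artifact might look like using Python's standard library. Serializing a torch state_dict and the file name are assumptions; passing no preset corresponds to the unspecified level above.

```python
import io
import lzma
import torch

def save_artifact(model, path="artifact.pt.xz"):
    """Serialize the weights, then LZMA-compress the byte stream.
    Omitting the preset uses lzma's default, matching 'level: null'."""
    buf = io.BytesIO()
    torch.save(model.state_dict(), buf)
    with lzma.open(path, "wb") as f:
        f.write(buf.getvalue())

def load_artifact(model, path="artifact.pt.xz"):
    with lzma.open(path, "rb") as f:
        model.load_state_dict(torch.load(io.BytesIO(f.read())))
```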
Test-Time Training (TTT)
parameters: null
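The PR records TTT with no parameters, so the following is only a generic sketch of test-time training: adapt a copy of the model on each evaluation document's prefix with a few gradient steps, then score the held-out suffix. The step count, learning rate, optimizer, and model call signature are all assumptions.

```python
import copy
import math
import torch
import torch.nn.functional as F

def ttt_score(model, tokens, n_steps=4, lr=1e-4):
    """Generic test-time training: fine-tune on the document's first half,
    then measure loss on the second half. Every hyperparameter here is
    illustrative; the PR lists parameters: null."""
    m = copy.deepcopy(model)                    # keep the trained artifact intact
    opt = torch.optim.SGD(m.parameters(), lr=lr)
    half = tokens.size(1) // 2
    ctx, tail = tokens[:, :half], tokens[:, half:]

    def nll(seq):                               # next-token cross-entropy (assumed loss)
        logits = m(seq[:, :-1])                 # assumed signature: (B, T) -> (B, T, vocab)
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               seq[:, 1:].reshape(-1))

    for _ in range(n_steps):                    # adapt on the prefix
        loss = nll(ctx)
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():                       # score the held-out suffix
        bits_per_token = nll(tail) / math.log(2)
    return bits_per_token.item()                # val_bpb further normalizes by bytes/token
```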

Novel Contributions

  • Full attention with 8 KV heads
  • LZMA artifact compression
  • Smaller BigramHash table (3072x112)
  • Negative result showing that reducing BigramHash size significantly hurts quality
  • Negative result showing that full attention trains too slowly to complete enough steps within the time budget