PR #1250
Non-record: Full Attention + LZMA + small BigramHash (val_bpb=1.2094)
Status: open · by ibarrajo
val_bpb: 1.2094
Architecture: Transformer
Optimizer: —
Artifact Size: 12.3 MB
Training Techniques
Architecture
KV head count
Uses full multi-head attention with 8 KV heads instead of grouped-query attention (GQA).
parameters: {"kv_heads":8}
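A minimal PyTorch sketch of the KV-head knob: when n_kv_head equals n_head the block is full multi-head attention (the configuration in this PR), while a smaller n_kv_head yields GQA. The module name, dimensions, and projection layout are illustrative assumptions, not the PR's actual code.

```python
import torch
import torch.nn.functional as F
from torch import nn

class CausalSelfAttention(nn.Module):
    """Hypothetical GPT-style block: full attention when n_kv_head == n_head,
    GQA when n_kv_head < n_head (KV heads shared across query groups)."""
    def __init__(self, n_embd: int = 768, n_head: int = 8, n_kv_head: int = 8):
        super().__init__()
        assert n_head % n_kv_head == 0
        self.n_head, self.n_kv_head = n_head, n_kv_head
        self.head_dim = n_embd // n_head
        self.q_proj = nn.Linear(n_embd, n_head * self.head_dim, bias=False)
        self.kv_proj = nn.Linear(n_embd, 2 * n_kv_head * self.head_dim, bias=False)
        self.out_proj = nn.Linear(n_embd, n_embd, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, _ = x.shape
        q = self.q_proj(x).view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        k, v = self.kv_proj(x).view(B, T, 2, self.n_kv_head, self.head_dim).unbind(dim=2)
        k, v = k.transpose(1, 2), v.transpose(1, 2)
        if self.n_kv_head < self.n_head:
            # GQA path: repeat each KV head across its group of query heads
            rep = self.n_head // self.n_kv_head
            k, v = k.repeat_interleave(rep, dim=1), v.repeat_interleave(rep, dim=1)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out_proj(y.transpose(1, 2).reshape(B, T, -1))

# kv_heads=8 matches the query head count here, i.e. full attention as in this PR
attn = CausalSelfAttention(n_embd=768, n_head=8, n_kv_head=8)
out = attn(torch.randn(2, 16, 768))  # -> (2, 16, 768)
```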
BigramHash
Uses a smaller BigramHash embedding table.
parameters: {"dimensions":"3072x112"}
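A sketch of what a hashed bigram embedding could look like: hash each (previous byte, current byte) pair into a fixed bucket table and look up a learned vector. Reading 3072x112 as 3072 buckets by 112 embedding dimensions is an assumption, as are the hash constants.

```python
import torch
from torch import nn

class BigramHashEmbedding(nn.Module):
    """Hypothetical sketch: hash byte bigrams into a small learned table."""
    def __init__(self, n_buckets: int = 3072, dim: int = 112):
        super().__init__()
        self.n_buckets = n_buckets
        self.table = nn.Embedding(n_buckets, dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, T) byte ids; pair each position with its predecessor
        prev = torch.cat([torch.zeros_like(tokens[:, :1]), tokens[:, :-1]], dim=1)
        # simple multiplicative hash of the bigram into the bucket range
        h = (prev * 257 + tokens) * 2654435761 % self.n_buckets
        return self.table(h)

emb = BigramHashEmbedding()
x = torch.randint(0, 256, (2, 16))
print(emb(x).shape)  # torch.Size([2, 16, 112])
```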
Compression
lzma
level: null
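Since no level is recorded, the compression step presumably relies on defaults; a minimal sketch using Python's stdlib lzma module, with an illustrative preset and hypothetical file names:

```python
import lzma

def compress_artifact(src: str = "model.bin", dst: str = "model.bin.xz") -> int:
    """Compress the serialized model with LZMA and return the packed size in bytes."""
    with open(src, "rb") as f:
        raw = f.read()
    packed = lzma.compress(raw, preset=9 | lzma.PRESET_EXTREME)
    with open(dst, "wb") as f:
        f.write(packed)
    return len(packed)

# size = compress_artifact()  # packed size feeds the Artifact Size metric above
```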
Test-Time Training
TTT
parameters: null
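No TTT parameters are recorded, so the following is only a generic sketch of test-time training for a byte-level model: take a gradient step on each chunk of the evaluation stream after scoring it, so the reported bits-per-byte reflects online adaptation. The optimizer, learning rate, chunk size, and model interface are all assumptions.

```python
import math
import torch
from torch import nn

def ttt_bpb(model: nn.Module, stream: torch.Tensor, chunk: int = 512, lr: float = 1e-4) -> float:
    """Score `stream` (1-D tensor of byte ids) in chunks, taking one gradient
    step per chunk so the model keeps adapting to data it has already scored."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    total_nll, total_bytes = 0.0, 0
    for i in range(0, stream.numel() - 1, chunk):
        y = stream[i + 1 : i + 1 + chunk]          # next-byte targets
        x = stream[i : i + y.numel()]              # matching inputs
        logits = model(x.unsqueeze(0)).squeeze(0)  # assumed (T, 256) output
        loss = nn.functional.cross_entropy(logits, y)
        total_nll += loss.item() * y.numel()
        total_bytes += y.numel()
        opt.zero_grad()
        loss.backward()
        opt.step()                                 # adapt only after scoring
    return total_nll / total_bytes / math.log(2)   # nats/byte -> bits/byte
```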
Novel Contributions
- Full attention with 8 KV heads
- LZMA artifact compression
- Smaller BigramHash table (3072x112)
- Negative result: shrinking the BigramHash table significantly hurts quality
- Negative result: full attention is too slow to fit enough training steps within the time budget