val_bpb: 1.1264
Architecture: Transformer
Optimizer: Adam
Artifact Size: 14.1 MB
Training Techniques
Architecture: GQA
Grouped-query attention with a reduced number of KV heads.
parameters: {"kv_heads": 4}
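A minimal sketch of grouped-query attention with 4 KV heads; the class and argument names (`GQAttention`, `n_head`, `kv_heads`) are illustrative, not the run's actual code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GQAttention(nn.Module):
    def __init__(self, dim: int, n_head: int, kv_heads: int = 4):
        super().__init__()
        assert n_head % kv_heads == 0
        self.n_head, self.kv_heads = n_head, kv_heads
        self.head_dim = dim // n_head
        self.wq = nn.Linear(dim, n_head * self.head_dim, bias=False)
        # K/V projections are shared across groups of query heads,
        # so they are kv_heads wide instead of n_head wide.
        self.wk = nn.Linear(dim, kv_heads * self.head_dim, bias=False)
        self.wv = nn.Linear(dim, kv_heads * self.head_dim, bias=False)
        self.wo = nn.Linear(n_head * self.head_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, _ = x.shape
        q = self.wq(x).view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        k = self.wk(x).view(B, T, self.kv_heads, self.head_dim).transpose(1, 2)
        v = self.wv(x).view(B, T, self.kv_heads, self.head_dim).transpose(1, 2)
        # Expand each KV head to serve n_head // kv_heads query heads.
        rep = self.n_head // self.kv_heads
        k = k.repeat_interleave(rep, dim=1)
        v = v.repeat_interleave(rep, dim=1)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.wo(y.transpose(1, 2).reshape(B, T, -1))
```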
BigramHash: BigramHash embedding
BigramHash embedding with a 3072x112 configuration.
parameters: {"dimensions": 3072, "embedding_dim": 112}
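A hedged sketch of a hashed-bigram embedding, assuming "dimensions" is the number of hash buckets and "embedding_dim" the per-bucket vector width; the hash constant and class name are illustrative:

```python
import torch
import torch.nn as nn

class BigramHashEmbedding(nn.Module):
    def __init__(self, n_buckets: int = 3072, embed_dim: int = 112):
        super().__init__()
        self.n_buckets = n_buckets
        self.table = nn.Embedding(n_buckets, embed_dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, T) integer ids; pair each token with its predecessor.
        prev = torch.roll(tokens, shifts=1, dims=1)
        prev[:, 0] = 0  # no predecessor at the first position
        # Cheap multiplicative hash of each (prev, cur) bigram into a bucket.
        h = (prev * 1000003 + tokens) % self.n_buckets
        return self.table(h)  # (B, T, embed_dim)
```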
Compression: lzma
level: null
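A minimal sketch of LZMA-packing an artifact with Python's standard lzma module, using the default preset since the recorded level is null; the file name and state_dict are placeholders:

```python
import io
import lzma
import torch

state = {"w": torch.zeros(4, 4)}  # stand-in for a real state_dict

# Serialize to bytes, then compress (default preset, as level is null).
buf = io.BytesIO()
torch.save(state, buf)
with open("artifact.pt.xz", "wb") as f:
    f.write(lzma.compress(buf.getvalue()))

# Unpacking reverses the two steps.
with open("artifact.pt.xz", "rb") as f:
    restored = torch.load(io.BytesIO(lzma.decompress(f.read())))
```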
Regularization: magnitude pruning
Magnitude pruning applied selectively, only when the artifact exceeds the size budget.
parameters: {"selective": true}
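A hedged sketch of the selective behavior: pruning fires only when the packed artifact exceeds a size budget. The budget, per-call sparsity, and helper names are assumptions, not the run's recorded settings:

```python
import io
import lzma
import torch
import torch.nn as nn

def artifact_bytes(model: nn.Module) -> int:
    # Measure the artifact as it would ship: serialized, then LZMA-packed.
    buf = io.BytesIO()
    torch.save(model.state_dict(), buf)
    return len(lzma.compress(buf.getvalue()))

def prune_if_over_budget(model: nn.Module, budget: int, sparsity: float = 0.1):
    if artifact_bytes(model) <= budget:
        return  # selective: leave weights untouched when under budget
    for p in model.parameters():
        if p.dim() < 2:
            continue  # skip biases / norm parameters
        k = int(p.numel() * sparsity)
        if k == 0:
            continue
        # Zero the k smallest-magnitude entries; zeros compress well.
        thresh = p.abs().flatten().kthvalue(k).values
        with torch.no_grad():
            p.masked_fill_(p.abs() <= thresh, 0.0)
```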
Optimizer: Adam
Adam with the fused CUDA implementation enabled.
weight_decay: null
momentum: null
other_params: {"fused": true}
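A minimal sketch of constructing the optimizer as recorded ({"fused": true}); the model and learning rate are placeholders, since the run records weight_decay and momentum as null:

```python
import torch

model = torch.nn.Linear(16, 16).cuda()  # fused=True requires CUDA tensors
opt = torch.optim.Adam(model.parameters(), lr=3e-4, fused=True)
```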
Other: ramfs data staging
Staging training data on a RAM-backed filesystem.
parameters: null
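A hedged sketch of RAM-backed data staging; /dev/shm is a tmpfs mount rather than a literal ramfs, and both paths are illustrative:

```python
import shutil
from pathlib import Path

src = Path("data/shards")       # assumed on-disk dataset location
dst = Path("/dev/shm/shards")   # RAM-backed staging area

dst.mkdir(parents=True, exist_ok=True)
for shard in src.glob("*.bin"):
    target = dst / shard.name
    if not target.exists():
        shutil.copy2(shard, target)  # later reads hit RAM, not disk
```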
Test-Time Training: score-first TTT
parameters: null
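One plausible reading of "score-first" TTT, sketched under that assumption: each evaluation chunk is scored with the current weights before the model takes a gradient step on it, so no token is trained on before it is predicted. The loop structure, optimizer choice, and learning rate are all illustrative:

```python
import torch
import torch.nn.functional as F

def score_first_ttt(model, chunks, lr=1e-4):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    total_loss, total_tokens = 0.0, 0
    for inputs, targets in chunks:  # (B, T) int tensors per chunk
        # 1) Score with the current weights (no grad needed for the tally).
        with torch.no_grad():
            logits = model(inputs)
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)),
                                   targets.view(-1))
        total_loss += loss.item() * targets.numel()
        total_tokens += targets.numel()
        # 2) Only then adapt on the same chunk, for the chunks that follow.
        opt.zero_grad()
        logits = model(inputs)
        F.cross_entropy(logits.view(-1, logits.size(-1)),
                        targets.view(-1)).backward()
        opt.step()
    # Mean loss in nats/token; convert via ln(2) and bytes/token for BPB.
    return total_loss / total_tokens
```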
Novel Contributions
- GQA with 4 KV heads, shrinking the K/V projections for faster steps
- LZMA compression to improve artifact packing efficiency
- Selective magnitude pruning that activates only when the artifact exceeds the size budget
- BigramHash embedding in a 3072x112 configuration
- Score-first TTT, improving validation to 1.1264 BPB
- Fused Adam and ramfs data staging, adopted from merged SOTA #1019