val_bpb: 1.2058
Architecture: Transformer
Optimizer: —
Artifact Size: 15867270 bytes
Training Techniques
Architecture
- Tied embeddings: the input and output embedding matrices share weights. Parameters: null.
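Weight tying can be sketched framework-agnostically: one matrix serves as both the input embedding table and the output (logit) projection, halving the embedding parameter count. The matrix name and dimensions below are illustrative, not taken from the submission:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model = 256, 32

# One shared matrix: rows are input embeddings, and its transpose
# is the output projection that produces logits over the vocabulary.
W = rng.normal(size=(vocab, d_model)).astype(np.float32)

def embed(token_ids):
    # Lookup: (T,) -> (T, d_model)
    return W[token_ids]

def logits(hidden):
    # Projection with the same (tied) matrix: (T, d_model) -> (T, vocab)
    return hidden @ W.T

tokens = np.array([3, 17, 42])
h = embed(tokens)
out = logits(h)
```

In a framework such as PyTorch the same effect is typically achieved by assigning the embedding weight tensor to the output linear layer's weight.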
- KV head count: uses fewer key/value heads than attention heads (grouped-query attention). Parameters: {"num_heads": 8, "num_kv_heads": 4}.
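A minimal sketch of attention with fewer KV heads than query heads, using the listed parameters (8 query heads sharing 4 KV heads); the sequence length and head dimension are illustrative, and this omits masking and batching:

```python
import numpy as np

rng = np.random.default_rng(0)
T, num_heads, num_kv_heads, head_dim = 4, 8, 4, 16
group = num_heads // num_kv_heads  # query heads per shared KV head (2)

q = rng.normal(size=(num_heads, T, head_dim))
k = rng.normal(size=(num_kv_heads, T, head_dim))
v = rng.normal(size=(num_kv_heads, T, head_dim))

# Expand K/V so each consecutive group of query heads reads the same KV head.
k = np.repeat(k, group, axis=0)  # (num_heads, T, head_dim)
v = np.repeat(v, group, axis=0)

# Standard scaled dot-product attention per head.
scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)   # (num_heads, T, T)
attn = np.exp(scores - scores.max(-1, keepdims=True))
attn /= attn.sum(-1, keepdims=True)
out = attn @ v                                          # (num_heads, T, head_dim)
```

The memory saving comes from the KV cache: only `num_kv_heads` K/V tensors are stored, half as many as a fully multi-headed layout here.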
Sequence Length
- sequence_length: train_length: 2048, eval_length: null.
Compression
- zlib (level: null)
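A sketch of what the int8 + zlib roundtrip named in the contributions list might look like. The symmetric per-tensor quantization scheme is an assumption, not confirmed by this card; with level null, `zlib.compress` falls back to the library default:

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(scale=0.02, size=4096).astype(np.float32)

# Symmetric per-tensor int8 quantization: map the max |w| to 127.
scale = np.abs(weights).max() / 127.0
q = np.round(weights / scale).astype(np.int8)

# Serialize: zlib-compress the raw int8 payload (default compression level).
blob = zlib.compress(q.tobytes())

# Roundtrip: decompress and dequantize.
q2 = np.frombuffer(zlib.decompress(blob), dtype=np.int8)
restored = q2.astype(np.float32) * scale

err = np.abs(weights - restored).max()
```

The int8 payload is lossless through zlib; the only error is the quantization step, bounded by half the scale.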
Novel Contributions
- Long-context training at sequence length 2048
- Tied input/output embeddings
- Reduced KV head count (8 attention heads, 4 KV heads)
- Standalone record script with baked-in defaults
- Int8 + zlib roundtrip serialization for the final submission artifact