val_bpb: 1.2355
Architecture: Transformer
Optimizer: —
Artifact Size: 15.87 MB
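val_bpb is validation bits per byte. The exact normalization behind this number is not stated here, so the sketch below is only an assumption about how such a metric is typically computed: sum the per-token losses, convert nats to bits, and divide by the UTF-8 byte count of the evaluated text.

```python
import math

def bits_per_byte(token_nll_nats, num_bytes):
    """Convert summed per-token negative log-likelihood (in nats)
    into bits per byte of the underlying UTF-8 text."""
    total_bits = sum(token_nll_nats) / math.log(2)  # nats -> bits
    return total_bits / num_bytes

# Illustrative numbers only: 100 tokens at ~3.0 nats each over 350 bytes of text.
losses = [3.0] * 100
print(round(bits_per_byte(losses, 350), 4))
```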
Training Techniques
Quantization: int8 (bits: 8, scope: all)
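One common way to realize int8 with bits: 8 over all tensors is symmetric per-tensor quantization. The sketch below assumes that scheme; whether the artifact actually uses per-tensor or per-channel scales is not specified here.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: int8 values plus one fp32 scale."""
    scale = np.abs(w).max() / 127.0 if w.size else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, np.float32(scale)

def dequantize_int8(q: np.ndarray, scale: np.float32) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(512, 512).astype(np.float32)
q, s = quantize_int8(w)
w_hat = dequantize_int8(q, s)
print(q.dtype, float(np.abs(w - w_hat).max()))  # int8, small reconstruction error
```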
Architecture: GQA
Grouped-query attention with 8 attention heads and 4 KV heads (parameters: {"heads":8,"kv_heads":4}).
Tied embeddings
Input and output embeddings are tied (parameters: null).
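Tying means the output projection reuses the token-embedding matrix, so only one (vocab, d_model) matrix is stored and trained. A minimal PyTorch sketch, using the 1024 vocab and 512 width from the contributions list below; the module names are assumptions.

```python
import torch.nn as nn

class TiedLM(nn.Module):
    def __init__(self, vocab_size=1024, d_model=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        self.lm_head.weight = self.embed.weight  # tie: one shared (vocab, d_model) matrix

model = TiedLM()
assert model.lm_head.weight is model.embed.weight
```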
Sequence Length: sequence_length (train_length: 1024, eval_length: null)
Compression: zlib (level: null)
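A minimal sketch of the int8+zlib packaging step, reading level: null as the zlib default. The container format (a NumPy .npz wrapped in a zlib stream) is an assumption for illustration, not the actual artifact layout.

```python
import io
import zlib
import numpy as np

def pack_artifact(tensors: dict) -> bytes:
    """Serialize named (already int8-quantized) arrays and zlib-compress the result."""
    buf = io.BytesIO()
    np.savez(buf, **tensors)              # simple container; format is an assumption
    return zlib.compress(buf.getvalue())  # default compression level

q = np.random.randint(-127, 128, size=(512, 512), dtype=np.int8)
blob = pack_artifact({"layer0.attn.wq": q})
print(f"{len(blob) / 1e6:.2f} MB compressed")
```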
Novel Contributions
- 9-layer 512-dim transformer baseline
- 1024-vocab SentencePiece BPE tokenizer (training sketch after this list)
- Grouped-query attention with 8 heads and 4 KV heads
- Tied embeddings
- Training on 80 FineWeb shards (~8B tokens)
- int8+zlib artifact packaging
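For the 1024-vocab SentencePiece BPE tokenizer listed above, a minimal training sketch; the corpus path, output prefix, and coverage setting are placeholders, not taken from this artifact.

```python
import sentencepiece as spm

# Train a 1024-token BPE model; corpus.txt stands in for the actual training text.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="fineweb_bpe1024",   # assumed output name
    vocab_size=1024,
    model_type="bpe",
    character_coverage=1.0,
)

sp = spm.SentencePieceProcessor(model_file="fineweb_bpe1024.model")
print(sp.encode("hello world", out_type=str))
```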