| Field | Value |
| --- | --- |
| val_bpb | 1.2058 |
| Architecture | Transformer |
| Optimizer | — |
| Artifact Size | 15,538,222 bytes |
Training Techniques
Architecture
**weight tying**: Tied input and output embeddings. No parameters.
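A minimal PyTorch sketch of the tying step; the class and attribute names are illustrative, not taken from the artifact.

```python
import torch.nn as nn

class TiedLM(nn.Module):
    # Hypothetical module showing weight tying; names are not from the run.
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        # Tie input and output embeddings: both layers now share a single
        # parameter tensor, so both gradients accumulate into the same weights.
        self.lm_head.weight = self.embed.weight
```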
**GQA**: Grouped-query attention with fewer KV heads than query heads (`num_heads: 8`, `num_kv_heads: 4`).
**attention modification**: Depth-scheduled local/global attention pattern, with local windows followed by full attention (`pattern: "40,80,full"`).
**SwiGLU**: Standard MLP blocks replaced with SwiGLU feedforward blocks (`mlp_mult: 1.625`).
**sequence packing**: Randomized sequence packing with synchronized per-step offsets across DDP ranks. No parameters.
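A sketch of one way to synchronize offsets without communication: derive each step's offset from (seed, step) alone so every rank computes the same value. The run's actual mechanism is not described beyond the summary above, so the names and the sharding rule here are assumptions.

```python
import torch

def synchronized_offset(step, max_offset, seed=1234):
    # Same (seed, step) on every DDP rank -> same offset, no collective needed.
    g = torch.Generator().manual_seed(seed + step)
    return torch.randint(0, max_offset, (1,), generator=g).item()

def get_batch(tokens, step, rank, world_size, batch_size, seq_len):
    # tokens: 1-D packed token stream. All ranks share one random offset,
    # then stride through the stream by rank so shards never overlap.
    off = synchronized_offset(step, max_offset=seq_len)
    span = batch_size * seq_len
    start = off + rank * span
    x = tokens[start : start + span].view(batch_size, seq_len)
    y = tokens[start + 1 : start + 1 + span].view(batch_size, seq_len)
    return x, y
```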
Quantization
**mixed int6/int8**: `attn.proj.weight` quantized to 6 bits; all other weights to 8 bits.
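A round-to-nearest symmetric quantizer matching the reported bit split. Per-tensor scaling, the scale format, and the int6 storage container are all assumptions.

```python
import torch

def quantize_symmetric(w, bits):
    # Symmetric per-tensor quantization (an assumption; the run may use
    # per-channel scales or a different rounding scheme).
    qmax = 2 ** (bits - 1) - 1           # 31 for int6, 127 for int8
    scale = w.abs().max() / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale       # int6 values still fit in an int8 container

def quantize_weights(state_dict):
    # attn.proj.weight at 6 bits, everything else at 8, per the reported scope.
    return {
        name: quantize_symmetric(w, 6 if name.endswith("attn.proj.weight") else 8)
        for name, w in state_dict.items()
    }
```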
Sequence Length
**sequence_length**: train_length 2048; eval_length not specified.
LR Schedule
**warmdown** (`warmdown_frac: 0.75`).
Other
- SP-1536 tokenizer and dataset variant `fineweb10B_sp1536` (`vocab_size: 1536`).
- Periodic validation every 1000 steps on the full validation split (`val_every_steps: 1000`).
Novel Contributions
- Depth-scheduled local/global attention transformer
- SwiGLU feedforward blocks
- Randomized sequence packing with synchronized offsets across DDP ranks
- Selective mixed-bit quantization with int6 attention output projections
- SP-1536 tokenizer and dataset variant