val_bpb
1.3517
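For context, bits per byte (bpb) converts the model's mean cross-entropy loss from nats per token into bits per byte of raw evaluation text, so tokenizers with different compression rates stay comparable. A minimal sketch of the conversion; the loss and token/byte counts below are illustrative placeholders, not this run's values:

```python
import math

def bits_per_byte(loss_nats_per_token: float, tokens: int, text_bytes: int) -> float:
    """Convert mean cross-entropy (nats/token) into bits per byte of raw text."""
    bits_per_token = loss_nats_per_token / math.log(2)  # nats -> bits
    return bits_per_token * tokens / text_bytes

# Illustrative numbers only (not this submission's actual loss or corpus size).
print(round(bits_per_byte(loss_nats_per_token=4.0, tokens=1000, text_bytes=4096), 4))
```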
Architecture
MPK-style multi-path causal language model
Optimizer
—
Artifact Size
14589400 bytes (≈13.9 MiB)
Training Techniques
Architecture
weight tying
Tied embeddings were enabled for the MPK model.
parameters: null
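Weight tying reuses one matrix as both the input embedding table and (transposed) the output logit projection, cutting parameter count and artifact size. A framework-free NumPy sketch of the idea; the shapes and function names are illustrative, not taken from the MPK implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, width = 16, 8

# One shared matrix: rows are token embeddings, and its transpose
# serves as the output (logit) projection -- weight tying.
emb = rng.standard_normal((vocab, width))

def embed(token_ids: np.ndarray) -> np.ndarray:
    return emb[token_ids]          # lookup: (n,) -> (n, width)

def logits(hidden: np.ndarray) -> np.ndarray:
    return hidden @ emb.T          # projection reuses the same weights

h = embed(np.array([3, 5]))
print(logits(h).shape)  # (2, 16)
```

Any gradient update to `emb` affects both roles at once, which is why tied models are often trained with a somewhat lower learning rate.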
KV head count
Used fewer KV heads than attention heads (grouped-query attention) in the MPK configuration.
parameters: {"heads":8,"kv_heads":4}
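With 8 query heads over 4 KV heads, each pair of query heads shares one K/V head (grouped-query attention), shrinking the KV cache and KV projection weights. A minimal NumPy sketch under the card's 8/4 head counts; head_dim and sequence length are illustrative assumptions:

```python
import numpy as np

heads, kv_heads, head_dim, seq = 8, 4, 16, 10
group = heads // kv_heads  # 2 query heads share each KV head

rng = np.random.default_rng(0)
q = rng.standard_normal((heads, seq, head_dim))
k = rng.standard_normal((kv_heads, seq, head_dim))
v = rng.standard_normal((kv_heads, seq, head_dim))

# Expand KV heads so each group of query heads attends to the same K/V.
k_full = np.repeat(k, group, axis=0)  # (heads, seq, head_dim)
v_full = np.repeat(v, group, axis=0)

scores = q @ k_full.transpose(0, 2, 1) / np.sqrt(head_dim)
mask = np.triu(np.ones((seq, seq), dtype=bool), k=1)  # causal mask
scores[:, mask] = -np.inf
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)
out = weights @ v_full
print(out.shape)  # (8, 10, 16)
```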
depth recurrence
MPK multi-path causal architecture with temporal strides.
parameters: {"layers":8,"width":384,"k_stride":2,"m_stride":4}
Quantization
int8
bits: 8
scope: model weights
Compression
zlib
level: null
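The artifact pipeline above (int8 weight quantization followed by zlib compression) can be sketched as follows. This is a generic symmetric per-tensor scheme, an assumption; the submission's actual scale handling and layout may differ:

```python
import zlib
import numpy as np

def serialize_int8_zlib(w: np.ndarray) -> tuple[bytes, float]:
    """Symmetric per-tensor int8 quantization, then zlib compression."""
    scale = float(np.abs(w).max()) / 127.0 or 1.0  # avoid zero scale for all-zero tensors
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return zlib.compress(q.tobytes()), scale

def deserialize(blob: bytes, scale: float, shape) -> np.ndarray:
    q = np.frombuffer(zlib.decompress(blob), dtype=np.int8).reshape(shape)
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)
blob, scale = serialize_int8_zlib(w)
w_hat = deserialize(blob, scale, w.shape)
# Round-trip error is bounded by half a quantization step.
print(np.abs(w - w_hat).max() <= scale / 2 + 1e-6)
```

zlib on already-quantized weights mostly recovers entropy slack; the bulk of the size reduction here comes from the float32-to-int8 step itself.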
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024
Other
other
Trained on 80 FineWeb SP-1024 shards under a 10-minute wallclock limit.
parameters: {"train_shards":80,"wallclock_seconds":600}
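A wallclock-limited run like the one above is typically enforced by checking a monotonic deadline between steps rather than fixing a step count. A hypothetical sketch (the `train` helper and `step_fn` callback are illustrative, not the trainer's actual API):

```python
import time

def train(step_fn, wallclock_seconds: float = 600.0) -> int:
    """Run training steps until the wallclock budget is exhausted; return step count."""
    deadline = time.monotonic() + wallclock_seconds
    steps = 0
    while time.monotonic() < deadline:
        step_fn()  # one optimizer step on the next batch
        steps += 1
    return steps
```

Using `time.monotonic()` rather than `time.time()` keeps the budget immune to system clock adjustments mid-run.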
Regularization
weight decay
parameters: null
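The card does not record the optimizer, so as a generic illustration of decoupled weight decay (the variant used in AdamW-style optimizers), applied here to a plain SGD step; all hyperparameter values are placeholders:

```python
import numpy as np

def sgd_step_with_decay(w: np.ndarray, grad: np.ndarray,
                        lr: float = 0.1, weight_decay: float = 0.01) -> np.ndarray:
    # Decoupled weight decay: shrink weights toward zero as a separate
    # term, rather than folding an L2 penalty into the gradient.
    return w - lr * grad - lr * weight_decay * w

w = sgd_step_with_decay(np.ones(3), grad=np.zeros(3))
print(w)  # [0.999 0.999 0.999]
```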
Novel Contributions
- Added an MPK model family implementation to the trainer
- Used an 8-layer, width-384 MPK configuration with 8 attention heads and 4 KV heads
- Applied MPK temporal strides k=2 and m=4
- Enabled tied embeddings and tuned the learning rate lower to compensate
- Produced a bug-fixed rerun after correcting SentencePiece leading-space marker accounting
- Submitted a 10-minute wallclock-limited record candidate with int8+zlib serialization