PR #144 (closed)

Add MPK 8x384 10-minute submission record

by DJLougen
val_bpb: 1.3517
Architecture: MPK-style multi-path causal language model
Optimizer:
Artifact Size: 14589400 bytes (about 13.9 MiB)

Training Techniques

Architecture
weight tying
Tied embeddings were enabled for the MPK model.
parameters: null
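Weight tying uses one shared matrix for both the input embedding lookup and the output logit projection, halving the embedding parameter count. A minimal numpy sketch, assuming a width of 384 as in this submission; the vocabulary size here is illustrative, not taken from the PR:

```python
import numpy as np

vocab, width = 1000, 384          # vocab is an assumption; width matches the PR
rng = np.random.default_rng(0)
E = rng.standard_normal((vocab, width)) * 0.02  # single shared matrix

def embed(token_ids):
    # Input side: look up rows of the shared matrix.
    return E[token_ids]

def logits(hidden):
    # Output side: project with the transpose of the SAME matrix
    # (weight tying), instead of a separate vocab x width output head.
    return hidden @ E.T

h = embed(np.array([1, 2, 3]))    # (3, width)
out = logits(h)                   # (3, vocab)
```

With tying, the embedding/head parameters total vocab * width instead of 2 * vocab * width, which matters for an artifact-size-constrained record.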
KV head count
Used fewer KV heads than attention heads in the MPK configuration.
parameters: {"heads":8,"kv_heads":4}
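With 8 attention heads sharing 4 KV heads, each pair of query heads attends over the same key/value projection (grouped-query attention). A sketch of the mechanics in numpy, assuming head_dim 48 so that 8 * 48 = 384; the sequence length is arbitrary:

```python
import numpy as np

heads, kv_heads, head_dim, seq = 8, 4, 48, 5
group = heads // kv_heads             # 2 query heads per KV head
rng = np.random.default_rng(0)
q = rng.standard_normal((heads, seq, head_dim))
k = rng.standard_normal((kv_heads, seq, head_dim))
v = rng.standard_normal((kv_heads, seq, head_dim))

# Each query head h uses KV head h // group: repeat KV heads to match.
k_full = np.repeat(k, group, axis=0)  # (heads, seq, head_dim)
v_full = np.repeat(v, group, axis=0)

scores = q @ k_full.transpose(0, 2, 1) / np.sqrt(head_dim)
mask = np.triu(np.ones((seq, seq), dtype=bool), k=1)  # causal mask
scores[:, mask] = -np.inf
w = np.exp(scores - scores.max(-1, keepdims=True))
w /= w.sum(-1, keepdims=True)
out = w @ v_full                      # (heads, seq, head_dim)
```

Halving the KV heads halves the K/V projection parameters and KV-cache size while keeping 8 query heads.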
depth recurrence
MPK multi-path causal architecture with temporal strides.
parameters: {"layers":8,"width":384,"k_stride":2,"m_stride":4}
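The MPK architecture is this PR's own model family and its internals are not specified here. Purely as a hypothetical illustration of "multi-path causal mixing with temporal strides k=2 and m=4" (this is an assumption about what the strides do, not the PR's implementation), one could imagine each position combining its own state with states k and m steps back:

```python
import numpy as np

layers, width, k_stride, m_stride = 8, 384, 2, 4  # values from the PR config
seq = 10
rng = np.random.default_rng(0)
x = rng.standard_normal((seq, width))

def shift(x, s):
    # Causal shift: position t sees position t - s (zeros before the start).
    out = np.zeros_like(x)
    out[s:] = x[:-s] if s else x
    return out

def mpk_block(x):
    # Hypothetical multi-path mix over three causal paths (strides 1/k/m),
    # followed by a pointwise nonlinearity. Illustrative only.
    mixed = x + shift(x, k_stride) + shift(x, m_stride)
    return np.tanh(mixed / 3.0)

h = x
for _ in range(layers):   # 8 layers, as in the submission
    h = mpk_block(h)
```

Every path only reads positions at or before t, so causality is preserved regardless of stride.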
Quantization
int8
bits: 8
scope: model weights
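The PR does not say which int8 scheme was used; a common minimal sketch is symmetric per-tensor quantization of the weights (an assumption, shown here for illustration):

```python
import numpy as np

def quantize_int8(w):
    # Symmetric per-tensor int8: one scale = max |w| / 127 for the tensor.
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).standard_normal((384, 384)).astype(np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
err = np.abs(w - w_hat).max()   # rounding error is at most scale / 2
```

Storing int8 values plus one float scale per tensor cuts weight storage roughly 4x versus float32 before any zlib compression.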
Compression
zlib
level: null
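Since the compression level is listed as null, the sketch below uses zlib's default level; the round-trip shape/dtype handling is illustrative, not the PR's serialization code:

```python
import zlib
import numpy as np

q = np.zeros((384, 384), dtype=np.int8)  # stand-in for a quantized tensor
raw = q.tobytes()
packed = zlib.compress(raw)              # default compression level

# Lossless round trip: decompress and reinterpret as int8.
restored = np.frombuffer(zlib.decompress(packed), dtype=np.int8).reshape(q.shape)
```

zlib after int8 quantization exploits the low entropy of quantized weights, shrinking the final artifact below the raw int8 size.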
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024
Other
other
Used 80 FineWeb SP-1024 training shards and a 10-minute wallclock-limited training run.
parameters: {"train_shards":80,"wallclock_seconds":600}
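A wallclock-limited run stops on elapsed time rather than step count. A minimal sketch of such a loop under the submission's budget; `train_step` is a hypothetical placeholder, and cycling through shards by index is an assumption about the data schedule:

```python
import time

WALLCLOCK_SECONDS = 600   # 10-minute budget from the submission
TRAIN_SHARDS = 80         # FineWeb SP-1024 shards

def train_step(shard_id, step):
    # Placeholder for one optimizer step on one shard (hypothetical).
    pass

def run(budget=WALLCLOCK_SECONDS):
    start = time.monotonic()
    step = 0
    while time.monotonic() - start < budget:
        shard = step % TRAIN_SHARDS   # cycle through the 80 shards
        train_step(shard, step)
        step += 1
    return step
```

Using `time.monotonic()` rather than `time.time()` keeps the budget immune to system clock adjustments mid-run.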
Regularization
weight decay
parameters: null

Novel Contributions

  • Added an MPK model family implementation to the trainer
  • Used an 8-layer, width-384 MPK configuration with 8 attention heads and 4 KV heads
  • Applied MPK temporal strides k=2 and m=4
  • Enabled tied embeddings with tuned lower learning rates
  • Produced a corrected rerun after fixing SentencePiece leading-space marker accounting
  • Submitted a 10-minute wallclock-limited record candidate with int8+zlib serialization
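On the leading-space marker fix above: SentencePiece prefixes word-initial pieces with "▁" (U+2581), which must be counted as a space in the detokenized text, not as extra characters. A sketch of the accounting, assuming the standard marker convention (the PR's actual fix is not shown here):

```python
# "▁" (U+2581) marks pieces that begin a new word in SentencePiece output.
pieces = ["▁Hello", "▁wor", "ld", "▁!"]

def decode(pieces):
    text = "".join(pieces).replace("\u2581", " ")
    return text.lstrip(" ")   # the first piece's marker adds no real space

def char_count(pieces):
    # Count characters of the detokenized text, not of the raw pieces.
    return len(decode(pieces))

decode(pieces)  # "Hello world !"
```

Miscounting the marker inflates the character/byte total, which directly skews a bits-per-byte (val_bpb) metric.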