val_bpb
1.3517
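For context, bits per byte (bpb) converts the model's mean cross-entropy loss from nats per token into bits per byte of raw evaluation text, so tokenizers with different compression rates stay comparable. A minimal sketch of the conversion; the loss and token/byte counts below are illustrative placeholders, not this run's values:

```python
import math

def bits_per_byte(loss_nats_per_token: float, tokens: int, text_bytes: int) -> float:
    """Convert mean cross-entropy (nats/token) into bits per byte of raw text."""
    bits_per_token = loss_nats_per_token / math.log(2)  # nats -> bits
    return bits_per_token * tokens / text_bytes

# Illustrative numbers only (not this submission's actual loss or corpus size).
print(round(bits_per_byte(loss_nats_per_token=4.0, tokens=1000, text_bytes=4096), 4))
```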
Architecture
MPK-style multi-path causal language model
Optimizer
—
Artifact Size
14589400 bytes (≈13.9 MiB)
Training Techniques
Architecture
weight tying
Tied embeddings were enabled for the MPK model.
parameters: null
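Weight tying reuses one matrix as both the input embedding table and (transposed) the output logit projection, cutting parameter count and artifact size. A framework-free NumPy sketch of the idea; the shapes and function names are illustrative, not taken from the MPK implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, width = 16, 8

# One shared matrix: rows are token embeddings, and its transpose
# serves as the output (logit) projection -- weight tying.
emb = rng.standard_normal((vocab, width))

def embed(token_ids: np.ndarray) -> np.ndarray:
    return emb[token_ids]          # lookup: (n,) -> (n, width)

def logits(hidden: np.ndarray) -> np.ndarray:
    return hidden @ emb.T          # projection reuses the same weights

h = embed(np.array([3, 5]))
print(logits(h).shape)  # (2, 16)
```

Any gradient update to `emb` affects both roles at once, which is why tied models are often trained with a somewhat lower learning rate.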
KV head count
Used fewer KV heads than attention heads (grouped-query attention) in the MPK configuration.
parameters: {"heads":8,"kv_heads":4}
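With 8 query heads over 4 KV heads, each pair of query heads shares one K/V head (grouped-query attention), shrinking the KV cache and KV projection weights. A minimal NumPy sketch under the card's 8/4 head counts; head_dim and sequence length are illustrative assumptions:

```python
import numpy as np

heads, kv_heads, head_dim, seq = 8, 4, 16, 10
group = heads // kv_heads  # 2 query heads share each KV head

rng = np.random.default_rng(0)
q = rng.standard_normal((heads, seq, head_dim))
k = rng.standard_normal((kv_heads, seq, head_dim))
v = rng.standard_normal((kv_heads, seq, head_dim))

# Expand KV heads so each group of query heads attends to the same K/V.
k_full = np.repeat(k, group, axis=0)  # (heads, seq, head_dim)
v_full = np.repeat(v, group, axis=0)

scores = q @ k_full.transpose(0, 2, 1) / np.sqrt(head_dim)
mask = np.triu(np.ones((seq, seq), dtype=bool), k=1)  # causal mask
scores[:, mask] = -np.inf
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)
out = weights @ v_full
print(out.shape)  # (8, 10, 16)
```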
depth recurrence
MPK multi-path causal architecture with temporal strides.
parameters: {"layers":8,"width":384,"k_stride":2,"m_stride":4}
Quantization
int8
bits: 8
scope: model weights
Compression
zlib
level: null
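The artifact pipeline above (int8 weight quantization followed by zlib compression) can be sketched as follows. This is a generic symmetric per-tensor scheme, an assumption; the submission's actual scale handling and layout may differ:

```python
import zlib
import numpy as np

def serialize_int8_zlib(w: np.ndarray) -> tuple[bytes, float]:
    """Symmetric per-tensor int8 quantization, then zlib compression."""
    scale = float(np.abs(w).max()) / 127.0 or 1.0  # avoid zero scale for all-zero tensors
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return zlib.compress(q.tobytes()), scale

def deserialize(blob: bytes, scale: float, shape) -> np.ndarray:
    q = np.frombuffer(zlib.decompress(blob), dtype=np.int8).reshape(shape)
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)
blob, scale = serialize_int8_zlib(w)
w_hat = deserialize(blob, scale, w.shape)
# Round-trip error is bounded by half a quantization step.
print(np.abs(w - w_hat).max() <= scale / 2 + 1e-6)
```

zlib on already-quantized weights mostly recovers entropy slack; the bulk of the size reduction here comes from the float32-to-int8 step itself.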
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024
Other
other
Trained on 80 FineWeb SP-1024 shards under a 10-minute wallclock limit.
parameters: {"train_shards":80,"wallclock_seconds":600}
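A wallclock-limited run like the one above is typically enforced by checking a monotonic deadline between steps rather than fixing a step count. A hypothetical sketch (the `train` helper and `step_fn` callback are illustrative, not the trainer's actual API):

```python
import time

def train(step_fn, wallclock_seconds: float = 600.0) -> int:
    """Run training steps until the wallclock budget is exhausted; return step count."""
    deadline = time.monotonic() + wallclock_seconds
    steps = 0
    while time.monotonic() < deadline:
        step_fn()  # one optimizer step on the next batch
        steps += 1
    return steps
```

Using `time.monotonic()` rather than `time.time()` keeps the budget immune to system clock adjustments mid-run.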
Regularization
weight decay
parameters: null
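The card does not record the optimizer, so as a generic illustration of decoupled weight decay (the variant used in AdamW-style optimizers), applied here to a plain SGD step; all hyperparameter values are placeholders:

```python
import numpy as np

def sgd_step_with_decay(w: np.ndarray, grad: np.ndarray,
                        lr: float = 0.1, weight_decay: float = 0.01) -> np.ndarray:
    # Decoupled weight decay: shrink weights toward zero as a separate
    # term, rather than folding an L2 penalty into the gradient.
    return w - lr * grad - lr * weight_decay * w

w = sgd_step_with_decay(np.ones(3), grad=np.zeros(3))
print(w)  # [0.999 0.999 0.999]
```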
Novel Contributions
- Added an MPK model family implementation to the trainer
- Used an 8-layer, width-384 MPK configuration with 8 attention heads and 4 KV heads
- Applied MPK temporal strides k=2 and m=4
- Enabled tied embeddings and tuned the learning rate lower to compensate
- Produced a bug-fixed rerun after correcting SentencePiece leading-space marker accounting
- Submitted a 10-minute wallclock-limited record candidate with int8+zlib serialization