PR #56

open

Add Deep14x416 KV2 non-record MLX submission (val_bpb=1.8440)

by cschubiner
val_bpb: 1.8440
Architecture: Transformer
Optimizer:
Artifact Size: 12,388,989 bytes

Training Techniques

Architecture: KV head count
Uses a deeper, narrower SP-1024 Transformer with the KV projections reduced to 2 KV heads (grouped-query attention: 8 query heads share 2 KV heads).
parameters: {"layers":14,"model_dim":416,"num_heads":8,"num_kv_heads":2,"mlp_mult":2}
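As a sanity check on the parameter budget, the listed config can be turned into a rough weight count. This is a sketch under stated assumptions: the vocabulary size of 1024 is inferred from the "SP-1024" name, and layer norms, biases, and the unembedding layout are not specified in the PR, so they are ignored here.

```python
def transformer_param_estimate(layers=14, model_dim=416, num_heads=8,
                               num_kv_heads=2, mlp_mult=2, vocab_size=1024):
    """Rough weight-matrix count; norms/biases ignored (assumption)."""
    head_dim = model_dim // num_heads                     # 416 / 8 = 52
    q_proj = model_dim * model_dim
    kv_proj = 2 * model_dim * (num_kv_heads * head_dim)   # K and V shrink with fewer KV heads
    out_proj = model_dim * model_dim
    mlp = 2 * model_dim * (mlp_mult * model_dim)          # up + down projections
    per_layer = q_proj + kv_proj + out_proj + mlp
    return layers * per_layer + vocab_size * model_dim    # plus token embedding
```

Dropping from 8 to 2 KV heads shrinks only the K/V projections, which is why the savings are modest relative to the MLP, but the KV cache at inference shrinks by the full 4x.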
Quantization: int8
bits: 8
scope: all
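A minimal sketch of int8 weight quantization of the kind this technique implies. Symmetric per-tensor scaling is an assumption here; the submission's actual scheme (per-channel vs. per-tensor, how scales are stored) is not stated in the PR.

```python
import numpy as np

def quantize_int8(w):
    # Symmetric per-tensor int8: one float scale per tensor (assumed scheme).
    scale = float(np.abs(w).max()) / 127.0 or 1.0   # `or 1.0` guards all-zero tensors
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    return q.astype(np.float32) * scale
```

With scope "all", every weight tensor is stored this way, cutting the raw artifact to roughly a quarter of its float32 size before compression.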
Compression: zlib
level: null
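The packing step amounts to running the quantized weight bytes through zlib; with `level: null`, a natural reading is that the library default is used. The function names below are hypothetical, not from the PR.

```python
import zlib

def pack_artifact(int8_bytes: bytes) -> bytes:
    # No explicit level -> zlib.compress falls back to its default level
    return zlib.compress(int8_bytes)

def unpack_artifact(blob: bytes) -> bytes:
    return zlib.decompress(blob)
```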
Evaluation: logit chunking
parameters: {"logit_chunk_tokens":65536}
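Logit chunking keeps validation memory bounded by computing the (tokens x vocab) logit matrix 65,536 tokens at a time instead of all at once. A NumPy sketch of the idea (the PR's MLX implementation may differ in detail; converting bits per token to bits per byte additionally needs the token-to-byte ratio, which is not given here):

```python
import numpy as np

def chunked_nll_bits(hidden, unembed, targets, chunk_tokens=65536):
    # Cross-entropy in bits per token, materializing only (chunk, vocab)
    # logits per step rather than the full (tokens, vocab) matrix.
    total_nll = 0.0
    n = hidden.shape[0]
    for i in range(0, n, chunk_tokens):
        logits = hidden[i:i + chunk_tokens] @ unembed          # (c, vocab)
        logits = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
        log_z = np.log(np.exp(logits).sum(axis=-1))
        rows = np.arange(logits.shape[0])
        total_nll += float((log_z - logits[rows, targets[i:i + chunk_tokens]]).sum())
    return total_nll / (n * np.log(2))                         # nats -> bits
```

Because each chunk's loss is summed exactly, the chunk size changes peak memory but not the result.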
Other
Increased validation batch size to make full validation tractable on local Apple Silicon hardware.
parameters: {"val_batch_size":8388608}
Sequence Length
train_length: 16384
eval_length: null
LR Schedule: warmup
parameters: {"warmup_steps":10}
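With only `warmup_steps` specified, the schedule's shape is underdetermined. The sketch below assumes linear warmup over the first 10 optimizer steps and, purely for illustration, holds the learning rate constant afterwards; the post-warmup behavior is not stated in the PR.

```python
def lr_with_warmup(step, base_lr, warmup_steps=10):
    # Linear ramp from base_lr/warmup_steps up to base_lr over the first
    # `warmup_steps` steps; constant thereafter (assumed, not from the PR).
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr
```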

Novel Contributions

  • Adds a reproducible Apple Silicon MLX submission for a deeper/narrower SP-1024 configuration.
  • Explores a parameter-budget tradeoff by reducing width, increasing depth, and using fewer KV heads.
  • Documents a non-record unlimited-compute run under the 16 MB artifact cap.
  • Includes exact trainer snapshot, shard list, and training log for reproducibility.
  • Uses larger validation batch size and logit chunking to complete validation efficiently on local hardware.