- val_bpb: 1.8440
- Architecture: Transformer
- Optimizer: —
- Artifact Size: 12,388,989 bytes
Training Techniques
- Architecture (KV head count): Uses a deeper, narrower SP-1024 Transformer with reduced KV sharing via 2 KV heads. Parameters: `{"layers":14,"model_dim":416,"num_heads":8,"num_kv_heads":2,"mlp_mult":2}`
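A rough weight-count for this configuration can be sketched as below. The MLP shape (plain up/down projection rather than a gated variant), tied input/output embeddings, a vocab size of 1024 inferred from "SP-1024", and the omission of norm parameters and biases are all assumptions, not confirmed by the card:

```python
# Hedged parameter-count sketch for the listed config; MLP structure,
# embedding tying, and vocab_size=1024 (from "SP-1024") are assumptions.
def param_count(layers=14, model_dim=416, num_heads=8, num_kv_heads=2,
                mlp_mult=2, vocab_size=1024):
    head_dim = model_dim // num_heads              # 52
    kv_dim = num_kv_heads * head_dim               # 104: reduced K/V width
    attn = 2 * model_dim * model_dim               # Q and output projections
    attn += 2 * model_dim * kv_dim                 # narrower K and V projections
    mlp = 2 * model_dim * (mlp_mult * model_dim)   # up + down, assumed ungated
    embed = vocab_size * model_dim                 # assumed tied embedding
    # norm scales and biases omitted as negligible at this scale
    return layers * (attn + mlp) + embed

print(param_count())
```

Under these assumptions the count lands around 16.17M weights, i.e. roughly 16.17 MB at one int8 byte per weight, which zlib would then have to shrink under the 16 MB artifact cap; this is consistent with the 12,388,989-byte artifact, but the exact breakdown is a guess.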
- Quantization (int8): bits: 8; scope: all.
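The card only says int8 over all weights; a minimal sketch of one plausible scheme (symmetric, per-tensor, round-to-nearest) looks like this. Per-channel scaling or a different rounding rule may well be what the submission actually uses:

```python
import numpy as np

# Symmetric per-tensor int8 quantization sketch; the submission's exact
# scheme (per-tensor vs per-channel, rounding mode) is not specified.
def quantize_int8(w: np.ndarray):
    scale = float(np.abs(w).max()) / 127.0
    scale = scale if scale > 0.0 else 1.0          # guard all-zero tensors
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(416, 416).astype(np.float32)
q, s = quantize_int8(w)
err = float(np.abs(dequantize(q, s) - w).max())    # bounded by scale / 2
```

With round-to-nearest, the worst-case reconstruction error per weight is half a quantization step, i.e. `scale / 2`.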
- Compression (zlib): level: null.
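The packaging step can be sketched with the standard-library `zlib` module; `level: null` is read here as zlib's default compression level, which is an assumption, and the zero-filled array standing in for the quantized weights is purely illustrative:

```python
import zlib
import numpy as np

# Sketch of the artifact packaging step: serialize int8 weights, then
# zlib-compress them. level: null is assumed to mean zlib's default level.
q = np.zeros((416, 416), dtype=np.int8)            # stand-in for quantized weights
blob = zlib.compress(q.tobytes())                  # default compression level

# Loading reverses the pipeline: decompress, then reinterpret the bytes.
restored = np.frombuffer(zlib.decompress(blob), dtype=np.int8).reshape(416, 416)
```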
- Evaluation (logit chunking): Parameters: `{"logit_chunk_tokens":65536}`
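Logit chunking keeps the full `[tokens, vocab]` logit matrix from ever being materialized at once: the output projection and loss are computed 65,536 tokens at a time, bounding peak memory during validation. A numpy sketch of the idea, with illustrative names (`hidden`, `W_out`) not taken from the submission code:

```python
import numpy as np

# Sketch of chunked validation loss: project and reduce the logits one
# chunk of tokens at a time instead of building the full [T, V] matrix.
def chunked_nll(hidden, W_out, targets, chunk_tokens=65536):
    total, n = 0.0, hidden.shape[0]
    for start in range(0, n, chunk_tokens):
        h = hidden[start:start + chunk_tokens]         # [chunk, model_dim]
        logits = h @ W_out                             # [chunk, vocab]
        logits -= logits.max(axis=1, keepdims=True)    # numerical stability
        logsumexp = np.log(np.exp(logits).sum(axis=1))
        t = targets[start:start + chunk_tokens]
        total += (logsumexp - logits[np.arange(len(t)), t]).sum()
    return total / n                                   # mean NLL in nats
```

The chunked result is exactly equal to the unchunked loss; only the peak-memory profile changes, which is what makes full validation tractable on a single machine.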
- Other: Increased validation batch size to make full validation tractable on local Apple Silicon hardware. Parameters: `{"val_batch_size":8388608}`
- Sequence Length: train_length: 16384; eval_length: null.
- LR Schedule (warmup): Parameters: `{"warmup_steps":10}`
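Only the warmup length is given; a minimal sketch, assuming linear warmup from zero to the base rate over the 10 steps, with the post-warmup schedule (constant, cosine, etc.) left unspecified by the card:

```python
# Linear warmup sketch: ramp the LR to base_lr over warmup_steps, then
# hold it. The actual post-warmup schedule is not stated in the card.
def lr_at(step: int, base_lr: float, warmup_steps: int = 10) -> float:
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr
```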
Novel Contributions
- Adds a reproducible Apple Silicon MLX submission for a deeper/narrower SP-1024 configuration.
- Explores a parameter-budget tradeoff by reducing width, increasing depth, and using fewer KV heads.
- Documents a non-record unlimited-compute run under the 16 MB artifact cap.
- Includes exact trainer snapshot, shard list, and training log for reproducibility.
- Uses larger validation batch size and logit chunking to complete validation efficiently on local hardware.