- val_bpb: 1.8440
- Architecture: Transformer
- Optimizer: —
- Artifact Size: 12,388,989 bytes
Training Techniques
- Architecture (KV head count): Uses a deeper, narrower SP-1024 Transformer with reduced KV sharing via 2 KV heads. Parameters: `{"layers":14,"model_dim":416,"num_heads":8,"num_kv_heads":2,"mlp_mult":2}`
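A rough weight-count for this configuration can be sketched as below. The MLP shape (plain up/down projection rather than a gated variant), tied input/output embeddings, a vocab size of 1024 inferred from "SP-1024", and the omission of norm parameters and biases are all assumptions, not confirmed by the card:

```python
# Hedged parameter-count sketch for the listed config; MLP structure,
# embedding tying, and vocab_size=1024 (from "SP-1024") are assumptions.
def param_count(layers=14, model_dim=416, num_heads=8, num_kv_heads=2,
                mlp_mult=2, vocab_size=1024):
    head_dim = model_dim // num_heads              # 52
    kv_dim = num_kv_heads * head_dim               # 104: reduced K/V width
    attn = 2 * model_dim * model_dim               # Q and output projections
    attn += 2 * model_dim * kv_dim                 # narrower K and V projections
    mlp = 2 * model_dim * (mlp_mult * model_dim)   # up + down, assumed ungated
    embed = vocab_size * model_dim                 # assumed tied embedding
    # norm scales and biases omitted as negligible at this scale
    return layers * (attn + mlp) + embed

print(param_count())
```

Under these assumptions the count lands around 16.17M weights, i.e. roughly 16.17 MB at one int8 byte per weight, which zlib would then have to shrink under the 16 MB artifact cap; this is consistent with the 12,388,989-byte artifact, but the exact breakdown is a guess.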
- Quantization (int8): bits: 8; scope: all.
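The card only says int8 over all weights; a minimal sketch of one plausible scheme (symmetric, per-tensor, round-to-nearest) looks like this. Per-channel scaling or a different rounding rule may well be what the submission actually uses:

```python
import numpy as np

# Symmetric per-tensor int8 quantization sketch; the submission's exact
# scheme (per-tensor vs per-channel, rounding mode) is not specified.
def quantize_int8(w: np.ndarray):
    scale = float(np.abs(w).max()) / 127.0
    scale = scale if scale > 0.0 else 1.0          # guard all-zero tensors
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(416, 416).astype(np.float32)
q, s = quantize_int8(w)
err = float(np.abs(dequantize(q, s) - w).max())    # bounded by scale / 2
```

With round-to-nearest, the worst-case reconstruction error per weight is half a quantization step, i.e. `scale / 2`.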
- Compression (zlib): level: null.
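The packaging step can be sketched with the standard-library `zlib` module; `level: null` is read here as zlib's default compression level, which is an assumption, and the zero-filled array standing in for the quantized weights is purely illustrative:

```python
import zlib
import numpy as np

# Sketch of the artifact packaging step: serialize int8 weights, then
# zlib-compress them. level: null is assumed to mean zlib's default level.
q = np.zeros((416, 416), dtype=np.int8)            # stand-in for quantized weights
blob = zlib.compress(q.tobytes())                  # default compression level

# Loading reverses the pipeline: decompress, then reinterpret the bytes.
restored = np.frombuffer(zlib.decompress(blob), dtype=np.int8).reshape(416, 416)
```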
- Evaluation (logit chunking): Parameters: `{"logit_chunk_tokens":65536}`
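Logit chunking keeps the full `[tokens, vocab]` logit matrix from ever being materialized at once: the output projection and loss are computed 65,536 tokens at a time, bounding peak memory during validation. A numpy sketch of the idea, with illustrative names (`hidden`, `W_out`) not taken from the submission code:

```python
import numpy as np

# Sketch of chunked validation loss: project and reduce the logits one
# chunk of tokens at a time instead of building the full [T, V] matrix.
def chunked_nll(hidden, W_out, targets, chunk_tokens=65536):
    total, n = 0.0, hidden.shape[0]
    for start in range(0, n, chunk_tokens):
        h = hidden[start:start + chunk_tokens]         # [chunk, model_dim]
        logits = h @ W_out                             # [chunk, vocab]
        logits -= logits.max(axis=1, keepdims=True)    # numerical stability
        logsumexp = np.log(np.exp(logits).sum(axis=1))
        t = targets[start:start + chunk_tokens]
        total += (logsumexp - logits[np.arange(len(t)), t]).sum()
    return total / n                                   # mean NLL in nats
```

The chunked result is exactly equal to the unchunked loss; only the peak-memory profile changes, which is what makes full validation tractable on a single machine.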
- Other: Increased validation batch size to make full validation tractable on local Apple Silicon hardware. Parameters: `{"val_batch_size":8388608}`
- Sequence Length: train_length: 16384; eval_length: null.
- LR Schedule (warmup): Parameters: `{"warmup_steps":10}`
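Only the warmup length is given; a minimal sketch, assuming linear warmup from zero to the base rate over the 10 steps, with the post-warmup schedule (constant, cosine, etc.) left unspecified by the card:

```python
# Linear warmup sketch: ramp the LR to base_lr over warmup_steps, then
# hold it. The actual post-warmup schedule is not stated in the card.
def lr_at(step: int, base_lr: float, warmup_steps: int = 10) -> float:
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr
```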
Novel Contributions
- Adds a reproducible Apple Silicon MLX submission for a deeper/narrower SP-1024 configuration.
- Explores a parameter-budget tradeoff by reducing width, increasing depth, and using fewer KV heads.
- Documents a non-record unlimited-compute run under the 16 MB artifact cap.
- Includes exact trainer snapshot, shard list, and training log for reproducibility.
- Uses larger validation batch size and logit chunking to complete validation efficiently on local hardware.