| Metric | Value |
|---|---|
| val_bpb | 1.3693 |
| Architecture | Transformer |
| Optimizer | — |
| Artifact Size | 9,668,102 bytes |
Training Techniques

Architecture
- tied embeddings: uses tied input/output embeddings to reduce artifact size (parameters: null)
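To make the size saving concrete, here is a back-of-the-envelope sketch of the parameters removed by tying the input embedding and output head. The vocabulary size is an illustrative assumption (the source does not state it); model_dim = 384 comes from the depth/width configuration in this card.

```python
# Back-of-the-envelope saving from tied input/output embeddings.
# vocab_size is an assumed, illustrative value, not from the source.
vocab_size = 32768
model_dim = 384

untied = 2 * vocab_size * model_dim   # separate input embedding + output head
tied = vocab_size * model_dim         # one shared matrix
saved_params = untied - tied          # parameters removed from the artifact

print(saved_params)
```

The saving is exactly one vocab_size × model_dim matrix, which is why tying is attractive under a hard artifact-size cap.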
- depth/width tradeoff: uses a compact Transformer with reduced width and increased depth to improve the compression/quality tradeoff under the size cap (parameters: {"layers": 12, "model_dim": 384, "num_heads": 6, "num_kv_heads": 3, "mlp_mult": 2})
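As a rough check on how this configuration fits under the cap, the per-layer weight count can be sketched from the parameters above. This assumes a standard block with grouped-query attention (K/V projections sized by num_kv_heads) and a two-matrix MLP; norms, biases, and embeddings are ignored, and the actual block layout is not stated in the source.

```python
# Approximate weight count for the 12-layer, 384-dim config above.
# Assumes GQA-style K/V projections and separate up/down MLP matrices;
# norms, biases, and embeddings are ignored.
layers, d, heads, kv_heads, mlp_mult = 12, 384, 6, 3, 2
head_dim = d // heads                     # 64
kv_dim = kv_heads * head_dim              # 192: K/V are half-width under GQA
attn = d * d + 2 * (d * kv_dim) + d * d   # Q, K, V, O projections
mlp = 2 * d * (mlp_mult * d)              # up and down projections
per_layer = attn + mlp

print(per_layer, layers * per_layer)
```

Roughly 12.4M core weights is in the right ballpark for a 9,668,102-byte artifact once quantization and zlib are applied, though the exact serialization is not described in the source.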
Sequence Length
- train_length: 1024
- eval_length: null
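A train_length of 1024 typically means the token stream is sliced into fixed windows. A minimal sketch, where the token stream and the drop-last slicing policy are illustrative assumptions rather than the author's pipeline:

```python
# Slice a flat token stream into non-overlapping training windows.
# The token stream below is a stand-in; dropping the final partial
# window is an assumption, not stated in the source.
train_length = 1024
tokens = list(range(5000))  # illustrative token ids

windows = [
    tokens[i:i + train_length]
    for i in range(0, len(tokens) - train_length + 1, train_length)
]

print(len(windows), len(windows[0]))
```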
Compression
- zlib (level: null)
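Compressing the serialized artifact with zlib amounts to a lossless compress/decompress round trip. A minimal sketch; the payload is a stand-in for serialized weights, and level 9 is an assumption since the source leaves the level null:

```python
import zlib

# Stand-in bytes for a serialized model; the real artifact layout is
# not described in the source.
raw = bytes(range(256)) * 1024

compressed = zlib.compress(raw, 9)  # level 9 is an assumption (source: null)
restored = zlib.decompress(compressed)

assert restored == raw  # round trip is lossless
print(len(raw), len(compressed))
```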
Novel Contributions
- Compact 12-layer, 384-dim Transformer configuration trained under a 10-minute wallclock budget on 1× H100
- Reduced width with added depth to explore the size/quality tradeoff under the 16 MB artifact cap
- Tied input/output embeddings to reduce serialized model size
- A non-record, negative-result data point comparing artifact size against a stronger baseline