val_bpb: 1.3509
Architecture: Transformer
Optimizer: —
Artifact Size: 14301562 bytes

Training Techniques

Architecture
- tied embeddings: Input and output embeddings are tied to reduce parameter count and artifact size. (parameters: null)
- KV head count: Uses fewer key/value heads than attention heads. (parameters: {"num_heads": 8, "num_kv_heads": 4})
- depth/narrow transformer: Uses a deeper but narrower Transformer layout than the naive baseline. (parameters: {"layers": 12, "model_dim": 416})
Quantization
- int8 (bits: 8, scope: model weights)

Compression
- zlib (level: null)
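
The card lists int8 weight quantization followed by zlib compression of the final artifact, but does not record the scale granularity or the zlib level. The sketch below assumes per-tensor symmetric scales and maximum compression; both are assumptions.

```python
# Hedged sketch of the packaging step: per-tensor symmetric int8 quantization
# of the model weights, then zlib over the serialized blob.
import json
import zlib
import numpy as np

def quantize_int8(state_dict):
    tensors = {}
    for name, w in state_dict.items():
        w = w.detach().cpu().float().numpy()
        scale = max(np.abs(w).max() / 127.0, 1e-12)   # symmetric per-tensor scale
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        tensors[name] = (float(scale), q)
    return tensors

def save_artifact(state_dict, path):
    header, payload = {}, bytearray()
    for name, (scale, q) in quantize_int8(state_dict).items():
        header[name] = {"scale": scale, "shape": list(q.shape), "offset": len(payload)}
        payload += q.tobytes()
    blob = json.dumps(header).encode("utf-8") + b"\0" + bytes(payload)
    with open(path, "wb") as f:
        f.write(zlib.compress(blob, level=9))         # level 9 is an assumption
```

Loading reverses the pipeline: decompress with zlib, split the header off at the first null byte, then reconstruct each tensor as q * scale.
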
Sequence Length
- sequence_length (train_length: 1024, eval_length: null)
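
The training sequence length is 1024 tokens; the evaluation length is not recorded. A minimal sketch of one assumed data-packing step, since the actual pipeline is not described in the card:

```python
# Assumed data packing (not from the submission): slice a flat token stream
# into 1024-token inputs with next-token targets.
import numpy as np

def make_windows(tokens: np.ndarray, seq_len: int = 1024):
    n = (len(tokens) - 1) // seq_len                      # drop the ragged tail
    x = tokens[: n * seq_len].reshape(n, seq_len)
    y = tokens[1 : n * seq_len + 1].reshape(n, seq_len)   # targets shifted by one
    return x, y
```
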
LR Schedule
- warmdown (parameters: {"warmup_steps": 20, "warmdown_iters": 1200})
Other
- 10-minute wallclock-limited training run on 8xH100 GPUs. (parameters: {"max_wallclock_seconds": 600, "num_gpus": 8})
Novel Contributions
- Deeper/narrower Transformer configuration (12 layers, 416 model dim)
- Reduced KV head count (8 attention heads, 4 KV heads)
- Tied input/output embeddings
- 10-minute 8xH100 training run under the 16MB track limit
- Final artifact quantized to int8 and zlib-compressed