val_bpb: 1.3556
Architecture: Transformer
Optimizer: —
Artifact Size: 14,840,173 bytes
Training Techniques
Architecture
KV head count
Uses fewer KV heads than attention heads in a Transformer-style model (grouped-query attention), so each KV head is shared by a group of query heads.
parameters: {"heads":8,"kv_heads":4}
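With 8 attention heads and 4 KV heads, each KV head serves two query heads. A minimal NumPy sketch of how attention scores are formed under that assumption (the function name and tensor shapes are illustrative, not taken from the submission):

```python
import numpy as np

def gqa_attention_scores(q, k, n_heads=8, n_kv_heads=4):
    """Grouped-query attention scores.

    q: (n_heads, seq, head_dim)    -- one query projection per head
    k: (n_kv_heads, seq, head_dim) -- fewer key projections than query heads
    """
    group = n_heads // n_kv_heads        # query heads per KV head (2 here)
    k_rep = np.repeat(k, group, axis=0)  # share each KV head across its group
    scale = 1.0 / np.sqrt(q.shape[-1])
    return (q @ k_rep.transpose(0, 2, 1)) * scale  # (n_heads, seq, seq)
```

Halving the KV heads halves the KV-cache size and the K/V projection parameters, which helps under a tight artifact-size cap.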
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmup + warmdown
parameters: {"warmup_steps":20,"warmdown_iters":1200}
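The warmup + warmdown schedule can be sketched as a trapezoidal learning-rate multiplier: linear ramp-up over the warmup steps, a flat plateau, then a linear decay to zero over the final warmdown iterations. The function name and total step count below are hypothetical; only `warmup_steps=20` and `warmdown_iters=1200` come from the card:

```python
def lr_scale(step, total_steps, warmup_steps=20, warmdown_iters=1200):
    """Trapezoidal multiplier applied to the base learning rate."""
    if step < warmup_steps:
        # linear warmup from ~0 to 1
        return (step + 1) / warmup_steps
    if step > total_steps - warmdown_iters:
        # linear warmdown from 1 to 0 over the last warmdown_iters steps
        return max(0.0, (total_steps - step) / warmdown_iters)
    return 1.0  # flat plateau in between
```

In PyTorch this kind of schedule is typically wired up via `torch.optim.lr_scheduler.LambdaLR`.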
Other
other
Uses torch.compile to improve throughput, fitting more optimization steps into the fixed wall-clock budget.
parameters: {"enabled":true}
Novel Contributions
- Increased training context length to 2048
- Kept a 9-layer, 512-dimensional Transformer with 8 attention heads and 4 KV heads
- Used moderately reduced learning rates
- Enabled torch.compile for a significant throughput and performance gain
- Demonstrated a strong single-A100 draft run under the 16MB artifact cap