PR #746

open

Seq2048 + torch.compile + mid LR (1xA100 draft)

val_bpb
1.3556
Architecture
Transformer
Artifact Size
14,840,173 bytes

Training Techniques

Architecture
KV head count
Grouped-query attention: uses fewer KV heads than query (attention) heads in a Transformer-style model, so groups of query heads share each KV head.
parameters: {"heads":8,"kv_heads":4}
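A minimal pure-Python sketch of the head grouping these parameters imply (illustrative only; the PR's actual attention code is not shown in this summary). With 8 query heads and 4 KV heads, each pair of consecutive query heads shares one KV head:

```python
def kv_head_for(query_head: int, n_heads: int = 8, n_kv_heads: int = 4) -> int:
    """Map a query head index to the KV head it attends with.

    Defaults mirror the card's parameters: {"heads": 8, "kv_heads": 4}.
    """
    group_size = n_heads // n_kv_heads  # 2 query heads per KV head
    return query_head // group_size

# Query heads 0,1 -> KV head 0; 2,3 -> 1; 4,5 -> 2; 6,7 -> 3
mapping = {q: kv_head_for(q) for q in range(8)}
```

Halving the KV heads shrinks the KV cache and the K/V projection parameters, which helps under a tight artifact-size cap.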
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmup + warmdown
parameters: {"warmup_steps":20,"warmdown_iters":1200}
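A minimal sketch of a warmup + warmdown (trapezoidal) schedule using these parameters. The peak learning rate and the total step count are assumptions for illustration; only `warmup_steps` and `warmdown_iters` come from the card:

```python
def lr_multiplier(step: int, total_steps: int,
                  warmup_steps: int = 20, warmdown_iters: int = 1200) -> float:
    """Trapezoidal schedule: linear warmup, flat plateau, linear warmdown.

    Returns a multiplier in (0, 1] applied to a peak learning rate.
    Defaults mirror the card: {"warmup_steps": 20, "warmdown_iters": 1200}.
    """
    if step < warmup_steps:
        return (step + 1) / warmup_steps       # linear ramp up
    if step >= total_steps - warmdown_iters:
        remaining = total_steps - step
        return remaining / warmdown_iters      # linear ramp down to 0
    return 1.0                                 # plateau at the peak LR
```

For a hypothetical 5000-step run, the multiplier ramps to 1.0 over the first 20 steps, holds at 1.0, then decays linearly over the final 1200 steps.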
Other
other
Uses torch.compile to increase training throughput, fitting more optimization steps into the fixed wallclock budget.
parameters: {"enabled":true}
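For reference, enabling torch.compile is a one-line wrapper around the model. A minimal sketch with a placeholder module (not the PR's code); `backend="eager"` is used here only so the snippet runs without a compiler toolchain, whereas a real run would typically use the default inductor backend:

```python
import torch

# Placeholder module standing in for the actual model (an assumption;
# the PR's architecture code is not shown in this summary).
model = torch.nn.Linear(512, 512)

# torch.compile preserves the module's call interface; with the default
# backend it JIT-compiles the forward pass for higher throughput.
# "eager" skips compilation so this sketch runs anywhere.
compiled = torch.compile(model, backend="eager")

x = torch.randn(4, 512)
y = compiled(x)  # same numerics as model(x)
```

Compilation cost is paid once at warmup, so the throughput gain compounds over a long fixed-wallclock run.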

Novel Contributions

  • Increased training context length to 2048
  • Kept a 9-layer, 512-dimensional Transformer with 8 attention heads and 4 KV heads
  • Used moderately reduced learning rates
  • Enabled torch.compile for a significant throughput gain within the fixed wallclock budget
  • Demonstrated a strong single-A100 draft run under the 16MB artifact cap