PR #746

open

Seq2048 + torch.compile + mid LR (1xA100 draft)

val_bpb
1.3556
Architecture
Transformer
Artifact Size
14,840,173 bytes

Training Techniques

Architecture
KV head count
Grouped-query attention: uses fewer KV heads than query (attention) heads in a Transformer-style model, so groups of query heads share each KV head.
parameters: {"heads":8,"kv_heads":4}
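A minimal pure-Python sketch of the head grouping these parameters imply (illustrative only; the PR's actual attention code is not shown in this summary). With 8 query heads and 4 KV heads, each pair of consecutive query heads shares one KV head:

```python
def kv_head_for(query_head: int, n_heads: int = 8, n_kv_heads: int = 4) -> int:
    """Map a query head index to the KV head it attends with.

    Defaults mirror the card's parameters: {"heads": 8, "kv_heads": 4}.
    """
    group_size = n_heads // n_kv_heads  # 2 query heads per KV head
    return query_head // group_size

# Query heads 0,1 -> KV head 0; 2,3 -> 1; 4,5 -> 2; 6,7 -> 3
mapping = {q: kv_head_for(q) for q in range(8)}
```

Halving the KV heads shrinks the KV cache and the K/V projection parameters, which helps under a tight artifact-size cap.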
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmup + warmdown
parameters: {"warmup_steps":20,"warmdown_iters":1200}
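A minimal sketch of a warmup + warmdown (trapezoidal) schedule using these parameters. The peak learning rate and the total step count are assumptions for illustration; only `warmup_steps` and `warmdown_iters` come from the card:

```python
def lr_multiplier(step: int, total_steps: int,
                  warmup_steps: int = 20, warmdown_iters: int = 1200) -> float:
    """Trapezoidal schedule: linear warmup, flat plateau, linear warmdown.

    Returns a multiplier in (0, 1] applied to a peak learning rate.
    Defaults mirror the card: {"warmup_steps": 20, "warmdown_iters": 1200}.
    """
    if step < warmup_steps:
        return (step + 1) / warmup_steps       # linear ramp up
    if step >= total_steps - warmdown_iters:
        remaining = total_steps - step
        return remaining / warmdown_iters      # linear ramp down to 0
    return 1.0                                 # plateau at the peak LR
```

For a hypothetical 5000-step run, the multiplier ramps to 1.0 over the first 20 steps, holds at 1.0, then decays linearly over the final 1200 steps.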
Other
other
Uses torch.compile to increase training throughput, fitting more optimization steps into the fixed wallclock budget.
parameters: {"enabled":true}
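For reference, enabling torch.compile is a one-line wrapper around the model. A minimal sketch with a placeholder module (not the PR's code); `backend="eager"` is used here only so the snippet runs without a compiler toolchain, whereas a real run would typically use the default inductor backend:

```python
import torch

# Placeholder module standing in for the actual model (an assumption;
# the PR's architecture code is not shown in this summary).
model = torch.nn.Linear(512, 512)

# torch.compile preserves the module's call interface; with the default
# backend it JIT-compiles the forward pass for higher throughput.
# "eager" skips compilation so this sketch runs anywhere.
compiled = torch.compile(model, backend="eager")

x = torch.randn(4, 512)
y = compiled(x)  # same numerics as model(x)
```

Compilation cost is paid once at warmup, so the throughput gain compounds over a long fixed-wallclock run.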

Novel Contributions

  • Increased training context length to 2048
  • Kept a 9-layer, 512-dimensional Transformer with 8 attention heads and 4 KV heads
  • Used moderately reduced learning rates
  • Enabled torch.compile for a significant throughput gain within the fixed wallclock budget
  • Demonstrated a strong single-A100 draft run under the 16MB artifact cap