PR #1748

open

basic submission improving baseline

by elad-simbalista
val_bpb: 1.2098
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,872,012 bytes

Training Techniques

Sequence Length (sequence_length)
  train_length: 2048
  eval_length: null
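
A minimal sketch of how a flat token stream might be packed into 2048-token training sequences; the loader and names below are illustrative assumptions, not the submission's actual pipeline.

    import numpy as np

    train_length = 2048  # from the submission config; eval_length is unspecified (null)

    def make_batches(tokens: np.ndarray, seq_len: int = train_length, batch_size: int = 8):
        """Yield (inputs, targets) pairs of shape (batch_size, seq_len) for next-token prediction."""
        tokens_per_batch = batch_size * seq_len
        n_batches = (len(tokens) - 1) // tokens_per_batch
        for i in range(n_batches):
            chunk = tokens[i * tokens_per_batch : (i + 1) * tokens_per_batch + 1]
            yield chunk[:-1].reshape(batch_size, seq_len), chunk[1:].reshape(batch_size, seq_len)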

Optimizer: Muon
  weight_decay: null
  momentum: 0.985
  other_params: {"warmup_from":0.9,"warmup_steps":500}
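
A minimal sketch of the Muon momentum warmup, assuming a linear ramp from 0.9 to the final 0.985 over the first 500 steps and an optimizer whose param groups expose a "momentum" entry; the exact ramp shape is not stated in the card.

    def momentum_at(step: int, warmup_from: float = 0.9, target: float = 0.985,
                    warmup_steps: int = 500) -> float:
        """Linearly ramp momentum from warmup_from to target over warmup_steps."""
        if step >= warmup_steps:
            return target
        return warmup_from + (step / warmup_steps) * (target - warmup_from)

    # Inside the training loop (assumed optimizer interface):
    # for group in muon_optimizer.param_groups:
    #     group["momentum"] = momentum_at(step)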

LR Schedule: warmdown
  parameters: {"warmdown_steps":3000}
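
A minimal sketch of the warmdown schedule, assuming the learning rate is held constant and then decayed linearly to zero over the final 3000 steps; the precise decay shape is an assumption.

    def lr_scale(step: int, total_steps: int, warmdown_steps: int = 3000) -> float:
        """Multiplier applied to the base learning rate at a given step."""
        warmdown_start = total_steps - warmdown_steps
        if step < warmdown_start:
            return 1.0
        return max(0.0, (total_steps - step) / warmdown_steps)

    # usage: lr = base_lr * lr_scale(step, total_steps)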

Weight Averaging: EMA
  parameters: {"decay":0.997}
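
A minimal sketch of EMA weight averaging with decay 0.997, assuming a plain per-parameter exponential moving average whose weights are used for evaluation and for the exported artifact.

    import copy
    import torch

    @torch.no_grad()
    def update_ema(ema_model: torch.nn.Module, model: torch.nn.Module, decay: float = 0.997):
        """ema <- decay * ema + (1 - decay) * current weights."""
        for ema_p, p in zip(ema_model.parameters(), model.parameters()):
            ema_p.mul_(decay).add_(p, alpha=1.0 - decay)

    # setup: ema_model = copy.deepcopy(model)
    # call update_ema(ema_model, model) after every optimizer step; evaluate with ema_model.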

Quantization: GPTQ-lite
  bits: 8
  scope: per-row
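
A minimal sketch of per-row int8 weight quantization, assuming symmetric absmax scaling per output row; whatever error compensation "GPTQ-lite" adds on top is not spelled out in the card.

    import torch

    def quantize_per_row_int8(w: torch.Tensor):
        """w: (out_features, in_features) float weights -> int8 weights plus one scale per row."""
        scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
        q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
        return q, scale.squeeze(1)

    def dequantize_per_row(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
        """Recover an approximate float weight matrix for inference."""
        return q.float() * scale.unsqueeze(1)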

Architecture: weight tying
  Tied input and output embeddings.
  parameters: null
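
A minimal sketch of tied input and output embeddings in PyTorch; the module names are illustrative. Sharing one matrix removes the separate output projection from the artifact.

    import torch.nn as nn

    class TinyLM(nn.Module):
        def __init__(self, vocab_size: int, d_model: int):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, d_model)
            self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
            self.lm_head.weight = self.embed.weight  # weight tying: one shared (vocab, d_model) matrix

        def forward(self, idx):
            h = self.embed(idx)   # ...transformer blocks would go here...
            return self.lm_head(h)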

Architecture: KV head count
  Used grouped-query-style attention with fewer KV heads than query heads.
  parameters: {"num_heads":8,"num_kv_heads":4}
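
A minimal sketch of grouped-query attention with 8 query heads and 4 KV heads, assuming each KV head is shared by two query heads and expanded with repeat_interleave before standard scaled dot-product attention.

    import torch.nn.functional as F

    def gqa_attention(q, k, v, num_heads: int = 8, num_kv_heads: int = 4):
        """q: (B, num_heads, T, d); k, v: (B, num_kv_heads, T, d)."""
        groups = num_heads // num_kv_heads          # 2 query heads per KV head
        k = k.repeat_interleave(groups, dim=1)      # -> (B, num_heads, T, d)
        v = v.repeat_interleave(groups, dim=1)
        return F.scaled_dot_product_attention(q, k, v, is_causal=True)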

Novel Contributions

  • Longer training context length
  • Muon momentum warmup
  • Extended warmdown schedule
  • EMA weight averaging
  • Per-row GPTQ-lite int8 quantization
  • Wallclock-aware training schedule
  • Tied embeddings
  • Reduced KV head count