PR #858

open

11L 512d Int8+Zlib Baseline (val_bpb 1.2135, 3-seed)

by nickferrantelive
val_bpb: 1.2135
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.54 MB

Training Techniques

Architecture
depth
Increased transformer depth from the default 9 layers to 11 layers.
parameters: {"layers":11}
tied embeddings
Input and output embeddings are tied.
parameters: null
KV head count
Uses grouped-query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
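The grouped-query attention entry above (8 query heads sharing 4 KV heads) can be sketched as follows. This is a minimal toy illustration, not the submission's code; the function name, shapes, and single-example layout are assumptions.

```python
import numpy as np

def grouped_query_attention(q, k, v):
    """Toy GQA for one sequence.

    q: (n_heads, T, d) query activations.
    k, v: (n_kv_heads, T, d) key/value activations.
    With 8 query heads and 4 KV heads, each KV head is shared by 2 query heads.
    """
    n_heads, _, d = q.shape
    n_kv_heads = k.shape[0]
    group = n_heads // n_kv_heads          # query heads per KV head (2 here)
    k = np.repeat(k, group, axis=0)        # broadcast KV heads -> (n_heads, T, d)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)   # (n_heads, T, T)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w = w / w.sum(-1, keepdims=True)       # softmax over keys
    return w @ v                           # (n_heads, T, d)
```

Sharing KV heads this way cuts the KV cache (and KV projection parameters) in half relative to full multi-head attention at these settings.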
Optimizer
Muon
weight_decay: null
momentum: 0.95
other_params: {"lr":0.04,"warmup_momentum_start":0.85,"warmup_steps":500}
AdamW
weight_decay: null
momentum: null
other_params: {"scope":"embeddings","lr":0.05}
AdamW
weight_decay: null
momentum: null
other_params: {"scope":"scalars","lr":0.04}
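The Muon config above specifies a momentum warmup from 0.85 to 0.95 over 500 steps. A plausible sketch of that schedule, assuming a linear ramp (the interpolation shape is not stated in the PR):

```python
def muon_momentum(step, warmup_steps=500, start=0.85, end=0.95):
    """Momentum warmup for Muon; endpoints and step count are from the
    config above, but the linear form is an assumption."""
    if step >= warmup_steps:
        return end
    return start + (step / warmup_steps) * (end - start)
```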
Compression
zlib
level: null
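A minimal sketch of how int8 quantization plus zlib could produce an artifact of this kind: symmetric per-tensor quantization to int8, then zlib over the raw bytes. The PR does not show the actual packing code, so the function names, per-tensor scaling, and round-to-nearest scheme here are all assumptions.

```python
import zlib
import numpy as np

def pack_weights(w, level=9):
    """Quantize a float tensor to int8 (symmetric, per-tensor scale),
    then zlib-compress the bytes. Hypothetical sketch of the artifact
    packing, not the submission's actual format."""
    scale = float(np.abs(w).max()) / 127.0 if w.size else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return zlib.compress(q.tobytes(), level), scale

def unpack_weights(blob, scale, shape):
    """Inverse: decompress, reinterpret as int8, rescale to float32."""
    q = np.frombuffer(zlib.decompress(blob), dtype=np.int8).reshape(shape)
    return q.astype(np.float32) * scale
```

With round-to-nearest, the reconstruction error per weight is bounded by half the scale, which is what keeps a 512d/11L model usable after compression to 15.54 MB.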
LR Schedule
warmup + warmdown
parameters: {"warmup_steps":20,"warmdown_iterations":1200}
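The schedule parameters above (20 warmup steps, 1200 warmdown iterations) suggest a trapezoidal shape: linear warmup, constant plateau, linear decay. A sketch under those assumptions; the base LR is taken from the Muon config, and the linear ramps and decay-to-zero endpoint are guesses, not confirmed by the PR:

```python
def lr_schedule(step, total_steps, base_lr=0.04,
                warmup_steps=20, warmdown_steps=1200):
    """Trapezoidal LR: ramp up over warmup_steps, hold at base_lr,
    ramp down over the final warmdown_steps. Shape is an assumption."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    steps_left = total_steps - step
    if steps_left < warmdown_steps:
        return base_lr * steps_left / warmdown_steps
    return base_lr
```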
Sequence Length
sequence_length
train_length: 1024
eval_length: null
Regularization
gradient clipping
parameters: {"clip_norm":0.3}
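Gradient clipping with clip_norm 0.3 is presumably global-norm clipping over all parameter gradients; a sketch of that standard operation (the global-norm variant, as opposed to per-parameter clipping, is an assumption):

```python
import numpy as np

def clip_grad_norm(grads, clip_norm=0.3):
    """Scale all gradients so their combined L2 norm is at most clip_norm
    (the value from the config above). Returns scaled grads and the
    pre-clip norm."""
    total = float(np.sqrt(sum((g ** 2).sum() for g in grads)))
    if total > clip_norm:
        grads = [g * (clip_norm / total) for g in grads]
    return grads, total
```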

Novel Contributions

  • Scaled the baseline model from 9 to 11 transformer layers.
  • Demonstrated a stock baseline architecture that fits under the 16 MB artifact cap using int8 quantization and zlib compression.
  • Reported 3-seed results with low variance on 8xH100 SXM hardware.