PR #858

open

11L 512d Int8+Zlib Baseline (val_bpb 1.2135, 3-seed)

by nickferrantelive
val_bpb: 1.2135
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.54 MB

Training Techniques

Architecture
depth
Increased transformer depth from the default 9 layers to 11 layers.
parameters: {"layers":11}
tied embeddings
Input and output embeddings are tied.
parameters: null
KV head count
Uses grouped-query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
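The grouped-query attention entry above (8 query heads sharing 4 KV heads) can be sketched as follows. This is a minimal toy illustration, not the submission's code; the function name, shapes, and single-example layout are assumptions.

```python
import numpy as np

def grouped_query_attention(q, k, v):
    """Toy GQA for one sequence.

    q: (n_heads, T, d) query activations.
    k, v: (n_kv_heads, T, d) key/value activations.
    With 8 query heads and 4 KV heads, each KV head is shared by 2 query heads.
    """
    n_heads, _, d = q.shape
    n_kv_heads = k.shape[0]
    group = n_heads // n_kv_heads          # query heads per KV head (2 here)
    k = np.repeat(k, group, axis=0)        # broadcast KV heads -> (n_heads, T, d)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)   # (n_heads, T, T)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w = w / w.sum(-1, keepdims=True)       # softmax over keys
    return w @ v                           # (n_heads, T, d)
```

Sharing KV heads this way cuts the KV cache (and KV projection parameters) in half relative to full multi-head attention at these settings.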
Optimizer
Muon
weight_decay: null
momentum: 0.95
other_params: {"lr":0.04,"warmup_momentum_start":0.85,"warmup_steps":500}
AdamW
weight_decay: null
momentum: null
other_params: {"scope":"embeddings","lr":0.05}
AdamW
weight_decay: null
momentum: null
other_params: {"scope":"scalars","lr":0.04}
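The Muon config above specifies a momentum warmup from 0.85 to 0.95 over 500 steps. A plausible sketch of that schedule, assuming a linear ramp (the interpolation shape is not stated in the PR):

```python
def muon_momentum(step, warmup_steps=500, start=0.85, end=0.95):
    """Momentum warmup for Muon; endpoints and step count are from the
    config above, but the linear form is an assumption."""
    if step >= warmup_steps:
        return end
    return start + (step / warmup_steps) * (end - start)
```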
Compression
zlib
level: null
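A minimal sketch of how int8 quantization plus zlib could produce an artifact of this kind: symmetric per-tensor quantization to int8, then zlib over the raw bytes. The PR does not show the actual packing code, so the function names, per-tensor scaling, and round-to-nearest scheme here are all assumptions.

```python
import zlib
import numpy as np

def pack_weights(w, level=9):
    """Quantize a float tensor to int8 (symmetric, per-tensor scale),
    then zlib-compress the bytes. Hypothetical sketch of the artifact
    packing, not the submission's actual format."""
    scale = float(np.abs(w).max()) / 127.0 if w.size else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return zlib.compress(q.tobytes(), level), scale

def unpack_weights(blob, scale, shape):
    """Inverse: decompress, reinterpret as int8, rescale to float32."""
    q = np.frombuffer(zlib.decompress(blob), dtype=np.int8).reshape(shape)
    return q.astype(np.float32) * scale
```

With round-to-nearest, the reconstruction error per weight is bounded by half the scale, which is what keeps a 512d/11L model usable after compression to 15.54 MB.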
LR Schedule
warmup + warmdown
parameters: {"warmup_steps":20,"warmdown_iterations":1200}
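The schedule parameters above (20 warmup steps, 1200 warmdown iterations) suggest a trapezoidal shape: linear warmup, constant plateau, linear decay. A sketch under those assumptions; the base LR is taken from the Muon config, and the linear ramps and decay-to-zero endpoint are guesses, not confirmed by the PR:

```python
def lr_schedule(step, total_steps, base_lr=0.04,
                warmup_steps=20, warmdown_steps=1200):
    """Trapezoidal LR: ramp up over warmup_steps, hold at base_lr,
    ramp down over the final warmdown_steps. Shape is an assumption."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    steps_left = total_steps - step
    if steps_left < warmdown_steps:
        return base_lr * steps_left / warmdown_steps
    return base_lr
```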
Sequence Length
sequence_length
train_length: 1024
eval_length: null
Regularization
gradient clipping
parameters: {"clip_norm":0.3}
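Gradient clipping with clip_norm 0.3 is presumably global-norm clipping over all parameter gradients; a sketch of that standard operation (the global-norm variant, as opposed to per-parameter clipping, is an assumption):

```python
import numpy as np

def clip_grad_norm(grads, clip_norm=0.3):
    """Scale all gradients so their combined L2 norm is at most clip_norm
    (the value from the config above). Returns scaled grads and the
    pre-clip norm."""
    total = float(np.sqrt(sum((g ** 2).sum() for g in grads)))
    if total > clip_norm:
        grads = [g * (clip_norm / total) for g in grads]
    return grads, total
```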

Novel Contributions

  • Scaled the baseline model from 9 to 11 transformer layers.
  • Demonstrated a stock baseline architecture that fits under the 16 MB artifact cap using int8 quantization and zlib compression.
  • Reported 3-seed results with low variance on 8xH100 SXM hardware.