PR #46

closed

Optimized SOTA Submission: 1.2697 bpb

val_bpb: 1.2697
Architecture: Transformer
Optimizer: Muon
Artifact Size: 11.0 MB

Training Techniques

Quantization: int8
  bits: 8
  scope: all
Architecture: KV head count
  A 9-layer, 432-dim Transformer with grouped-query attention (GQA): 8 query heads share 2 KV heads, improving parameter efficiency.
  parameters: {"layers":9,"dim":432,"heads":8,"kv_heads":2,"mlp_mult":2}
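As a sketch of how the 8 query heads can share 2 KV heads, a minimal numpy grouped-query attention (shapes and names here are illustrative, not the submission's actual code):

```python
import numpy as np

def gqa_attention(q, k, v):
    """Grouped-query attention: each KV head serves a group of query heads.

    q: (n_heads, T, d)    k, v: (n_kv_heads, T, d)
    """
    n_heads, T, d = q.shape
    n_kv = k.shape[0]
    group = n_heads // n_kv                        # 8 // 2 = 4 query heads per KV head
    k = np.repeat(k, group, axis=0)                # broadcast KV heads across query heads
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v                                   # (n_heads, T, d)

rng = np.random.default_rng(0)
T, d = 16, 54                                      # 432 dim / 8 heads = 54 per head
q = rng.standard_normal((8, T, d))
k = rng.standard_normal((2, T, d))
v = rng.standard_normal((2, T, d))
out = gqa_attention(q, k, v)
print(out.shape)  # (8, 16, 54)
```

With 2 KV heads instead of 8, the K and V projection matrices shrink by 4x, which is where the parameter savings come from.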
Optimizer: Muon
  weight_decay: null
  momentum: 0.9
  other_params: {"beta1":0.85,"beta2":0.98,"grad_clip":1}
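Muon's core step orthogonalizes the momentum-smoothed gradient with a Newton-Schulz iteration before applying it. A minimal sketch, assuming the quintic coefficients from the public Muon reference (the learning rate below is a placeholder, not the submission's value):

```python
import numpy as np

def newton_schulz_orth(G, steps=5):
    """Approximately orthogonalize G: push its singular values toward 1."""
    a, b, c = 3.4445, -4.7750, 2.0315       # quintic coefficients from the Muon reference
    X = G / (np.linalg.norm(G) + 1e-7)      # Frobenius-normalize so singular values <= 1
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X                   # X <- a*X + b*(XX^T)X + c*(XX^T)^2 X
    return X

def muon_step(W, grad, mom, lr=0.02, beta=0.9):
    """One hypothetical Muon update: momentum, orthogonalize, step."""
    mom = beta * mom + grad
    W = W - lr * newton_schulz_orth(mom)
    return W, mom

G = np.diag([0.9, 0.5, 0.3])                # known singular values, for illustration
X = newton_schulz_orth(G)
print(np.linalg.svd(X, compute_uv=False))   # all pushed toward 1
```

The iteration is matmul-only, so it runs efficiently on accelerators; five steps are typically enough to land singular values near 1.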
Compression: zlib
  level: null
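The artifact pipeline (int8 quantization followed by zlib) can be sketched as below; symmetric per-tensor scaling and compression level 9 are assumptions, the submission may pack weights differently:

```python
import zlib
import numpy as np

def pack(w):
    """Symmetric per-tensor int8 quantization, then zlib compression."""
    scale = float(np.abs(w).max()) / 127.0
    q = np.round(w / scale).astype(np.int8)
    return zlib.compress(q.tobytes(), level=9), scale, w.shape

def unpack(blob, scale, shape):
    """Decompress and dequantize back to float32."""
    q = np.frombuffer(zlib.decompress(blob), dtype=np.int8)
    return q.reshape(shape).astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((432, 432)).astype(np.float32)
blob, scale, shape = pack(w)
w_hat = unpack(blob, scale, shape)
print(len(blob), w.nbytes)  # compressed size vs. 746496 raw float32 bytes
```

Rounding to int8 bounds the per-weight reconstruction error by scale/2, and the int8 payload alone is already 4x smaller than float32 before zlib touches it.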
Sequence Length
  train_length: 1024
  eval_length: null
LR Schedule: linear warmup
  parameters: {"warmup_steps":100}
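Linear warmup over the first 100 steps amounts to the schedule below; the base learning rate is a placeholder, since the submission does not list it, and no decay phase is listed either:

```python
def warmup_lr(step, base_lr, warmup_steps=100):
    """Linear warmup: ramp from 0 to base_lr, then hold (no decay listed)."""
    return base_lr * min(1.0, step / warmup_steps)

print(warmup_lr(50, 0.01))   # 0.005
print(warmup_lr(100, 0.01))  # 0.01
print(warmup_lr(500, 0.01))  # 0.01
```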
Regularization: gradient clipping
  parameters: {"norm":1}
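Clipping to norm 1 rescales the whole gradient whenever its global L2 norm exceeds the threshold; a minimal numpy sketch of global-norm clipping (the submission may clip per-parameter instead):

```python
import numpy as np

def clip_grad_norm(grads, max_norm=1.0):
    """Scale a list of gradient arrays so their global L2 norm <= max_norm."""
    total = float(np.sqrt(sum(np.sum(g * g) for g in grads)))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads], total

grads = [np.array([3.0, 0.0]), np.array([4.0])]  # global norm = 5
clipped, norm = clip_grad_norm(grads)
print(norm)  # 5.0
```

Because all arrays share one scale factor, the gradient's direction is preserved; only its magnitude is capped.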
Other
  Large-batch training with systematic hyperparameter tuning and full utilization of the 10-minute wallclock budget.
  parameters: {"train_batch_tokens":786432,"max_wallclock_seconds":600,"experiments":8}

Novel Contributions

  • Systematic optimization campaign from 1.42 to 1.27 bpb
  • 9x432 Transformer with efficient GQA and 2 KV heads
  • Large-batch training with conservative learning rates
  • Full utilization of the 10-minute training budget
  • int8 + zlib compressed submission artifact