val_bpb: 1.3043
Architecture: Transformer
Optimizer: Muon
Artifact Size: —
Training Techniques
Architecture
- Wider-shallower Transformer: a 4-layer, 768-dimensional model with grouped-query attention, improving performance at matched wall-clock time. Parameters: {"layers": 4, "dimensions": 768, "heads": 12, "kv_heads": 4}
- KV head count: grouped-query attention with fewer KV heads than attention heads. Parameters: {"heads": 12, "kv_heads": 4}
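The grouped-query attention above (12 query heads sharing 4 KV heads) can be sketched as follows. This is a minimal illustration, not the run's actual code: shapes, the group-assignment convention (consecutive query heads share one KV head), and the toy dimensions are assumptions.

```python
import numpy as np

def gqa_attention(q, k, v, n_heads=12, n_kv_heads=4):
    """Grouped-query attention sketch: n_heads query heads share n_kv_heads
    K/V heads. q: (n_heads, T, d); k, v: (n_kv_heads, T, d)."""
    group = n_heads // n_kv_heads          # query heads per KV head (3 here)
    k = np.repeat(k, group, axis=0)        # broadcast each KV head to its group
    v = np.repeat(v, group, axis=0)
    d = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerically stable softmax
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v                           # (n_heads, T, d)

# Toy shapes: 12 query heads, 4 KV heads, sequence length 5, head dim 8.
rng = np.random.default_rng(0)
q = rng.standard_normal((12, 5, 8))
k = rng.standard_normal((4, 5, 8))
v = rng.standard_normal((4, 5, 8))
out = gqa_attention(q, k, v)
print(out.shape)  # (12, 5, 8)
```

Sharing KV heads shrinks the KV cache by a factor of heads/kv_heads (3x here) at a small quality cost.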
Quantization
- QAT: 8-bit; scope: model weights
- STE QAT (straight-through estimator): 8-bit; scope: model weights
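A minimal sketch of int8 fake quantization as used in STE quantization-aware training. The symmetric per-tensor scaling scheme here is an assumption; the run's exact quantizer is not specified in this entry.

```python
import numpy as np

def fake_quant_int8(w):
    """Symmetric per-tensor int8 fake quantization (quantize then dequantize).
    Assumed scheme for illustration only."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127)
    return q * scale

# Straight-through estimator: the forward pass uses the quantized weights,
# while the backward pass treats quantization as the identity so gradients
# flow to the underlying fp weights. In PyTorch this is commonly written as:
#   w_q = w + (fake_quant_int8(w) - w).detach()

w = np.random.default_rng(0).standard_normal(1000).astype(np.float32)
w_q = fake_quant_int8(w)
err = np.abs(w - w_q).max()
# Rounding error is at most half a quantization step (scale / 2).
print(err <= np.abs(w).max() / 254 + 1e-6)
```

Training against the quantized forward pass is what closes the int8 evaluation gap cited in the contributions below.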
Optimizer
- Muon: lr 0.06, grad_clip 0.5, beta2 0.99; weight_decay and momentum unspecified
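For context, one Muon step can be sketched as below: clip the gradient, accumulate momentum, then approximately orthogonalize the update with the quintic Newton-Schulz iteration from the public Muon reference implementation. The lr and grad_clip match this entry; the momentum value is an assumption (the entry leaves it unset), and the role of beta2 0.99 is not shown here.

```python
import numpy as np

def newton_schulz_orth(G, steps=5):
    """Approximately orthogonalize G via the quintic Newton-Schulz iteration
    (coefficients from the public Muon reference implementation)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)    # Frobenius normalization
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

def muon_step(w, grad, buf, lr=0.06, momentum=0.95, grad_clip=0.5):
    """One hypothetical Muon update; momentum=0.95 is assumed."""
    norm = np.linalg.norm(grad)
    if norm > grad_clip:                   # global-norm gradient clipping
        grad = grad * (grad_clip / norm)
    buf = momentum * buf + grad            # momentum accumulation
    w = w - lr * newton_schulz_orth(buf)   # orthogonalized update direction
    return w, buf

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64))
g = rng.standard_normal((64, 64))
w2, buf = muon_step(w, g, np.zeros_like(w))
print(w2.shape)  # (64, 64)
```

Orthogonalizing the momentum matrix equalizes the scale of update directions, which is why Muon tolerates the relatively large learning rate used here.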
Novel Contributions
- Wider-shallower 4x768 architecture with grouped-query attention
- Increased QK gain to sharpen attention
- Muon optimizer tuning with gradient clipping and beta2 adjustment
- Straight-through estimator quantization-aware training after warmup
- Reduced int8 quantization gap from about 0.03 to 0.0016 bpb
- Batch-size sweep on an H100, identifying 262K tokens as optimal for single-GPU training
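The "increased QK gain" contribution above can be illustrated with a toy example: scaling the query/key projection gain multiplies the attention logits, which concentrates the softmax onto the strongest keys. The 2x gain and the logit values below are hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5, 0.0])   # toy attention logits for one query
base = softmax(logits)
sharp = softmax(2.0 * logits)             # hypothetical 2x QK gain scales logits
print(sharp.max() > base.max())           # True: higher gain sharpens attention
```

The effect is the same as lowering the softmax temperature: attention mass shifts toward the highest-scoring positions.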