PR #1940
opennon record submission: 11 l + int6 + tuned LR + fp16 embed (1.3066 bpb local)
by skamalad
val_bpb: 1.3066
Architecture: Transformer
Optimizer: Muon
Artifact Size: 14.99 MB
Training Techniques
Architecture
weight tying
Tied input/output embeddings with FP16 embedding passthrough.
parameters: null
GQA
Grouped query attention with 8 attention heads and 4 KV heads; see the sketch after this list.
parameters: {"heads":8,"kv_heads":4}
RoPE
Uses RoPE positional encoding.
parameters: null
ReLU²
ReLU squared MLP activation.
parameters: null
MLP3x
Explored wider MLP configurations via MLP_MULT overrides; the baseline architecture uses 2x expansion, with 3x covered in ablations.
parameters: {"mlp_mult":2}
depth
Increased transformer depth from 9 to 11 layers.
parameters: {"layers":11}
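The architecture entries above (GQA, RoPE, ReLU², tied embeddings, 11 layers) combine into a fairly standard decoder block. Below is a minimal sketch under those listed hyperparameters, not the submission's code; module names and the RoPE placement are illustrative.

```python
# Minimal sketch: GQA (8 query heads, 4 KV heads) and a ReLU^2 MLP with 2x expansion.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GQAttention(nn.Module):
    def __init__(self, dim: int, n_heads: int = 8, n_kv_heads: int = 4):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.wq = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.wkv = nn.Linear(dim, 2 * n_kv_heads * self.head_dim, bias=False)
        self.wo = nn.Linear(n_heads * self.head_dim, dim, bias=False)

    def forward(self, x):
        B, T, _ = x.shape
        q = self.wq(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k, v = self.wkv(x).chunk(2, dim=-1)
        k = k.view(B, T, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # RoPE would be applied to q and k here (omitted for brevity).
        # Each KV head is shared by n_heads // n_kv_heads query heads.
        rep = self.n_heads // self.n_kv_heads
        k = k.repeat_interleave(rep, dim=1)
        v = v.repeat_interleave(rep, dim=1)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.wo(y.transpose(1, 2).reshape(B, T, -1))

class ReLU2MLP(nn.Module):
    # ReLU^2 activation: relu(x) squared, with mlp_mult=2 hidden expansion.
    def __init__(self, dim: int, mlp_mult: int = 2):
        super().__init__()
        self.fc1 = nn.Linear(dim, mlp_mult * dim, bias=False)
        self.fc2 = nn.Linear(mlp_mult * dim, dim, bias=False)

    def forward(self, x):
        return self.fc2(F.relu(self.fc1(x)).square())
```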
Quantization
int6
bits: 6
scope: all weights except embeddings
fp16
bits: 16
scope: embeddings
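A minimal sketch of how an int6 export with FP16 embedding passthrough could work, assuming symmetric per-tensor quantization with QUANT_MAX = 31 (see Novel Contributions below); the function names and numpy-array weight format are assumptions, not the submission's exporter.

```python
# Minimal sketch: symmetric int6 quantization (QUANT_MAX = 31, values in
# [-31, 31]) for all weights, with embeddings stored in FP16 instead.
import numpy as np

QUANT_MAX = 31  # 6-bit symmetric range

def quantize_int6(w: np.ndarray):
    scale = np.abs(w).max() / QUANT_MAX
    q = np.clip(np.round(w / scale), -QUANT_MAX, QUANT_MAX).astype(np.int8)
    return q, scale  # dequantize later as q * scale

def export_weights(state_dict):
    # state_dict values are assumed to be numpy arrays here.
    out = {}
    for name, w in state_dict.items():
        if "embed" in name:          # FP16 passthrough: no quantization tax
            out[name] = w.astype(np.float16)
        else:                        # all other weights -> int6 + scale
            out[name] = quantize_int6(w)
    return out
```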
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"matrix_lr":0.025,"scalar_lr":0.025,"tied_embed_lr":0.035,"momentum_warmup_start":0.92,"momentum_warmup_steps":1500}
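The listed momentum warmup ramps from 0.92 to 0.99 over 1500 steps. A minimal sketch of that ramp, assuming a simple linear interpolation applied per step; the param-group update in the trailing comment is an assumption about the optimizer's interface, not the submission's code.

```python
# Minimal sketch: linear Muon momentum warmup from 0.92 to 0.99 over 1500 steps.
def muon_momentum(step: int,
                  start: float = 0.92,
                  end: float = 0.99,
                  warmup_steps: int = 1500) -> float:
    frac = min(step / warmup_steps, 1.0)
    return start + (end - start) * frac

# e.g. before each optimizer step (assumed param-group layout):
# for group in muon_optimizer.param_groups:
#     group["momentum"] = muon_momentum(step)
```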
LR Schedule
warmdown
parameters: {"warmdown_steps":3000}
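A minimal sketch of a warmdown schedule with warmdown_steps = 3000, assuming the common form that holds the base LR and then decays it linearly to zero over the final 3000 steps; total_steps is an assumed parameter not given above.

```python
# Minimal sketch: constant LR, then linear warmdown to zero over the last
# `warmdown_steps` steps. Multiply the base LR by this factor each step.
def lr_multiplier(step: int, total_steps: int, warmdown_steps: int = 3000) -> float:
    if step < total_steps - warmdown_steps:
        return 1.0
    return (total_steps - step) / warmdown_steps
```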
Novel Contributions
- 11-layer transformer instead of 9 layers
- Int6 quantization export with QUANT_MAX=31 for better compression
- FP16 embedding passthrough to eliminate quantization tax
- Tuned Muon optimizer settings with lower matrix LR and higher momentum
- MLP_HIDDEN environment override for finer MLP width control (sketched below)
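A minimal sketch of how an MLP_HIDDEN environment override could sit alongside the existing MLP_MULT factor; the function name and fallback logic are assumptions, not the submission's config code.

```python
# Minimal sketch: MLP_HIDDEN sets the hidden width exactly; otherwise fall
# back to the MLP_MULT expansion factor (default 2x).
import os

def mlp_hidden_dim(model_dim: int) -> int:
    if "MLP_HIDDEN" in os.environ:           # exact width override
        return int(os.environ["MLP_HIDDEN"])
    mult = float(os.getenv("MLP_MULT", 2))   # expansion-factor fallback
    return int(mult * model_dim)
```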