PR #1975 (open)
Non-record: LeakyReLU2 + MuonWD + SlidingWindowEval, val_bpb=1.2111
by RishabhPrakash5
val_bpb: 1.2111
Architecture: Transformer
Optimizer: Muon
Artifact Size: 12,803,252 bytes
Training Techniques
Architecture
LeakyReLU
Replaced ReLU² with LeakyReLU(0.5)² in MLP blocks to avoid dead neurons while preserving squared activation behavior.
parameters: {"negative_slope":0.5}
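A minimal scalar sketch of the activation described above, assuming the square is applied directly to the LeakyReLU output (so negative pre-activations contribute 0.25·x² instead of 0, keeping gradient flow):

```python
def leaky_relu_sq(x, negative_slope=0.5):
    """LeakyReLU(x)**2: the squared-activation shape of ReLU^2, but with a
    nonzero slope for x < 0 so neurons cannot go permanently dead.
    Hypothetical helper name; not from the PR itself."""
    y = x if x >= 0.0 else negative_slope * x
    return y * y
```

In a PyTorch MLP block this would replace `relu(x).square()` element-wise.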
weight tying
Tied embeddings are used in the baseline architecture.
parameters: null
GQA
The baseline transformer uses grouped-query attention, with fewer KV heads than query heads.
parameters: {"attention_heads":8,"kv_heads":4}
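With 8 query heads and 4 KV heads, each group of two query heads shares one KV head. A sketch of the head-to-head mapping (hypothetical helper, illustrating the grouping only, not the attention math):

```python
def kv_head_for(q_head, n_heads=8, n_kv_heads=4):
    """Return the KV head index that query head `q_head` attends with
    under grouped query attention: consecutive groups of
    n_heads // n_kv_heads query heads share one KV head."""
    assert n_heads % n_kv_heads == 0
    group_size = n_heads // n_kv_heads  # 2 for the 8/4 config above
    return q_head // group_size
```

This is why GQA shrinks the KV cache by `n_heads / n_kv_heads` (2x here) while keeping the full set of query projections.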
Optimizer
Muon
weight_decay: 0.04
momentum: 0.95
other_params: {"matrix_lr":0.04}
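A sketch of how weight decay typically combines with an optimizer update, assuming the decoupled (AdamW-style) form; `update` stands in for Muon's Newton-Schulz-orthogonalized momentum, which is omitted here:

```python
def muon_step_with_wd(param, update, lr=0.04, weight_decay=0.04):
    """One decoupled weight-decay step for a matrix parameter:
    shrink the weight toward zero by lr * weight_decay, then apply the
    (already-orthogonalized) Muon update scaled by lr.
    Hypothetical sketch; the PR's exact coupling may differ."""
    return param * (1.0 - lr * weight_decay) - lr * update
```

Decoupling keeps the decay strength independent of the gradient magnitude, which matters when the update is orthogonalized to roughly unit scale.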
Regularization
weight decay
parameters: {"value":0.04,"applied_to":"Muon matrix parameters"}
logit softcap
parameters: {"value":30}
Evaluation
sliding window eval
parameters: {"stride":64}
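A sketch of the window bookkeeping for strided sliding-window evaluation, assuming a context window equal to the 1024-token train length (eval_length is null in this entry, so that is an assumption): the context advances by stride=64, and each token is scored exactly once with near-full left context.

```python
def sliding_window_spans(n_tokens, window=1024, stride=64):
    """Return (ctx_start, ctx_end, score_start) triples: the model sees
    tokens [ctx_start, ctx_end) and loss is taken only on
    [score_start, ctx_end), so no token is scored twice.
    Hypothetical helper illustrating the stride=64 scheme."""
    spans = []
    prev_end = 0
    begin = 0
    while prev_end < n_tokens:
        end = min(begin + window, n_tokens)
        spans.append((begin, end, prev_end))
        prev_end = end
        begin += stride
    return spans
```

The smaller the stride, the more forward passes per evaluated token, trading compute for a tighter (lower) bpb estimate.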
Sequence Length
sequence_length
train_length: 1024
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_iters":1200}
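A sketch of a warmdown schedule as a multiplier on the base LR, assuming the common trapezoidal form (constant, then linear decay to zero over the final warmdown_iters steps; the total iteration count is not given in this entry, so it is left as a parameter):

```python
def lr_scale(step, total_iters, warmdown_iters=1200):
    """LR multiplier: 1.0 until the final `warmdown_iters` steps,
    then linear decay to 0.0 at `total_iters`."""
    if step < total_iters - warmdown_iters:
        return 1.0
    remaining = total_iters - step
    return max(remaining / warmdown_iters, 0.0)
```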
Novel Contributions
- LeakyReLU(0.5)² activation in MLP blocks
- Muon weight decay with WD=0.04
- Sliding window evaluation with stride=64
- Improved val_bpb to 1.2111 from a naive baseline of 1.2244