PR #1975 (open)
Non-record: LeakyReLU2 + MuonWD + SlidingWindowEval, val_bpb=1.2111
by RishabhPrakash5
val_bpb: 1.2111
Architecture: Transformer
Optimizer: Muon
Artifact Size: 12,803,252 bytes
Training Techniques
Architecture
LeakyReLU
Replaced ReLU² with LeakyReLU(0.5)² in MLP blocks to avoid dead neurons while preserving squared activation behavior.
parameters: {"negative_slope":0.5}
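A minimal scalar sketch of the activation described above, assuming the square is applied directly to the LeakyReLU output (so negative pre-activations contribute 0.25·x² instead of 0, keeping gradient flow):

```python
def leaky_relu_sq(x, negative_slope=0.5):
    """LeakyReLU(x)**2: the squared-activation shape of ReLU^2, but with a
    nonzero slope for x < 0 so neurons cannot go permanently dead.
    Hypothetical helper name; not from the PR itself."""
    y = x if x >= 0.0 else negative_slope * x
    return y * y
```

In a PyTorch MLP block this would replace `relu(x).square()` element-wise.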
weight tying
Tied embeddings are used in the baseline architecture.
parameters: null
GQA
The baseline transformer uses grouped-query attention, with fewer KV heads than query heads.
parameters: {"attention_heads":8,"kv_heads":4}
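With 8 query heads and 4 KV heads, each group of two query heads shares one KV head. A sketch of the head-to-head mapping (hypothetical helper, illustrating the grouping only, not the attention math):

```python
def kv_head_for(q_head, n_heads=8, n_kv_heads=4):
    """Return the KV head index that query head `q_head` attends with
    under grouped query attention: consecutive groups of
    n_heads // n_kv_heads query heads share one KV head."""
    assert n_heads % n_kv_heads == 0
    group_size = n_heads // n_kv_heads  # 2 for the 8/4 config above
    return q_head // group_size
```

This is why GQA shrinks the KV cache by `n_heads / n_kv_heads` (2x here) while keeping the full set of query projections.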
Optimizer
Muon
weight_decay: 0.04
momentum: 0.95
other_params: {"matrix_lr":0.04}
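A sketch of how weight decay typically combines with an optimizer update, assuming the decoupled (AdamW-style) form; `update` stands in for Muon's Newton-Schulz-orthogonalized momentum, which is omitted here:

```python
def muon_step_with_wd(param, update, lr=0.04, weight_decay=0.04):
    """One decoupled weight-decay step for a matrix parameter:
    shrink the weight toward zero by lr * weight_decay, then apply the
    (already-orthogonalized) Muon update scaled by lr.
    Hypothetical sketch; the PR's exact coupling may differ."""
    return param * (1.0 - lr * weight_decay) - lr * update
```

Decoupling keeps the decay strength independent of the gradient magnitude, which matters when the update is orthogonalized to roughly unit scale.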
Regularization
weight decay
parameters: {"value":0.04,"applied_to":"Muon matrix parameters"}
logit softcap
parameters: {"value":30}
Evaluation
sliding window eval
parameters: {"stride":64}
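A sketch of the window bookkeeping for strided sliding-window evaluation, assuming a context window equal to the 1024-token train length (eval_length is null in this entry, so that is an assumption): the context advances by stride=64, and each token is scored exactly once with near-full left context.

```python
def sliding_window_spans(n_tokens, window=1024, stride=64):
    """Return (ctx_start, ctx_end, score_start) triples: the model sees
    tokens [ctx_start, ctx_end) and loss is taken only on
    [score_start, ctx_end), so no token is scored twice.
    Hypothetical helper illustrating the stride=64 scheme."""
    spans = []
    prev_end = 0
    begin = 0
    while prev_end < n_tokens:
        end = min(begin + window, n_tokens)
        spans.append((begin, end, prev_end))
        prev_end = end
        begin += stride
    return spans
```

The smaller the stride, the more forward passes per evaluated token, trading compute for a tighter (lower) bpb estimate.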
Sequence Length
sequence_length
train_length: 1024
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_iters":1200}
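A sketch of a warmdown schedule as a multiplier on the base LR, assuming the common trapezoidal form (constant, then linear decay to zero over the final warmdown_iters steps; the total iteration count is not given in this entry, so it is left as a parameter):

```python
def lr_scale(step, total_iters, warmdown_iters=1200):
    """LR multiplier: 1.0 until the final `warmdown_iters` steps,
    then linear decay to 0.0 at `total_iters`."""
    if step < total_iters - warmdown_iters:
        return 1.0
    remaining = total_iters - step
    return max(remaining / warmdown_iters, 0.0)
```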
Novel Contributions
- LeakyReLU(0.5)² activation in MLP blocks
- Muon weight decay with WD=0.04
- Sliding window evaluation with stride=64
- Improved val_bpb to 1.2111 from a naive baseline of 1.2244