PR #941 (open)

submission: LeakyReLU² + EMA + BigramHash(20480) + MLP3.5x

by aptsalt

val_bpb: 1.3620
Architecture: Transformer
Optimizer: Muon
Artifact Size:

Training Techniques

Architecture
LeakyReLU
LeakyReLU(0.5) squared activation
parameters: {"squared":true,"negative_slope":0.5}
Partial RoPE
Partial rotary positional embedding applied to a subset of dimensions
parameters: {"dimensions":16,"total_dimensions":64}
BigramHash
Larger bigram vocabulary for token hashing
parameters: {"vocab_size":20480}
MLP3.5x
Wider MLP hidden dimension
parameters: {"multiplier":3.5}
KV head count
Attention uses 8 heads and 4 KV heads
parameters: {"heads":8,"kv_heads":4}
Weight Averaging
EMA
Exponential moving average of model weights
parameters: {"decay":0.997}
Quantization
Late QAT
Quantization-aware training enabled late in the run
bits: null
scope: null
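
Bits and scope are left null in the metadata; the contributions list below mentions mixed int5/int6. A generic fake-quantization sketch with a straight-through estimator, switched on only late in training (bit width and placement are assumptions):

```python
import torch

def fake_quantize(w: torch.Tensor, bits: int) -> torch.Tensor:
    """Symmetric per-tensor fake quantization with a straight-through estimator."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    w_q = (w / scale).round().clamp(-qmax - 1, qmax) * scale
    return w + (w_q - w).detach()  # forward: quantized; backward: identity

# e.g. inside the forward pass, once step > qat_start_step:
#   w = fake_quantize(self.weight, bits=5)  # or 6, per the mixed int5/int6 note
```
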
Regularization
LN scale
Per-layer LayerNorm scale of 1/sqrt(layer+1)
parameters: {"scale":"1/sqrt(layer+1)"}
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"adamw_for_scalars_embeddings":true}
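
adamw_for_scalars_embeddings suggests the usual Muon split: 2D weight matrices go to Muon, scalars and embeddings go to AdamW. A routing sketch (the Muon import and signature, the name filters, and the learning rates are all assumptions):

```python
import torch
from muon import Muon  # hypothetical import; any Muon taking (params, lr, momentum)

def build_optimizers(model: torch.nn.Module):
    """Send 2D weight matrices to Muon; everything else to AdamW."""
    matrix_params, other_params = [], []
    for name, p in model.named_parameters():
        if p.ndim == 2 and "embed" not in name and "head" not in name:
            matrix_params.append(p)
        else:
            other_params.append(p)
    muon = Muon(matrix_params, lr=0.02, momentum=0.99)  # momentum from the listing
    adamw = torch.optim.AdamW(other_params, lr=3e-4)    # placeholder LR
    return muon, adamw
```
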
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
Compression
zstd
level: 22
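
Level 22 is zstd's maximum. A sketch via the Python zstandard bindings (CLI equivalent: zstd --ultra -22, since levels above 19 require --ultra):

```python
import zstandard as zstd

def compress_artifact(src: str, dst: str) -> None:
    """Compress a model artifact at zstd level 22, the library maximum."""
    cctx = zstd.ZstdCompressor(level=22)
    with open(src, "rb") as f_in, open(dst, "wb") as f_out:
        f_out.write(cctx.compress(f_in.read()))
```
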
Evaluation
sliding window eval
parameters: {"stride":64}

Novel Contributions

  • LeakyReLU(0.5) squared activation
  • EMA weight averaging instead of SWA
  • Late QAT
  • Partial RoPE with 16/64 dimensions
  • LN scale regularization
  • BigramHash with 20480 vocabulary size
  • MLP width multiplier of 3.5x
  • Mixed int5/int6 quantization
  • zstd-22 artifact compression