PR #1859 (open): Add 10L LeakyReLU + Gated Attention + Value Residual record (1.1454)

by suchihype
val_bpb: 1.1454
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.65 MB

Training Techniques

Architecture
LeakyReLU
Uses leaky ReLU squared MLP activation instead of ReLU squared.
parameters: {"negative_slope":0.5,"squared":true}
Gated Attention
Applies a sigmoid output gate after the attention output projection.
parameters: {"gate_bias_init":2}
Value Residual
Blends each block's value tensor with the first block's value tensor using a learnable scalar.
parameters: {"alpha_init":0.9}
U-Net skip connections
Uses encoder-decoder skip connections in the transformer backbone.
parameters: {"encoder_layers":5,"decoder_layers":5}
GQA
Uses grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
BigramHash
Adds a learned bigram hash feature.
parameters: {"buckets":4096,"dim":128}
SmearGate
Uses SmearGate in the architecture.
parameters: null
Regularization
logit softcap
parameters: {"value":30}
LN scale
parameters: null
Optimizer
Muon
weight_decay: 0.045
momentum: 0.99
other_params: {"lr":0.035,"warmup_momentum_start":0.92}
AdamW
weight_decay: 0.01
momentum: null
other_params: {"embed_lr":0.045,"scalar_lr":0.035,"betas":[0.9,0.95],"eps":1e-8}
Quantization
late QAT
bits: 6
scope: all
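One common way to realize late QAT, sketched under heavy assumptions (symmetric per-tensor fake quantization with a straight-through estimator, enabled only for the final phase of training; the record fixes only bits=6 and scope=all):

```python
import torch

def fake_quant(w: torch.Tensor, bits: int = 6) -> torch.Tensor:
    # Symmetric per-tensor fake quantization with a straight-through
    # estimator (STE): the forward pass sees 6-bit weights, the backward
    # pass lets gradients through unchanged.
    qmax = 2 ** (bits - 1) - 1                      # 31 for 6 bits
    scale = w.abs().max().clamp(min=1e-8) / qmax
    wq = (w / scale).round().clamp(-qmax - 1, qmax) * scale
    return w + (wq - w).detach()
```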
Compression
zstd
level: 22
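Compressing the checkpoint at zstd's maximum standard level, e.g. via the python zstandard bindings (file names here are illustrative):

```python
import zstandard

def compress_artifact(src: str, dst: str) -> None:
    # Level 22 is zstd's maximum: slowest to compress, smallest output.
    cctx = zstandard.ZstdCompressor(level=22)
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        cctx.copy_stream(fin, fout)
```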
Evaluation
sliding window eval
parameters: {"stride":64}
Sequence Length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmdown_steps":2000}

Novel Contributions

  • LeakyReLU(0.5)^2 MLP activation
  • Gated Attention with sigmoid output gate and +2 bias initialization
  • Value Residual Learning with learnable alpha initialized to 0.9
  • Stacking three orthogonal improvements on top of the PR #583 baseline
  • Sliding window evaluation with stride 64
  • Quantized and compressed submission under the 16 MB cap