PR #1859 (open): Add 10L LeakyReLU + Gated Attention + Value Residual record (1.1454)

by suchihype
val_bpb: 1.1454
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.65 MB

Training Techniques

Architecture
LeakyReLU
Uses leaky ReLU squared MLP activation instead of ReLU squared.
parameters: {"negative_slope":0.5,"squared":true}
Gated Attention
Applies a sigmoid output gate after the attention output projection.
parameters: {"gate_bias_init":2}
Value Residual
Blends each block's value tensor with the first block's value tensor using a learnable scalar.
parameters: {"alpha_init":0.9}
U-Net skip connections
Uses encoder-decoder skip connections in the transformer backbone.
parameters: {"encoder_layers":5,"decoder_layers":5}
GQA
Uses grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
BigramHash
Adds a learned bigram hash feature.
parameters: {"buckets":4096,"dim":128}
SmearGate
Uses SmearGate in the architecture.
parameters: null
Regularization
logit softcap
parameters: {"value":30}
LN scale
parameters: null
Optimizer
Muon
weight_decay: 0.045
momentum: 0.99
other_params: {"lr":0.035,"warmup_momentum_start":0.92}
AdamW
weight_decay: 0.01
momentum: null
other_params: {"embed_lr":0.045,"scalar_lr":0.035,"betas":[0.9,0.95],"eps":1e-8}
Quantization
late QAT
bits: 6
scope: all
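One common way to realize late QAT, sketched under heavy assumptions (symmetric per-tensor fake quantization with a straight-through estimator, enabled only for the final phase of training; the record fixes only bits=6 and scope=all):

```python
import torch

def fake_quant(w: torch.Tensor, bits: int = 6) -> torch.Tensor:
    # Symmetric per-tensor fake quantization with a straight-through
    # estimator (STE): the forward pass sees 6-bit weights, the backward
    # pass lets gradients through unchanged.
    qmax = 2 ** (bits - 1) - 1                      # 31 for 6 bits
    scale = w.abs().max().clamp(min=1e-8) / qmax
    wq = (w / scale).round().clamp(-qmax - 1, qmax) * scale
    return w + (wq - w).detach()
```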
Compression
zstd
level: 22
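Compressing the checkpoint at zstd's maximum standard level, e.g. via the python zstandard bindings (file names here are illustrative):

```python
import zstandard

def compress_artifact(src: str, dst: str) -> None:
    # Level 22 is zstd's maximum: slowest to compress, smallest output.
    cctx = zstandard.ZstdCompressor(level=22)
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        cctx.copy_stream(fin, fout)
```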
Evaluation
sliding window eval
parameters: {"stride":64}
Sequence Length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmdown_steps":2000}

Novel Contributions

  • LeakyReLU(0.5)^2 MLP activation
  • Gated Attention with sigmoid output gate and +2 bias initialization
  • Value Residual Learning with learnable alpha initialized to 0.9
  • Stacking three orthogonal improvements on top of the PR #583 baseline
  • Sliding window evaluation with stride 64
  • Quantized and compressed submission under the 16 MB cap