PR #1131

open

Improve baseline with LeakyReLU² activation

by JianYan11
val_bpb: 1.2947
Architecture: Transformer

Architecture: LeakyReLU
Replaced ReLU² with LeakyReLU(0.5)² in the MLP forward pass to preserve negative gradient flow while keeping squared outputs.
parameters: {"negative_slope":0.5}
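
A minimal sketch of the described change, assuming a nanoGPT-style PyTorch MLP block; the class and layer names (MLP, c_fc, c_proj) and dimensions are illustrative and not taken from the PR, only the activation swap follows the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MLP(nn.Module):
    """Illustrative transformer MLP block with ReLU² replaced by LeakyReLU(0.5)²."""

    def __init__(self, dim: int, hidden_dim: int, negative_slope: float = 0.5):
        super().__init__()
        self.c_fc = nn.Linear(dim, hidden_dim)
        self.c_proj = nn.Linear(hidden_dim, dim)
        self.negative_slope = negative_slope

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.c_fc(x)
        # Baseline: x = F.relu(x).square()
        # LeakyReLU(0.5) keeps a nonzero gradient for negative pre-activations;
        # squaring preserves the squared-output characteristic of the baseline.
        x = F.leaky_relu(x, negative_slope=self.negative_slope).square()
        return self.c_proj(x)
```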

Novel Contributions

  • Replaced ReLU² with LeakyReLU(0.5)² in the MLP forward pass
  • Preserved negative gradient flow while maintaining squared output characteristic
  • Reported improved validation bpb (1.2947) over the ReLU² baseline