PR #1062 (open)
Non-record: LeakyReLU(0.9)² slope sweep (local validation, compute pending)
by yaowubarbara
val_bpb
1.4508
Architecture
Transformer
Optimizer
—
Artifact Size
12.7 MB
Training Techniques
Architecture
LeakyReLU
Uses LeakyReLU with negative slope 0.9 as the MLP activation, with the output squared afterwards (LeakyReLU²).
parameters: {"negative_slope":0.9}
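Taking the description literally (apply LeakyReLU with slope 0.9, then square the result), the activation can be sketched as follows; whether the PR squares the raw output or uses a sign-preserving variant is not stated here, so this is an assumption:

```python
def leaky_relu_sq(x, negative_slope=0.9):
    """LeakyReLU followed by squaring, per a literal reading of the card.

    For x >= 0 this matches relu²; for x < 0 it returns
    (negative_slope * x)², which is small but nonzero.
    """
    y = x if x >= 0.0 else negative_slope * x
    return y * y
```

With slope 0.9 the negative branch retains most of the input's magnitude before squaring, which is the axis the planned slope sweep varies.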
XSA
Uses XSA in the last 4 layers of the base stack.
parameters: {"layers":4}
Partial RoPE
Applies partial rotary positional embeddings.
parameters: {"range":"16/64"}
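One plausible reading of the "16/64" range is that rotary embeddings are applied to 16 of each head's 64 dimensions, with the rest passed through unchanged. The pairing scheme and base frequency below are illustrative assumptions, not taken from the PR:

```python
import math

def partial_rope(vec, pos, rot_dims=16, base=10000.0):
    """Apply rotary position embedding to the first `rot_dims` dims
    of a per-head vector; remaining dims are left untouched.

    Dimension pairing (adjacent pairs) and `base` are assumptions.
    """
    out = list(vec)
    for i in range(0, rot_dims, 2):
        theta = pos / (base ** (i / rot_dims))
        c, s = math.cos(theta), math.sin(theta)
        x0, x1 = vec[i], vec[i + 1]
        out[i] = x0 * c - x1 * s
        out[i + 1] = x0 * s + x1 * c
    return out
```

Rotation preserves the norm of each rotated pair, so only relative position information is injected into the rotated sub-block.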
MLP3x
Uses 3x MLP blocks in the model stack.
parameters: null
GQA
Uses grouped query attention.
parameters: null
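Grouped query attention shares each key/value head across a group of query heads. A minimal score-computation sketch (head counts here are illustrative; the PR does not state them):

```python
import numpy as np

def gqa_scores(q, k, n_kv_heads):
    """Grouped-query attention scores with shared K heads.

    q: (n_q_heads, d) query vectors, k: (n_kv_heads, seq, d) keys.
    Each group of n_q_heads // n_kv_heads query heads attends
    against the same K head, shrinking the KV cache accordingly.
    """
    n_q_heads, d = q.shape
    group = n_q_heads // n_kv_heads
    scores = np.empty((n_q_heads, k.shape[1]))
    for h in range(n_q_heads):
        scores[h] = k[h // group] @ q[h] / np.sqrt(d)
    return scores
```

The memory saving comes entirely from storing `n_kv_heads` rather than `n_q_heads` K/V tensors; the per-query-head score computation is unchanged.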
BigramHash
Uses bigram hash embeddings/features.
parameters: null
SmearGate
Uses SmearGate in the architecture.
parameters: null
Quantization
GPTQ-lite
bits: 6
scope: model weights
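"GPTQ-lite" is not further specified on the card; as a stand-in for the 6-bit budget, a generic symmetric round-to-nearest quantizer (levels in [-31, 31]) looks like this. It illustrates the artifact-size accounting only, not the PR's actual method:

```python
def quantize_6bit(weights):
    """Generic symmetric 6-bit round-to-nearest quantization (not GPTQ).

    Maps each weight to an integer level in [-31, 31] with a single
    per-tensor scale; reconstruction error is at most scale / 2.
    """
    scale = max(abs(w) for w in weights) / 31.0 or 1.0
    q = [max(-31, min(31, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]
```

At 6 bits per weight plus zstd on top, this is consistent with the 12.7 MB artifact size being well below a float checkpoint.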
Compression
zstd
level: null
Weight Averaging
EMA
parameters: null
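The card lists EMA weight averaging with no parameters; the update rule itself is standard. A minimal sketch (the decay value is illustrative, not from the PR):

```python
def ema_update(ema_weights, weights, decay=0.999):
    """One EMA step over flat weight lists:
    ema <- decay * ema + (1 - decay) * current.

    The averaged copy is typically the one evaluated and exported.
    """
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]
```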
Evaluation
sliding window eval
parameters: {"stride":64}
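With stride 64, sliding-window evaluation re-reads up to a full window of context but scores only the final 64 tokens of each span, so every scored token sees close to a full context window. A sketch of the span schedule, assuming the window equals the card's eval_length of 1024 (the PR's exact scheme may differ):

```python
def sliding_window_spans(n_tokens, window=1024, stride=64):
    """Yield (context_start, score_start, end) spans for sliding-window eval.

    Tokens in [score_start, end) are scored exactly once; tokens in
    [context_start, score_start) are re-read as context only.
    """
    spans = []
    score_start = 0
    while score_start < n_tokens:
        end = min(score_start + stride, n_tokens)
        ctx_start = max(0, end - window)
        spans.append((ctx_start, score_start, end))
        score_start = end
    return spans
```

Compared with scoring disjoint 1024-token chunks, this removes the short-context penalty on tokens near chunk starts, which is the correction applied to the reported val_bpb.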
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024
Novel Contributions
- Investigates LeakyReLU negative slope 0.9 as an alternative to 0.5 for LeakyReLU² activations
- Reports local RTX 5060 validation for the PR #466 stack with slope 0.9
- Compares a baseline relu² model against the PR #466 stack with LeakyReLU(0.9)²
- Applies a sliding-window evaluation correction to the reported validation bpb
- Outlines a planned sweep over multiple negative-slope values on full 8xH100 validation