PR #882

open

Non-record: LeakyReLU(0.5)^2 + TrigramHash on PR414 stack (1.3762 bpb, 1xA100)

by IshiPareek
val_bpb: 1.3762
Architecture: Transformer
Optimizer: (not specified)
Artifact Size: (not specified)

Training Techniques

Quantization
  • GPTQ-lite (bits: null, scope: all)
  • QAT (bits: null, scope: all)
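The card lists both GPTQ-lite and QAT but leaves the bit widths as null. As a rough illustration of the QAT side only, a minimal fake-quantization step with a straight-through estimator might look like the following; the 8-bit default, symmetric per-tensor scaling, and function name are assumptions, not details from this PR:

```python
import torch

def fake_quantize(w: torch.Tensor, bits: int = 8) -> torch.Tensor:
    """Symmetric per-tensor fake quantization with a straight-through
    estimator, so gradients flow through as if no rounding happened.
    The 8-bit default is an assumption; the PR leaves bits as null."""
    qmax = 2 ** (bits - 1) - 1                  # e.g. 127 for 8 bits
    scale = w.detach().abs().max().clamp(min=1e-8) / qmax
    q = (w / scale).round().clamp(-qmax, qmax)  # quantize
    deq = q * scale                             # dequantize
    # straight-through: forward uses deq, backward sees identity
    return w + (deq - w).detach()
```

During QAT the forward pass sees quantized weights while the optimizer still updates the full-precision copy.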
Weight Averaging
  • EMA (parameters: null)
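The EMA parameters are null here, but exponential moving averaging of weights is conventionally a shadow copy updated once per optimizer step. A minimal sketch, with an assumed decay of 0.999 (the PR does not state one):

```python
import torch

class EMA:
    """Keeps an exponential moving average of a model's parameters.
    decay=0.999 is an assumed value; the PR leaves parameters null."""
    def __init__(self, model: torch.nn.Module, decay: float = 0.999):
        self.decay = decay
        self.shadow = {n: p.detach().clone()
                       for n, p in model.named_parameters()}

    @torch.no_grad()
    def update(self, model: torch.nn.Module) -> None:
        # shadow <- decay * shadow + (1 - decay) * current weights
        for n, p in model.named_parameters():
            self.shadow[n].mul_(self.decay).add_(p, alpha=1 - self.decay)

    @torch.no_grad()
    def copy_to(self, model: torch.nn.Module) -> None:
        # swap in the averaged weights, e.g. for evaluation
        for n, p in model.named_parameters():
            p.copy_(self.shadow[n])
```

Evaluation (and the reported val_bpb, if EMA was used at eval time) would run on the shadow weights rather than the raw training weights.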
LR Schedule
  • warmdown (parameters: {"warmdown_steps": 3500})
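A warmdown schedule with warmdown_steps=3500 presumably holds the learning rate constant and then decays it over the final 3500 steps. The linear shape below is a guess; only the 3500-step figure comes from the card:

```python
def warmdown_lr(step: int, total_steps: int, base_lr: float,
                warmdown_steps: int = 3500) -> float:
    """Constant LR, then linear decay to zero over the last
    `warmdown_steps` steps. The linear shape is an assumption;
    only warmdown_steps=3500 comes from the PR card."""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    frac = (total_steps - step) / warmdown_steps  # goes 1 -> 0
    return base_lr * frac
```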
Architecture
  • LeakyReLU: Replaced ReLU² with LeakyReLU(0.5)² in the MLP to keep neurons active during training. (parameters: {"negative_slope": 0.5})
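Reading "LeakyReLU(0.5)²" as an elementwise square of the LeakyReLU output (one plausible interpretation: squaring makes the negative branch positive, but the gradient there stays nonzero, matching the stated motivation), a drop-in PyTorch module might look like:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LeakyReLUSquared(nn.Module):
    """Squared LeakyReLU: f(x) = leaky_relu(x, slope) ** 2.
    Unlike ReLU**2, negative inputs keep a nonzero gradient.
    Interpreting the square as elementwise is an assumption."""
    def __init__(self, negative_slope: float = 0.5):
        super().__init__()
        self.negative_slope = negative_slope

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.leaky_relu(x, self.negative_slope) ** 2
```

This would replace the ReLU² activation between the two MLP projections.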
  • TrigramHash: Groups 3 consecutive tokens into 8192 buckets before attention to add richer local context. (parameters: {"buckets": 8192, "n_gram": 3})
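The PR text does not show how TrigramHash is wired in. One plausible sketch hashes each token together with its two predecessors into one of 8192 buckets; the resulting bucket ids could then index an extra embedding table added to the token embeddings before attention. The multiplicative hash constant and the zero-padding at the sequence start are assumptions:

```python
import torch

def trigram_hash(tokens: torch.Tensor, buckets: int = 8192,
                 n_gram: int = 3) -> torch.Tensor:
    """Map each position's (token[t-2], token[t-1], token[t]) window
    to a bucket id in [0, buckets). tokens: (B, T) int64. The hash
    scheme is a guess; only buckets=8192 and n_gram=3 are from the PR."""
    h = torch.zeros_like(tokens)
    mult = 1_000_003  # assumed prime multiplier
    for k in range(n_gram - 1, -1, -1):  # oldest token in the window first
        shifted = torch.roll(tokens, shifts=k, dims=1)
        shifted[:, :k] = 0  # positions before the sequence start
        h = h * mult + shifted
    return h % buckets
```

A hypothetical usage would be `x = tok_emb(tokens) + bucket_emb(trigram_hash(tokens))` feeding the first attention block.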

Novel Contributions

  • LeakyReLU(0.5)^2 activation in the MLP
  • TrigramHash token grouping into 8192 buckets before attention
  • Built on the PR 414 stack with EMA, GPTQ-lite, warmdown (warmdown_steps=3500), and QAT@0.15