PR #882

open

Non-record: LeakyReLU(0.5)^2 + TrigramHash on PR414 stack (1.3762 bpb, 1xA100)

by IshiPareek
val_bpb: 1.3762
Architecture: Transformer
Optimizer: (not specified)
Artifact Size: (not specified)

Training Techniques

Quantization
  • GPTQ-lite (bits: null, scope: all)
  • QAT (bits: null, scope: all)
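The card lists both GPTQ-lite and QAT but leaves the bit widths as null. As a rough illustration of the QAT side only, a minimal fake-quantization step with a straight-through estimator might look like the following; the 8-bit default, symmetric per-tensor scaling, and function name are assumptions, not details from this PR:

```python
import torch

def fake_quantize(w: torch.Tensor, bits: int = 8) -> torch.Tensor:
    """Symmetric per-tensor fake quantization with a straight-through
    estimator, so gradients flow through as if no rounding happened.
    The 8-bit default is an assumption; the PR leaves bits as null."""
    qmax = 2 ** (bits - 1) - 1                  # e.g. 127 for 8 bits
    scale = w.detach().abs().max().clamp(min=1e-8) / qmax
    q = (w / scale).round().clamp(-qmax, qmax)  # quantize
    deq = q * scale                             # dequantize
    # straight-through: forward uses deq, backward sees identity
    return w + (deq - w).detach()
```

During QAT the forward pass sees quantized weights while the optimizer still updates the full-precision copy.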
Weight Averaging
  • EMA (parameters: null)
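The EMA parameters are null here, but exponential moving averaging of weights is conventionally a shadow copy updated once per optimizer step. A minimal sketch, with an assumed decay of 0.999 (the PR does not state one):

```python
import torch

class EMA:
    """Keeps an exponential moving average of a model's parameters.
    decay=0.999 is an assumed value; the PR leaves parameters null."""
    def __init__(self, model: torch.nn.Module, decay: float = 0.999):
        self.decay = decay
        self.shadow = {n: p.detach().clone()
                       for n, p in model.named_parameters()}

    @torch.no_grad()
    def update(self, model: torch.nn.Module) -> None:
        # shadow <- decay * shadow + (1 - decay) * current weights
        for n, p in model.named_parameters():
            self.shadow[n].mul_(self.decay).add_(p, alpha=1 - self.decay)

    @torch.no_grad()
    def copy_to(self, model: torch.nn.Module) -> None:
        # swap in the averaged weights, e.g. for evaluation
        for n, p in model.named_parameters():
            p.copy_(self.shadow[n])
```

Evaluation (and the reported val_bpb, if EMA was used at eval time) would run on the shadow weights rather than the raw training weights.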
LR Schedule
  • warmdown (parameters: {"warmdown_steps": 3500})
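A warmdown schedule with warmdown_steps=3500 presumably holds the learning rate constant and then decays it over the final 3500 steps. The linear shape below is a guess; only the 3500-step figure comes from the card:

```python
def warmdown_lr(step: int, total_steps: int, base_lr: float,
                warmdown_steps: int = 3500) -> float:
    """Constant LR, then linear decay to zero over the last
    `warmdown_steps` steps. The linear shape is an assumption;
    only warmdown_steps=3500 comes from the PR card."""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    frac = (total_steps - step) / warmdown_steps  # goes 1 -> 0
    return base_lr * frac
```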
Architecture
  • LeakyReLU: Replaced ReLU² with LeakyReLU(0.5)² in the MLP to keep neurons active during training. (parameters: {"negative_slope": 0.5})
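Reading "LeakyReLU(0.5)²" as an elementwise square of the LeakyReLU output (one plausible interpretation: squaring makes the negative branch positive, but the gradient there stays nonzero, matching the stated motivation), a drop-in PyTorch module might look like:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LeakyReLUSquared(nn.Module):
    """Squared LeakyReLU: f(x) = leaky_relu(x, slope) ** 2.
    Unlike ReLU**2, negative inputs keep a nonzero gradient.
    Interpreting the square as elementwise is an assumption."""
    def __init__(self, negative_slope: float = 0.5):
        super().__init__()
        self.negative_slope = negative_slope

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.leaky_relu(x, self.negative_slope) ** 2
```

This would replace the ReLU² activation between the two MLP projections.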
  • TrigramHash: Groups 3 consecutive tokens into 8192 buckets before attention to add richer local context. (parameters: {"buckets": 8192, "n_gram": 3})
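The PR text does not show how TrigramHash is wired in. One plausible sketch hashes each token together with its two predecessors into one of 8192 buckets; the resulting bucket ids could then index an extra embedding table added to the token embeddings before attention. The multiplicative hash constant and the zero-padding at the sequence start are assumptions:

```python
import torch

def trigram_hash(tokens: torch.Tensor, buckets: int = 8192,
                 n_gram: int = 3) -> torch.Tensor:
    """Map each position's (token[t-2], token[t-1], token[t]) window
    to a bucket id in [0, buckets). tokens: (B, T) int64. The hash
    scheme is a guess; only buckets=8192 and n_gram=3 are from the PR."""
    h = torch.zeros_like(tokens)
    mult = 1_000_003  # assumed prime multiplier
    for k in range(n_gram - 1, -1, -1):  # oldest token in the window first
        shifted = torch.roll(tokens, shifts=k, dims=1)
        shifted[:, :k] = 0  # positions before the sequence start
        h = h * mult + shifted
    return h % buckets
```

A hypothetical usage would be `x = tok_emb(tokens) + bucket_emb(trigram_hash(tokens))` feeding the first attention block.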

Novel Contributions

  • LeakyReLU(0.5)^2 activation in the MLP
  • TrigramHash token grouping into 8192 buckets before attention
  • Built on the PR 414 stack with EMA, GPTQ-lite, warmdown (warmdown_steps=3500), and QAT@0.15