PR #882 (open)
Non-record: LeakyReLU(0.5)^2 + TrigramHash on PR414 stack (1.3762 bpb, 1xA100)
by IshiPareek
val_bpb: 1.3762
Architecture: Transformer
Optimizer: —
Artifact Size: —
Training Techniques
Quantization
- GPTQ-lite (bits: null, scope: all)
- QAT (bits: null, scope: all)
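Neither quantizer's bit width is recorded above (bits: null). As a rough illustration of the fake-quantization step that QAT-style training relies on, here is a minimal sketch assuming symmetric per-tensor quantization with a caller-supplied bit width (`fake_quantize` is a hypothetical helper, not the PR's code):

```python
def fake_quantize(weights, bits):
    # Symmetric per-tensor fake quantization: scale onto a signed integer
    # grid, round, clamp, then dequantize. The bit width is an assumption;
    # the PR leaves it unspecified (bits: null).
    qmax = 2 ** (bits - 1) - 1                             # e.g. 127 for 8 bits
    scale = (max(abs(w) for w in weights) / qmax) or 1.0   # guard all-zero input
    return [max(-qmax - 1, min(qmax, round(w / scale))) * scale for w in weights]
```

During QAT the same rounding would sit in the forward pass with a straight-through estimator for gradients; here only the value mapping is shown.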
Weight Averaging
- EMA (parameters: null)
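The EMA parameters are not recorded (parameters: null). A minimal sketch of weight EMA, assuming the standard decay-style update with a hypothetical decay value:

```python
class EMA:
    """Exponential moving average of model weights, updated each step;
    the shadow copy is what gets evaluated. The decay value is an
    assumption -- the PR records parameters: null."""

    def __init__(self, weights, decay=0.999):
        self.decay = decay
        self.shadow = list(weights)

    def update(self, weights):
        # shadow <- decay * shadow + (1 - decay) * current weights
        d = self.decay
        self.shadow = [d * s + (1 - d) * w for s, w in zip(self.shadow, weights)]
```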
LR Schedule
- warmdown (parameters: {"warmdown_steps":3500})
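Only the step count is given. A minimal sketch assuming a linear warmdown: hold the base LR, then decay linearly to zero over the final 3,500 steps (the linear shape is an assumption):

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_steps=3500):
    # Constant LR until the warmdown window, then linear decay to zero.
    # Only warmdown_steps=3500 is recorded in the PR; the linear shape
    # is an assumption.
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```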
Architecture
- LeakyReLU: Replaced ReLU² with LeakyReLU(0.5)² in the MLP to keep neurons active during training. (parameters: {"negative_slope":0.5})
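A minimal sketch of the activation itself (elementwise; the MLP wiring is unchanged). Squaring a LeakyReLU keeps a nonzero gradient for negative pre-activations, whereas ReLU² zeroes them:

```python
def leaky_relu_sq(x, negative_slope=0.5):
    # ReLU^2 kills negative pre-activations entirely (dead neurons);
    # squaring a LeakyReLU instead leaves a slope^2-scaled response,
    # so gradient still flows for x < 0.
    y = x if x > 0 else negative_slope * x
    return y * y
```

Note that squaring maps negative inputs to positive outputs (scaled by slope² = 0.25); per the PR description, the point is the preserved gradient, not the sign.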
- TrigramHash: Groups 3 consecutive tokens into 8192 buckets before attention to add richer local context. (parameters: {"buckets":8192,"n_gram":3})
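The exact hashing scheme is not specified. A minimal sketch, assuming each position is hashed together with its two predecessors and that Python's built-in `hash` stands in for whatever hash the PR uses:

```python
def trigram_buckets(tokens, n_gram=3, buckets=8192):
    # Map each position to a bucket id derived from the last n_gram
    # token ids (shorter windows at the sequence start). The hash
    # function is an assumption; the PR only fixes n_gram=3 and
    # buckets=8192.
    return [hash(tuple(tokens[max(0, i - n_gram + 1): i + 1])) % buckets
            for i in range(len(tokens))]
```

Identical trigrams always land in the same bucket, which is what gives attention the extra local-context signal.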
Novel Contributions
- LeakyReLU(0.5)^2 activation in the MLP
- TrigramHash token grouping into 8192 buckets before attention
- Built on the PR 414 stack with EMA, GPTQ-lite, warmdown (warmdown_steps=3500), and QAT@0.15