PR #1062 (open)
Non-record: LeakyReLU(0.9)² slope sweep (local validation, compute pending)
by yaowubarbara
val_bpb
1.4508
Architecture
Transformer
Optimizer
—
Artifact Size
12.7 MB
Training Techniques
Architecture
LeakyReLU
Uses LeakyReLU with negative slope 0.9 as the MLP activation, with the output squared afterwards (LeakyReLU²).
parameters: {"negative_slope":0.9}
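Taking the description literally (apply LeakyReLU with slope 0.9, then square the result), the activation can be sketched as follows; whether the PR squares the raw output or uses a sign-preserving variant is not stated here, so this is an assumption:

```python
def leaky_relu_sq(x, negative_slope=0.9):
    """LeakyReLU followed by squaring, per a literal reading of the card.

    For x >= 0 this matches relu²; for x < 0 it returns
    (negative_slope * x)², which is small but nonzero.
    """
    y = x if x >= 0.0 else negative_slope * x
    return y * y
```

With slope 0.9 the negative branch retains most of the input's magnitude before squaring, which is the axis the planned slope sweep varies.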
XSA
Uses XSA in the last 4 layers of the base stack.
parameters: {"layers":4}
Partial RoPE
Applies partial rotary positional embeddings.
parameters: {"range":"16/64"}
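One plausible reading of the "16/64" range is that rotary embeddings are applied to 16 of each head's 64 dimensions, with the rest passed through unchanged. The pairing scheme and base frequency below are illustrative assumptions, not taken from the PR:

```python
import math

def partial_rope(vec, pos, rot_dims=16, base=10000.0):
    """Apply rotary position embedding to the first `rot_dims` dims
    of a per-head vector; remaining dims are left untouched.

    Dimension pairing (adjacent pairs) and `base` are assumptions.
    """
    out = list(vec)
    for i in range(0, rot_dims, 2):
        theta = pos / (base ** (i / rot_dims))
        c, s = math.cos(theta), math.sin(theta)
        x0, x1 = vec[i], vec[i + 1]
        out[i] = x0 * c - x1 * s
        out[i + 1] = x0 * s + x1 * c
    return out
```

Rotation preserves the norm of each rotated pair, so only relative position information is injected into the rotated sub-block.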
MLP3x
Uses 3x MLP blocks in the model stack.
parameters: null
GQA
Uses grouped query attention.
parameters: null
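Grouped query attention shares each key/value head across a group of query heads. A minimal score-computation sketch (head counts here are illustrative; the PR does not state them):

```python
import numpy as np

def gqa_scores(q, k, n_kv_heads):
    """Grouped-query attention scores with shared K heads.

    q: (n_q_heads, d) query vectors, k: (n_kv_heads, seq, d) keys.
    Each group of n_q_heads // n_kv_heads query heads attends
    against the same K head, shrinking the KV cache accordingly.
    """
    n_q_heads, d = q.shape
    group = n_q_heads // n_kv_heads
    scores = np.empty((n_q_heads, k.shape[1]))
    for h in range(n_q_heads):
        scores[h] = k[h // group] @ q[h] / np.sqrt(d)
    return scores
```

The memory saving comes entirely from storing `n_kv_heads` rather than `n_q_heads` K/V tensors; the per-query-head score computation is unchanged.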
BigramHash
Uses bigram hash embeddings/features.
parameters: null
SmearGate
Uses SmearGate in the architecture.
parameters: null
Quantization
GPTQ-lite
bits: 6
scope: model weights
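"GPTQ-lite" is not further specified on the card; as a stand-in for the 6-bit budget, a generic symmetric round-to-nearest quantizer (levels in [-31, 31]) looks like this. It illustrates the artifact-size accounting only, not the PR's actual method:

```python
def quantize_6bit(weights):
    """Generic symmetric 6-bit round-to-nearest quantization (not GPTQ).

    Maps each weight to an integer level in [-31, 31] with a single
    per-tensor scale; reconstruction error is at most scale / 2.
    """
    scale = max(abs(w) for w in weights) / 31.0 or 1.0
    q = [max(-31, min(31, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]
```

At 6 bits per weight plus zstd on top, this is consistent with the 12.7 MB artifact size being well below a float checkpoint.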
Compression
zstd
level: null
Weight Averaging
EMA
parameters: null
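The card lists EMA weight averaging with no parameters; the update rule itself is standard. A minimal sketch (the decay value is illustrative, not from the PR):

```python
def ema_update(ema_weights, weights, decay=0.999):
    """One EMA step over flat weight lists:
    ema <- decay * ema + (1 - decay) * current.

    The averaged copy is typically the one evaluated and exported.
    """
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]
```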
Evaluation
sliding window eval
parameters: {"stride":64}
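With stride 64, sliding-window evaluation re-reads up to a full window of context but scores only the final 64 tokens of each span, so every scored token sees close to a full context window. A sketch of the span schedule, assuming the window equals the card's eval_length of 1024 (the PR's exact scheme may differ):

```python
def sliding_window_spans(n_tokens, window=1024, stride=64):
    """Yield (context_start, score_start, end) spans for sliding-window eval.

    Tokens in [score_start, end) are scored exactly once; tokens in
    [context_start, score_start) are re-read as context only.
    """
    spans = []
    score_start = 0
    while score_start < n_tokens:
        end = min(score_start + stride, n_tokens)
        ctx_start = max(0, end - window)
        spans.append((ctx_start, score_start, end))
        score_start = end
    return spans
```

Compared with scoring disjoint 1024-token chunks, this removes the short-context penalty on tokens near chunk starts, which is the correction applied to the reported val_bpb.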
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024
Novel Contributions
- Investigates LeakyReLU negative slope 0.9 as an alternative to 0.5 for LeakyReLU² activations
- Reports local RTX 5060 validation for the PR #466 stack with slope 0.9
- Compares a baseline relu² model against the PR #466 stack with LeakyReLU(0.9)²
- Applies a sliding-window evaluation correction to the reported validation bpb
- Outlines a planned sweep over multiple negative-slope values on full 8xH100 validation