PR #1256
Status: open
Non-record: LeakyReLU(0.5)^2 on SmearGate + BigramHash + Int6 stack (1.1444 bpb)
by oidebrett
val_bpb
1.1444
Architecture
Transformer
Optimizer
Muon
Artifact Size
~16.0 MB
Training Techniques
Architecture
SmearGate
Uses SmearGate in the model stack.
parameters: null
BigramHash
Adds BigramHash embeddings.
parameters: {"dimensions":128,"hash_size":4096}
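A minimal sketch of what a BigramHash embedding lookup could look like. The dimensions (128) and table size (4096) come from the parameters above; the hashing scheme and mixing constant are illustrative assumptions, not taken from the PR.

```python
def bigram_hash(prev_tok: int, cur_tok: int, hash_size: int = 4096) -> int:
    # Hash the (previous, current) token pair into one of hash_size
    # buckets; each bucket would index a learned 128-dim embedding that
    # is added to the usual token embedding. The multiplier 1000003 is
    # an arbitrary mixing prime, not a value from the PR.
    return (prev_tok * 1000003 + cur_tok) % hash_size
```

The table lets the model memorize frequent token bigrams cheaply, at the cost of hash collisions between rare pairs.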
MLP3x
Uses a 3x MLP expansion.
parameters: null
LeakyReLU
Uses LeakyReLU(0.5)^2 instead of ReLU^2 in the MLP activation.
parameters: {"negative_slope":0.5}
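A minimal sketch of the activation, assuming it is the plain square of LeakyReLU with slope 0.5 (the PR does not show the exact formulation). Note that squaring makes the output non-negative, so negative inputs contribute 0.25·x² rather than 0 as in ReLU².

```python
def leaky_relu_sq(x: float, negative_slope: float = 0.5) -> float:
    # LeakyReLU(0.5) followed by squaring: positive inputs give x^2,
    # negative inputs give (0.5 * x)^2 = 0.25 * x^2 instead of the
    # hard zero that ReLU^2 would produce.
    y = x if x >= 0.0 else negative_slope * x
    return y * y
```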
Quantization
QAT
bits: 6
scope: all
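A minimal sketch of the forward pass of 6-bit quantization-aware training, assuming a symmetric per-tensor scheme (the PR does not state the exact quantizer). In real QAT the backward pass would use a straight-through estimator, which is not shown here.

```python
def fake_quant(w: list[float], bits: int = 6) -> list[float]:
    # Symmetric per-tensor fake quantization: snap each value to a
    # signed 6-bit grid (levels -31..31) in the forward pass.
    qmax = 2 ** (bits - 1) - 1            # 31 for 6 bits
    amax = max(abs(x) for x in w)
    scale = amax / qmax if amax > 0 else 1.0
    return [round(x / scale) * scale for x in w]
```

With `scope: all`, a scheme like this would be applied to every weight tensor, which is what makes the ~16 MB artifact size possible.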
Weight Averaging
SWA
parameters: {"checkpoints_averaged":30}
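A minimal sketch of the checkpoint averaging step, assuming a uniform (equal-weight) average over the tail checkpoints; the PR's parameters say 30 checkpoints are averaged.

```python
def average_checkpoints(checkpoints: list[dict]) -> dict:
    # Stochastic Weight Averaging: take the element-wise mean of each
    # parameter across the saved checkpoints (here plain floats stand
    # in for weight tensors).
    n = len(checkpoints)
    return {k: sum(c[k] for c in checkpoints) / n for k in checkpoints[0]}
```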
Compression
zstd
level: 22
Initialization
OrthoInit
Orthogonal initialization.
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: null
Evaluation
sliding window eval
parameters: {"stride":64}
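A minimal sketch of how sliding-window evaluation with stride 64 could partition a token stream. The window length (here tied to the 2048 training length) and the convention of scoring only tokens not covered by a previous window are assumptions; only the stride comes from the parameters above.

```python
def sliding_windows(n_tokens: int, max_length: int = 2048, stride: int = 64):
    # Feed overlapping windows of up to max_length tokens, advancing
    # stride tokens per step; loss is computed only on the tokens each
    # window sees for the first time, so every token gets near-full
    # left context.
    spans = []          # (window_start, window_end, n_new_targets)
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + max_length, n_tokens)
        spans.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```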
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Novel Contributions
- Replaced ReLU^2 with LeakyReLU(0.5)^2 in the MLP activation
- Built on the SmearGate + BigramHash + Int6 QAT + SWA stack
- Reported a small improvement over the base stack (val_bpb 1.1459 → 1.1444)