PR #1051
WIP: LeakyReLU(0.5)² MLP on 11L EMA + GPTQ-lite stack (`track_10min_16mb`)
by tejas-goyal
val_bpb: 1.2826
Architecture: Transformer
Optimizer: —
Artifact Size: 7,804,166 bytes
Training Techniques
Architecture
MLP3x
Replaces the ReLU-squared MLP activation with LeakyReLU(negative_slope=0.5) followed by square() in the 3x MLP.
parameters: {"negative_slope":0.5}
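The activation swap can be sketched scalar-wise (the actual MLP applies it elementwise to a tensor; `negative_slope=0.5` is the only parameter, per the card):

```python
def leaky_relu_squared(x, negative_slope=0.5):
    # LeakyReLU then square: unlike ReLU-squared, the negative side keeps a
    # scaled response (and gradient) instead of being clamped flat to zero.
    y = x if x >= 0 else negative_slope * x
    return y * y
```

Note that squaring makes the output non-negative on both sides, e.g. an input of -2.0 maps to (-1.0)² = 1.0 rather than 0 as ReLU-squared would give.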
Partial RoPE
Uses partial rotary positional embeddings as part of the parent stack.
parameters: null
XSA
Includes XSA attention modification from the parent stack.
parameters: null
VE128
Includes VE128 component from the parent stack.
parameters: null
SmearGate
Includes SmearGate component from the parent stack.
parameters: null
BigramHash
Includes BigramHash component from the parent stack.
parameters: null
Weight Averaging
EMA
parameters: null
Quantization
GPTQ-lite
bits: 6
scope: model weights
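The card states only the bit width and scope; "GPTQ-lite" itself is not specified here. As a point of reference, a plain symmetric round-to-nearest int6 quantizer (without GPTQ's error-compensating column updates) looks like:

```python
def quantize_int6(weights):
    # Symmetric per-tensor int6: integer codes in [-32, 31].
    # Round-to-nearest sketch only -- GPTQ additionally redistributes each
    # column's rounding error into not-yet-quantized columns.
    qmax = 31
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [max(-32, min(31, round(w / scale))) for w in weights]
    deq = [qi * scale for qi in q]  # dequantized values used at eval time
    return q, deq, scale
```

Six bits per weight is what keeps the ~7.8 MB artifact under the 16 MB track limit.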
Evaluation
sliding window eval
parameters: {"stride":64}
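Sliding-window eval with stride 64 scores each token with long left context by re-running overlapping windows and counting only the new tokens in each pass. A sketch of the chunking (the context window size is an assumption; the card only gives `stride=64`):

```python
def sliding_window_chunks(n_tokens, window=512, stride=64):
    # Returns (ctx_start, ctx_end, n_scored) triples: the first pass scores a
    # full window, every later pass advances by `stride` and scores only the
    # `stride` newly covered tokens, each seeing ~window tokens of context.
    chunks, scored = [], 0
    while scored < n_tokens:
        end = min(scored + (window if scored == 0 else stride), n_tokens)
        start = max(0, end - window)
        chunks.append((start, end, end - scored))
        scored = end
    return chunks
```

A smaller stride gives a tighter (lower) bpb estimate at the cost of proportionally more forward passes.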
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
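A warmdown schedule holds the learning rate flat and then decays it over the final steps of training. The linear-to-zero shape below is an assumption; only `warmdown_steps=3500` is given by the card:

```python
def warmdown_lr(step, total_steps, warmdown_steps=3500, base_lr=1.0):
    # Hold base_lr until `warmdown_steps` steps remain, then decay
    # linearly to 0 at the final step.
    steps_left = total_steps - step
    if steps_left >= warmdown_steps:
        return base_lr
    return base_lr * steps_left / warmdown_steps
```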
Regularization
LN scale
parameters: null
Novel Contributions
- Swaps the parent record's ReLU-squared MLP activation for LeakyReLU(0.5)-squared with no extra parameters.
- Builds on an 11-layer EMA + GPTQ-lite + warmdown3500 + QAT@0.15 stack.
- Provides a WIP submission folder with smoke-run logs and reproducible training/export scripts.
- Uses GPTQ-lite int6 export and sliding-window evaluation for the 10-minute / 16 MB track.