PR #675
Non-record (open): LeakyReLU² + LAWA + Ramping WD + Val Training (val_bpb=1.2302, 1xH100)
by ChideraIbe123
val_bpb: 1.2302
Architecture: Transformer
Optimizer: —
Artifact Size: 13.4 MB
Training Techniques
Architecture / depth
Increased model depth from the 9-layer baseline to 10 layers.
parameters: {"layers":10}
MLP activation
Used LeakyReLU(0.5) squared in the MLP to preserve negative gradient flow.
parameters: {"negative_slope":0.5,"power":2}
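A minimal NumPy sketch of this activation (function name hypothetical): squaring the LeakyReLU output keeps a nonzero gradient on the negative branch, which plain ReLU² zeroes out.

```python
import numpy as np

def leaky_relu_squared(x, negative_slope=0.5):
    # LeakyReLU keeps a scaled copy of the negative branch.
    y = np.where(x >= 0, x, negative_slope * x)
    # Squaring (power=2) as in ReLU^2-style MLPs; the gradient
    # 2 * y * negative_slope stays nonzero for x < 0.
    return y ** 2
```

Note that the square makes all outputs non-negative; it is the gradient, not the output, that retains the negative-side information.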
Compression / lzma
parameters: {"level":null}
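The zlib → lzma swap needs only the standard library; the payload below is a stand-in for the real weight artifact, not the PR's actual data.

```python
import lzma
import zlib

# Stand-in payload; the real artifact is the serialized model weights.
payload = b"nanogpt weights " * 100_000

lz = lzma.compress(payload, preset=9)   # LZMA2, maximum preset
zl = zlib.compress(payload, 9)          # DEFLATE, maximum level

# Both codecs are lossless; lzma's much larger dictionary (vs
# DEFLATE's 32 KB window) usually wins on large, redundant inputs.
assert lzma.decompress(lz) == payload
```

lzma trades slower compression time for a smaller artifact, which is the relevant metric here.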
Test-Time Training / validation set training
parameters: null
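No parameters are recorded for this step. As a hedged illustration only, test-time training here means taking a few extra gradient steps on the validation split before saving the final weights; the toy linear-regression setup below (all names and hyperparameters hypothetical) shows the mechanic.

```python
import numpy as np

def finetune_on_val(w, X_val, y_val, lr=0.1, steps=100):
    # A few extra SGD steps on the validation split itself,
    # minimizing mean squared error of the linear model X @ w.
    for _ in range(steps):
        grad = 2 * X_val.T @ (X_val @ w - y_val) / len(y_val)
        w = w - lr * grad
    return w

# Toy example: the weights move toward the val-set targets.
w_final = finetune_on_val(np.zeros(2), np.eye(2), np.array([1.0, 2.0]))
```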
Weight Averaging / LAWA
parameters: {"warmdown_checkpoints":"12-13"}
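LAWA averages the final warmdown checkpoints (12-13 here) into a single set of weights. A minimal sketch, assuming checkpoints are plain dicts of parameter name to array (the representation is hypothetical):

```python
import numpy as np

def lawa_average(checkpoints):
    # Uniform average of each parameter tensor across the
    # selected warmdown checkpoints.
    return {
        name: np.mean([ckpt[name] for ckpt in checkpoints], axis=0)
        for name in checkpoints[0]
    }

# e.g. averaging two checkpoints parameter-by-parameter
ckpts = [{"w": np.array([0.0, 2.0])}, {"w": np.array([2.0, 4.0])}]
avg = lawa_average(ckpts)
```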
Regularization / weight decay
parameters: {"start":0.02,"end":0.08,"schedule":"ramping during warmdown"}
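The ramp can be read as: hold weight decay at 0.02 until warmdown begins, then interpolate linearly to 0.08 by the final step. A sketch of such a schedule (step counts hypothetical; the PR only fixes the 0.02 to 0.08 endpoints):

```python
def ramped_weight_decay(step, warmdown_start, total_steps,
                        start=0.02, end=0.08):
    # Constant weight decay before warmdown, then a linear ramp
    # from `start` to `end` over the warmdown phase.
    if step < warmdown_start:
        return start
    frac = (step - warmdown_start) / max(1, total_steps - warmdown_start)
    return start + frac * (end - start)
```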
Novel Contributions
- Stacking LeakyReLU(0.5)^2, LAWA, ramping weight decay, and validation-set training on the baseline architecture
- Using lzma compression instead of zlib to reduce artifact size
- Applying ramping weight decay during warmdown to improve both pre-quant quality and compression ratio
- Exploration and negative-result documentation for recursive transformers, differential attention, value residual learning, entropy-weighted loss, and QAT