PR #675

open

Non-record: LeakyReLU² + LAWA + Ramping WD + Val Training (val_bpb=1.2302, 1xH100)

by ChideraIbe123
val_bpb
1.2302
Architecture
Transformer
Optimizer
Artifact Size
13.4 MB

Training Techniques

Architecture
depth
Increased model depth from the 9-layer baseline to 10 layers.
parameters: {"layers":10}
MLP activation
Used LeakyReLU(0.5) squared in the MLP to preserve negative gradient flow.
parameters: {"negative_slope":0.5,"power":2}
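One plausible reading of this activation, assuming it simply squares the output of `leaky_relu` with slope 0.5 (the exact formulation is not spelled out in the PR), is:

```python
import torch
import torch.nn.functional as F

def leaky_relu_squared(x: torch.Tensor, negative_slope: float = 0.5) -> torch.Tensor:
    """LeakyReLU(0.5) squared: square the leaky activation. Unlike ReLU^2,
    negative inputs map to (0.5*x)^2, so their gradient is nonzero and
    information from the negative half-line still flows backward."""
    return F.leaky_relu(x, negative_slope=negative_slope).square()
```

For a negative input like -2.0 this yields (0.5 * -2.0)^2 = 1.0 with a nonzero derivative, which is what "preserve negative gradient flow" refers to.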
Compression
lzma
level: null
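Swapping zlib for lzma (contribution below) can be sketched with the standard library; the serialization format here (pickle of a state dict) is an assumption, not the PR's actual pipeline:

```python
import lzma
import pickle

def compress_artifact(state_dict: dict) -> bytes:
    """Serialize model weights and compress with LZMA, which usually produces
    a smaller artifact than zlib at the same data, at the cost of slower
    compression. level: null above suggests the default preset is used."""
    raw = pickle.dumps(state_dict)
    return lzma.compress(raw)

def load_artifact(blob: bytes) -> dict:
    """Invert compress_artifact: decompress, then unpickle."""
    return pickle.loads(lzma.decompress(blob))
```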
Test-Time Training
validation set training
parameters: null
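A minimal sketch of validation-set training (test-time training), assuming the model returns its LM loss directly from `model(x, y)`; the learning rate, step count, and optimizer here are hypothetical since `parameters: null` gives no details:

```python
import torch

def train_on_validation(model, val_batches, lr=1e-4, steps=1):
    """Hypothetical test-time training loop: take a few gradient steps on the
    validation stream before it is scored. This leaks eval data into training,
    which is why the submission is flagged as a non-record entry."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    model.train()
    for _ in range(steps):
        for x, y in val_batches:
            opt.zero_grad()
            loss = model(x, y)  # assumed interface: forward returns the loss
            loss.backward()
            opt.step()
    return model
```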
Weight Averaging
LAWA
parameters: {"warmdown_checkpoints":"12-13"}
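LAWA (LAtest Weight Averaging) here averages the late warmdown checkpoints (12-13). A minimal sketch, assuming plain uniform averaging of state-dict tensors:

```python
import torch

def lawa_average(state_dicts: list[dict]) -> dict:
    """Uniformly average matching parameter tensors across the selected late
    checkpoints (here, warmdown checkpoints 12 and 13). Averaging weights from
    nearby points on the loss surface typically smooths out SGD noise."""
    avg = {}
    for key in state_dicts[0]:
        avg[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return avg
```

In practice the checkpoints would be loaded from disk (e.g. `torch.load`) and the averaged state dict written back into the model before evaluation.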
Regularization
weight decay
parameters: {"start":0.02,"end":0.08,"schedule":"ramping during warmdown"}
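The ramping schedule can be sketched as a linear interpolation from 0.02 to 0.08 over the warmdown phase; the linear shape is an assumption, since the PR only states start, end, and "ramping during warmdown":

```python
def ramping_weight_decay(step, warmdown_start, total_steps, start=0.02, end=0.08):
    """Hold weight decay at `start` until warmdown begins, then ramp linearly
    to `end` by the final step. Heavier decay late in training shrinks weight
    magnitudes, which helps both pre-quant quality and compressibility."""
    if step < warmdown_start:
        return start
    frac = (step - warmdown_start) / max(1, total_steps - warmdown_start)
    return start + (end - start) * min(1.0, frac)

# Applied each step by writing the value into the optimizer's param groups:
# for group in optimizer.param_groups:
#     group["weight_decay"] = ramping_weight_decay(step, warmdown_start, total_steps)
```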

Novel Contributions

  • Stacking LeakyReLU(0.5)^2, LAWA, ramping weight decay, and validation-set training on the baseline architecture
  • Using lzma compression instead of zlib to improve artifact size
  • Applying ramping weight decay during warmdown to improve both pre-quant quality and compression ratio
  • Exploration and negative-result documentation for recursive transformers, differential attention, value residual learning, entropy-weighted loss, and QAT