PR #675

open

Non-record: LeakyReLU² + LAWA + Ramping WD + Val Training (val_bpb=1.2302, 1xH100)

by ChideraIbe123
val_bpb
1.2302
Architecture
Transformer
Optimizer
Artifact Size
13.4 MB

Training Techniques

Architecture
depth
Increased model depth from the 9-layer baseline to 10 layers.
parameters: {"layers":10}
MLP activation
Used LeakyReLU(0.5) squared in the MLP to preserve negative gradient flow.
parameters: {"negative_slope":0.5,"power":2}
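One plausible reading of this activation, assuming it simply squares the output of `leaky_relu` with slope 0.5 (the exact formulation is not spelled out in the PR), is:

```python
import torch
import torch.nn.functional as F

def leaky_relu_squared(x: torch.Tensor, negative_slope: float = 0.5) -> torch.Tensor:
    """LeakyReLU(0.5) squared: square the leaky activation. Unlike ReLU^2,
    negative inputs map to (0.5*x)^2, so their gradient is nonzero and
    information from the negative half-line still flows backward."""
    return F.leaky_relu(x, negative_slope=negative_slope).square()
```

For a negative input like -2.0 this yields (0.5 * -2.0)^2 = 1.0 with a nonzero derivative, which is what "preserve negative gradient flow" refers to.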
Compression
lzma
level: null
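Swapping zlib for lzma (contribution below) can be sketched with the standard library; the serialization format here (pickle of a state dict) is an assumption, not the PR's actual pipeline:

```python
import lzma
import pickle

def compress_artifact(state_dict: dict) -> bytes:
    """Serialize model weights and compress with LZMA, which usually produces
    a smaller artifact than zlib at the same data, at the cost of slower
    compression. level: null above suggests the default preset is used."""
    raw = pickle.dumps(state_dict)
    return lzma.compress(raw)

def load_artifact(blob: bytes) -> dict:
    """Invert compress_artifact: decompress, then unpickle."""
    return pickle.loads(lzma.decompress(blob))
```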
Test-Time Training
validation set training
parameters: null
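A minimal sketch of validation-set training (test-time training), assuming the model returns its LM loss directly from `model(x, y)`; the learning rate, step count, and optimizer here are hypothetical since `parameters: null` gives no details:

```python
import torch

def train_on_validation(model, val_batches, lr=1e-4, steps=1):
    """Hypothetical test-time training loop: take a few gradient steps on the
    validation stream before it is scored. This leaks eval data into training,
    which is why the submission is flagged as a non-record entry."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    model.train()
    for _ in range(steps):
        for x, y in val_batches:
            opt.zero_grad()
            loss = model(x, y)  # assumed interface: forward returns the loss
            loss.backward()
            opt.step()
    return model
```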
Weight Averaging
LAWA
parameters: {"warmdown_checkpoints":"12-13"}
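LAWA (LAtest Weight Averaging) here averages the late warmdown checkpoints (12-13). A minimal sketch, assuming plain uniform averaging of state-dict tensors:

```python
import torch

def lawa_average(state_dicts: list[dict]) -> dict:
    """Uniformly average matching parameter tensors across the selected late
    checkpoints (here, warmdown checkpoints 12 and 13). Averaging weights from
    nearby points on the loss surface typically smooths out SGD noise."""
    avg = {}
    for key in state_dicts[0]:
        avg[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return avg
```

In practice the checkpoints would be loaded from disk (e.g. `torch.load`) and the averaged state dict written back into the model before evaluation.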
Regularization
weight decay
parameters: {"start":0.02,"end":0.08,"schedule":"ramping during warmdown"}
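The ramping schedule can be sketched as a linear interpolation from 0.02 to 0.08 over the warmdown phase; the linear shape is an assumption, since the PR only states start, end, and "ramping during warmdown":

```python
def ramping_weight_decay(step, warmdown_start, total_steps, start=0.02, end=0.08):
    """Hold weight decay at `start` until warmdown begins, then ramp linearly
    to `end` by the final step. Heavier decay late in training shrinks weight
    magnitudes, which helps both pre-quant quality and compressibility."""
    if step < warmdown_start:
        return start
    frac = (step - warmdown_start) / max(1, total_steps - warmdown_start)
    return start + (end - start) * min(1.0, frac)

# Applied each step by writing the value into the optimizer's param groups:
# for group in optimizer.param_groups:
#     group["weight_decay"] = ramping_weight_decay(step, warmdown_start, total_steps)
```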

Novel Contributions

  • Stacking LeakyReLU(0.5)^2, LAWA, ramping weight decay, and validation-set training on the baseline architecture
  • Using lzma compression instead of zlib to improve artifact size
  • Applying ramping weight decay during warmdown to improve both pre-quant quality and compression ratio
  • Exploration and negative-result documentation for recursive transformers, differential attention, value residual learning, entropy-weighted loss, and QAT