PR #977

open

LeakyReLU(0.75)² + Legal TTT + Parallel Muon — 1.1185 BPB (3-seed mean)

by michaelwinczuk

val_bpb

1.1185

Architecture

Transformer

Optimizer

Parallel Muon

Artifact Size

15.96MB

Training Techniques

Architecture

LeakyReLU

Changed the negative slope of the MLP's LeakyReLU activation from 0.5 to 0.75; the activation output is then squared.

parameters: {"negative_slope":0.75}
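A minimal scalar sketch of the activation described above. It assumes the square is applied after the LeakyReLU, so negative inputs map to small positive outputs; the function name `leaky_relu_squared` is hypothetical and not taken from the PR.

```python
def leaky_relu_squared(x: float, negative_slope: float = 0.75) -> float:
    """LeakyReLU followed by squaring (assumed order of operations)."""
    # Standard LeakyReLU: identity for x >= 0, scaled by negative_slope otherwise.
    y = x if x >= 0.0 else negative_slope * x
    # Squaring makes the output non-negative on both sides of zero.
    return y * y


if __name__ == "__main__":
    for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
        print(x, leaky_relu_squared(x), leaky_relu_squared(x, negative_slope=0.5))
```

Under this reading, raising the slope from 0.5 to 0.75 increases the response to negative pre-activations by a factor of (0.75 / 0.5)² = 2.25, while positive inputs are unaffected.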

Optimizer

Parallel Muon

weight_decay: null

momentum: null

other_params: {"matrix_lr":0.027}

Test-Time Training

full TTT

parameters: {"legal":true}

LR Schedule

warmdown

parameters: {"warmdown_iters":3700}
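One common reading of a warmdown schedule is a constant learning rate followed by a linear decay to zero over the final `warmdown_iters` steps. The sketch below assumes that shape; the total iteration count is not given in this summary, so `total_iters` is a hypothetical argument, as is the function name.

```python
def warmdown_lr_scale(step: int, total_iters: int, warmdown_iters: int = 3700) -> float:
    """Multiplier applied to the base learning rate at a given step.

    Assumed shape: 1.0 until the last `warmdown_iters` steps, then a
    linear ramp down to 0.0 at `total_iters`.
    """
    warmdown_start = max(total_iters - warmdown_iters, 0)
    if step < warmdown_start:
        return 1.0
    # Steps left until the end of training, clamped at zero.
    remaining = max(total_iters - step, 0)
    return remaining / warmdown_iters


if __name__ == "__main__":
    matrix_lr = 0.027      # from other_params above
    total_iters = 10_000   # hypothetical; not stated in the summary
    for step in (0, 6300, 8150, 10_000):
        print(step, matrix_lr * warmdown_lr_scale(step, total_iters))
```

If `total_iters` is smaller than `warmdown_iters`, this sketch starts training already partway down the ramp, which may or may not match the actual run.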

Evaluation

sliding window eval

parameters: null
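The summary lists no parameters for the sliding window evaluation. As an illustration only, the sketch below shows a typical scheme: overlapping windows advance by a fixed stride, and each window scores only the tokens that no earlier window has scored, so every token is counted once with as much left context as the window allows. The names `sliding_windows`, `window`, and `stride` are hypothetical.

```python
def sliding_windows(num_tokens: int, window: int, stride: int) -> list[tuple[int, int, int]]:
    """Return (start, end, score_from) triples covering `num_tokens` tokens.

    The model would see tokens [start, end) and the loss would be counted
    only on [score_from, end), so no token is scored twice.
    """
    if not 0 < stride <= window:
        raise ValueError("stride must satisfy 0 < stride <= window")
    spans = []
    scored_upto = 0  # first token index not yet scored
    start = 0
    while scored_upto < num_tokens:
        end = min(start + window, num_tokens)
        spans.append((start, end, scored_upto))
        scored_upto = end
        start += stride
    return spans


if __name__ == "__main__":
    for span in sliding_windows(num_tokens=10, window=4, stride=2):
        print(span)
```

A smaller stride gives each scored token more context at the cost of more forward passes; the actual window and stride used in this PR are not stated here.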