PR #1031

open

Record: MTP-2 Funnel + LeakyReLU(0.75)² + Legal TTT + Parallel Muon

by michaelwinczuk

val_bpb

1.1185

Architecture

Transformer

Optimizer

Parallel Muon

Artifact Size

15.95MB

Training Techniques

Architecture

LeakyReLU

Raised the negative slope of the MLP's LeakyReLU activation from 0.5 to 0.75 and squared the activation output.

parameters: {"negative_slope":0.75}
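As a rough illustration of the activation described above, here is a minimal scalar sketch. It assumes the square is applied elementwise to the LeakyReLU output, so negative inputs map to positive values; the function name is hypothetical and the PR's actual implementation may differ.

```python
def leaky_relu_squared(x: float, negative_slope: float = 0.75) -> float:
    """LeakyReLU followed by an elementwise square (assumed ordering)."""
    # Standard LeakyReLU: identity for x >= 0, scaled by negative_slope otherwise.
    y = x if x >= 0.0 else negative_slope * x
    # Squaring discards the sign, so the output is always non-negative.
    return y * y


if __name__ == "__main__":
    for value in (-2.0, -1.0, 0.0, 1.0, 2.0):
        print(value, leaky_relu_squared(value))
```

With a slope of 0.75 the negative branch contributes 0.5625·x² rather than the 0.25·x² it would contribute at slope 0.5, so negative pre-activations carry more signal through the MLP.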

MTP

Added multi-token prediction (MTP) auxiliary heads that predict 2 tokens ahead during training; the heads are discarded at export, so they add nothing to the artifact size.

parameters: {"num_heads":2,"loss_weight":0.1}
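The following sketch shows one way auxiliary MTP targets and the weighted loss could be formed. Everything beyond num_heads and loss_weight is an assumption: the offset convention (head k predicts the token k positions past the usual next-token target), the averaging of auxiliary losses, and both function names are hypothetical and not taken from the PR.

```python
from typing import Dict, List, Sequence


def mtp_targets(tokens: Sequence[int], num_heads: int = 2) -> Dict[int, List[int]]:
    """Targets per auxiliary head, aligned so targets[k][t] pairs with input position t.

    Assumed convention: the main head predicts tokens[t + 1], and auxiliary
    head k (1-based) predicts tokens[t + 1 + k].
    """
    targets = {}
    for k in range(1, num_heads + 1):
        offset = 1 + k
        targets[k] = list(tokens[offset:])
    return targets


def combined_loss(main_loss: float, aux_losses: Sequence[float],
                  loss_weight: float = 0.1) -> float:
    """Main loss plus the weighted mean of the auxiliary losses (assumed reduction)."""
    if not aux_losses:
        return main_loss
    return main_loss + loss_weight * sum(aux_losses) / len(aux_losses)


if __name__ == "__main__":
    print(mtp_targets([10, 11, 12, 13, 14]))
    print(combined_loss(2.0, [3.0, 5.0]))
```

Because the auxiliary heads only shape the training gradient, dropping them at export leaves the next-token predictions of the main head unchanged.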

Optimizer

Parallel Muon

weight_decay: null

momentum: null

other_params: {"MATRIX_LR":0.027}

LR Schedule

warmdown

parameters: {"warmdown_steps":3700}
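A warmdown schedule is commonly implemented as a constant learning rate followed by a linear decay to zero over the final steps. The sketch below assumes that shape and assumes it applies to the MATRIX_LR value listed above; total_steps and the function name are hypothetical, since the PR summary does not state the run length or the decay curve.

```python
def lr_at_step(step: int, total_steps: int,
               base_lr: float = 0.027, warmdown_steps: int = 3700) -> float:
    """Constant LR, then linear decay to zero over the last warmdown_steps (assumed shape)."""
    warmdown_start = max(total_steps - warmdown_steps, 0)
    if step < warmdown_start:
        return base_lr
    # Fraction of the warmdown window still remaining, clamped at zero.
    remaining = max(total_steps - step, 0)
    return base_lr * remaining / warmdown_steps


if __name__ == "__main__":
    total = 10_000  # hypothetical run length, not from the PR
    for s in (0, 6300, 8150, 10_000):
        print(s, lr_at_step(s, total))
```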

Test-Time Training

Legal TTT

parameters: null

Evaluation

sliding window eval

parameters: null
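Since no parameters are listed for the sliding window eval, the sketch below only illustrates the general idea: overlapping context windows advance by a fixed stride, and each token is scored exactly once, with as much preceding context as the window allows. The names seq_len and stride, and the function itself, are hypothetical.

```python
from typing import List, Tuple


def sliding_windows(num_tokens: int, seq_len: int, stride: int) -> List[Tuple[int, int, int]]:
    """Return (window_start, window_end, score_start) triples.

    The model would see tokens[window_start:window_end] but only the positions
    in [score_start, window_end) count toward the metric.
    """
    if stride <= 0 or stride > seq_len:
        raise ValueError("stride must satisfy 0 < stride <= seq_len")
    windows = []
    scored_upto = 0
    start = 0
    while scored_upto < num_tokens:
        end = min(start + seq_len, num_tokens)
        windows.append((start, end, scored_upto))
        scored_upto = end
        start += stride
    return windows


if __name__ == "__main__":
    for window in sliding_windows(num_tokens=10, seq_len=4, stride=2):
        print(window)
```

A smaller stride gives each scored token more context at the cost of more forward passes, which is the usual trade-off when this style of evaluation is used to report val_bpb.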