PR #1031

open

Record: MTP-2 Funnel + LeakyReLU(0.75)² + Legal TTT + Parallel Muon

by michaelwinczuk
val_bpb: 1.1185
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 15.95MB

Training Techniques

Architecture
LeakyReLU
Changed the MLP activation's negative slope from 0.5 to 0.75 and squared the activation output.
parameters: {"negative_slope":0.75}
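As a rough illustration of the activation described above, the sketch below applies a leaky ReLU with negative slope 0.75 and then squares the result. The function name and the scalar, pure-Python form are assumptions made for illustration. The PR summary does not say whether the sign of negative inputs is preserved after squaring, so this version uses a plain square.

```python
def leaky_relu_squared(x: float, negative_slope: float = 0.75) -> float:
    """Leaky ReLU followed by squaring (hypothetical helper, not the PR's code)."""
    # Leaky ReLU: pass positive inputs through, scale negative inputs by the slope.
    y = x if x >= 0.0 else negative_slope * x
    # Plain square; assumption: the sign of negative inputs is not restored.
    return y * y


if __name__ == "__main__":
    for value in (-2.0, -0.5, 0.0, 1.0, 2.0):
        print(value, leaky_relu_squared(value))
```

With a plain square, negative inputs also produce positive outputs, only scaled by 0.75² relative to a positive input of the same magnitude.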
MTP
Added multi-token prediction auxiliary heads to predict 2 tokens ahead during training; heads are discarded at export.
parameters: {"num_heads":2,"loss_weight":0.1}
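One way to read the MTP entry above is as an auxiliary loss added to the main next-token cross-entropy with weight 0.1. The sketch below is a minimal pure-Python version under several assumptions that the summary does not confirm: auxiliary head k (0-based) predicts the token k + 2 positions ahead, the auxiliary losses are averaged across heads, and the names `cross_entropy` and `mtp_loss` are hypothetical.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy in nats for one position, via a stable log-sum-exp."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]


def mtp_loss(main_logits, aux_logits_per_head, tokens, loss_weight=0.1):
    """Main CE plus weighted auxiliary multi-token-prediction CE.

    main_logits[t] predicts tokens[t + 1].
    aux_logits_per_head[k][t] predicts tokens[t + 2 + k] (assumed offsets).
    """
    n = len(tokens)
    main_terms = [cross_entropy(main_logits[t], tokens[t + 1]) for t in range(n - 1)]
    main = sum(main_terms) / len(main_terms)

    head_losses = []
    for k, head_logits in enumerate(aux_logits_per_head):
        offset = k + 2
        terms = [cross_entropy(head_logits[t], tokens[t + offset])
                 for t in range(n - offset)]
        if terms:
            head_losses.append(sum(terms) / len(terms))
    # Assumption: auxiliary heads are averaged before applying the weight.
    aux = sum(head_losses) / len(head_losses) if head_losses else 0.0
    return main + loss_weight * aux


if __name__ == "__main__":
    vocab, tokens = 4, [0, 1, 2, 3, 0]
    uniform = [[0.0] * vocab for _ in tokens]
    print(mtp_loss(uniform, [uniform, uniform], tokens))
```

Because the auxiliary heads only contribute to the training loss, dropping them at export leaves the main prediction path unchanged, which is consistent with the entry's note that the heads are discarded.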
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: {"MATRIX_LR":0.027}
LR Schedule
warmdown
parameters: {"warmdown_steps":3700}
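The warmdown setting above can be sketched as a learning-rate multiplier that stays at 1.0 and then decays over the final 3700 steps. The linear shape of the decay, the function name `lr_scale`, and the total step count used in the example are all assumptions; the summary only gives the warmdown length and MATRIX_LR.

```python
def lr_scale(step: int, total_steps: int, warmdown_steps: int = 3700) -> float:
    """Multiplier for the base learning rate (hypothetical linear warmdown)."""
    warmdown_start = total_steps - warmdown_steps
    if step < warmdown_start:
        return 1.0  # constant phase before the warmdown begins
    # Linear decay from 1.0 at warmdown_start to 0.0 at total_steps.
    return max(0.0, (total_steps - step) / warmdown_steps)


if __name__ == "__main__":
    MATRIX_LR = 0.027      # value reported in the summary
    TOTAL_STEPS = 10_000   # hypothetical; the summary does not state it
    for step in (0, 6300, 8150, 10_000):
        print(step, MATRIX_LR * lr_scale(step, TOTAL_STEPS))
```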
Test-Time Training
Legal TTT
parameters: null
Evaluation
sliding window eval
parameters: null
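Sliding window evaluation is commonly implemented by scoring each token exactly once while giving it as much preceding context as the window allows. The sketch below only computes the index spans for such a pass; the window and stride values, the function name, and the choice to score fixed-size chunks are assumptions, since the summary lists no parameters for this step.

```python
def sliding_windows(n_tokens: int, window: int, stride: int):
    """Return (ctx_start, score_start, end) spans covering every token once.

    The model would be run on tokens[ctx_start:end], and only the
    predictions for tokens[score_start:end] would count toward the metric.
    """
    if not 0 < stride <= window:
        raise ValueError("stride must be in 1..window")
    spans = []
    score_start = 0
    while score_start < n_tokens:
        end = min(score_start + stride, n_tokens)
        ctx_start = max(0, end - window)  # reuse earlier tokens as context
        spans.append((ctx_start, score_start, end))
        score_start = end
    return spans


if __name__ == "__main__":
    for span in sliding_windows(n_tokens=10, window=4, stride=2):
        print(span)
```

A smaller stride gives each scored token more context at the cost of more forward passes, which is the usual trade-off when this kind of evaluation is used to report bits per byte.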

Novel Contributions

  • Added multi-token prediction (MTP) auxiliary training signal with 2 heads
  • Reduced MTP loss weight to 0.1 to avoid overpowering the main CE loss
  • Increased LeakyReLU negative slope from 0.5 to 0.75
  • Tuned MATRIX_LR from 0.025 to 0.027
  • Extended warmdown from 3500 to 3700 iterations
  • Used legal test-time training and sliding window evaluation