PR #1031
Status: open
Record: MTP-2 Funnel + LeakyReLU(0.75)² + Legal TTT + Parallel Muon
by michaelwinczuk
val_bpb
1.1185
Architecture
Transformer
Optimizer
Parallel Muon
Artifact Size
15.95MB
Training Techniques
Architecture
LeakyReLU
Raised the negative slope of the MLP's LeakyReLU activation from 0.5 to 0.75; the activation output is then squared (LeakyReLU(0.75)²).
parameters: {"negative_slope":0.75}
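A minimal scalar sketch of the squared activation, assuming the square is applied elementwise to the LeakyReLU output (the PR title's "LeakyReLU(0.75)²" suggests this, but the PR's actual code is not shown here). The function name `leaky_relu_squared` is hypothetical.

```python
def leaky_relu_squared(x: float, negative_slope: float = 0.75) -> float:
    """LeakyReLU followed by an elementwise square (assumed ordering)."""
    # LeakyReLU: identity for x >= 0, scaled by negative_slope otherwise.
    y = x if x >= 0.0 else negative_slope * x
    # Squaring makes the output non-negative; negative inputs still
    # contribute, attenuated by negative_slope ** 2.
    return y * y


if __name__ == "__main__":
    for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
        print(x, leaky_relu_squared(x))
```

Under this reading, raising the slope from 0.5 to 0.75 changes the weight on negative pre-activations after squaring from 0.25 to 0.5625.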
MTP
Added multi-token prediction auxiliary heads to predict 2 tokens ahead during training; heads are discarded at export.
parameters: {"num_heads":2,"loss_weight":0.1}
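A rough sketch of how an auxiliary MTP loss can be folded into the main objective. It assumes the auxiliary heads' losses are averaged and then scaled by `loss_weight`; the PR does not say whether they are averaged or summed, and the helper names `mtp_targets` and `combined_loss` are hypothetical.

```python
import math


def cross_entropy(logits, target):
    """Negative log-likelihood (nats) of `target` under softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]


def mtp_targets(tokens, offset):
    """(position, target) pairs where the target lies `offset` tokens ahead.

    offset=1 is ordinary next-token prediction; larger offsets are the
    kind of targets an auxiliary head would see.
    """
    return [(t, tokens[t + offset]) for t in range(len(tokens) - offset)]


def combined_loss(main_loss, aux_losses, loss_weight=0.1):
    # Assumption: auxiliary losses are averaged, then scaled by loss_weight.
    if not aux_losses:
        return main_loss
    return main_loss + loss_weight * sum(aux_losses) / len(aux_losses)


if __name__ == "__main__":
    print(mtp_targets([5, 6, 7, 8], offset=2))
    print(combined_loss(2.0, [3.0, 5.0]))
```

Because the heads are discarded at export, they only influence the trunk's weights during training and add nothing to the artifact size.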
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: {"MATRIX_LR":0.027}
LR Schedule
warmdown
parameters: {"warmdown_steps":3700}
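One way to picture the warmdown is as a learning-rate multiplier that stays at 1.0 and then decays over the final `warmdown_steps` iterations. The sketch below assumes a linear, step-based decay to zero; the PR does not state the decay shape, and `total_steps` is a hypothetical value chosen only for illustration.

```python
def lr_multiplier(step: int, total_steps: int, warmdown_steps: int = 3700) -> float:
    """Constant 1.0, then linear decay to 0.0 over the last warmdown_steps."""
    warmdown_start = max(total_steps - warmdown_steps, 0)
    if step < warmdown_start:
        return 1.0
    # Fraction of the warmdown that is still remaining.
    return max(total_steps - step, 0) / warmdown_steps


if __name__ == "__main__":
    MATRIX_LR = 0.027     # value reported in the PR
    total_steps = 10_000  # hypothetical run length
    for step in (0, 6300, 8150, 10_000):
        print(step, MATRIX_LR * lr_multiplier(step, total_steps))
```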
Test-Time Training
Legal TTT
parameters: null
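The PR does not define "legal" here. One common reading is that every chunk of evaluation data is scored before the model adapts on it, so no token is ever predicted by weights that have already seen it. The toy sketch below illustrates only that ordering, with an add-one-smoothed unigram counter standing in for the real model; the function name and the counting model are assumptions, not the PR's method.

```python
import math
from collections import Counter


def score_then_adapt(tokens, vocab_size, chunk_size):
    """Average NLL (nats/token) when each chunk is scored before adapting on it."""
    counts = Counter()
    seen = 0
    total_nll = 0.0
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        # 1) Score the chunk with a state that has never seen it.
        for tok in chunk:
            p = (counts[tok] + 1) / (seen + vocab_size)  # add-one smoothing
            total_nll -= math.log(p)
        # 2) Only afterwards adapt on the chunk that was just scored.
        counts.update(chunk)
        seen += len(chunk)
    return total_nll / len(tokens)


if __name__ == "__main__":
    data = [0, 0, 1, 0, 0, 0, 1, 0]
    print(score_then_adapt(data, vocab_size=2, chunk_size=2))
```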
Evaluation
sliding window eval
parameters: null
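A sketch of how a sliding-window evaluation can be laid out so that every token is scored exactly once while still seeing preceding context. The `window` and `stride` values are hypothetical (the PR lists no parameters), and `bits_per_byte` assumes the metric is total negative log-likelihood in nats, converted to bits and divided by the byte count of the evaluated text.

```python
import math


def sliding_windows(num_tokens, window, stride):
    """Return (ctx_start, score_start, end) spans, assuming stride <= window.

    Tokens in [score_start, end) are scored; tokens in
    [ctx_start, score_start) are fed only as context.
    """
    spans = []
    score_start = 0
    while score_start < num_tokens:
        end = min(score_start + stride, num_tokens)
        ctx_start = max(end - window, 0)
        spans.append((ctx_start, score_start, end))
        score_start = end
    return spans


def bits_per_byte(total_nll_nats, num_bytes):
    """Convert summed NLL in nats to bits and normalise by byte count."""
    return total_nll_nats / (math.log(2) * num_bytes)


if __name__ == "__main__":
    for span in sliding_windows(num_tokens=10, window=4, stride=2):
        print(span)
```

A smaller stride gives each scored token more context at the cost of more forward passes.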
Novel Contributions
- Added multi-token prediction (MTP) auxiliary training signal with 2 heads
- Reduced MTP loss weight to 0.1 to avoid overpowering the main CE loss
- Increased LeakyReLU negative slope from 0.5 to 0.75
- Tuned MATRIX_LR from 0.025 to 0.027
- Extended warmdown from 3500 to 3700 iterations
- Used legal test-time training and sliding-window evaluation