PR #1792
openRecord: Polar Express NS + MIN_LR + GatedAttn + Alpha LoRA — val_bpb 1.07006 (3-seed mean)
by renqianluo · View on GitHub
val_bpb
1.0701
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.98MB
Training Techniques
Optimizer
Muon
weight_decay: 1
momentum: null
other_params: {"backend_steps":5}
Architecture
Gated Attention
Adds gated attention to the model stack.
parameters: null
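The record does not describe where the gate sits; a common form, sketched below under that assumption, applies a learned sigmoid gate elementwise to the attention output before the final projection. Module and parameter names here are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedAttention(nn.Module):
    """Self-attention whose output is modulated by a learned sigmoid gate.

    A minimal sketch: the gate is computed from the same input as the
    queries and applied elementwise to the attention output before the
    output projection. Head bookkeeping is simplified.
    """
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.gate = nn.Linear(dim, dim, bias=False)   # per-element gate logits
        self.proj = nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        shape = (B, T, self.n_heads, D // self.n_heads)
        q, k, v = (t.view(shape).transpose(1, 2) for t in (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).reshape(B, T, D)
        # Sigmoid gate: lets the model cheaply suppress the attention
        # output per position and channel.
        return self.proj(torch.sigmoid(self.gate(x)) * out)
```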
weight tying
Not explicitly mentioned in this PR body.
parameters: null
Test-Time Training
LoRA TTT
parameters: {"rank":128,"alpha":144}
Quantization
int8
bits: 8
scope: per-row gate
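A sketch of symmetric per-row int8 quantization, assuming "per-row gate" means one scale per output row applied to the gate weights; the helper names are hypothetical.

```python
import torch

def quantize_int8_per_row(w: torch.Tensor):
    """Symmetric per-row int8 quantization of a 2D weight.

    One scale per output row, so each row uses the full [-127, 127] range.
    """
    scale = w.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / 127.0
    q = torch.round(w / scale).clamp_(-127, 127).to(torch.int8)
    return q, scale

def dequantize_int8_per_row(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale
```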
LR Schedule
warmdown
parameters: {"min_lr":0.1}
Other
other
Polar Express Newton-Schulz coefficients: replaces the fixed Newton-Schulz coefficient tuple with per-iteration minimax-optimal tuples for a better polar-factor approximation.
parameters: {"iterations":5}
other
Tight-budget polish: reduces the GPTQ reserve time and disables periodic validation-loss evaluation to reclaim training time.
parameters: {"gptq_reserve_seconds":0.5,"val_loss_every":0}
other
Runs test-time training in multiple sequential phases.
parameters: {"num_phases":3}
Novel Contributions
- Polar Express Newton-Schulz coefficients with per-iteration minimax-optimal tuples
- MIN_LR=0.10 warmdown floor
- Tight budget polish via reduced GPTQ reserve and disabled periodic validation loss
- Stacking Polar Express NS, Gated Attention, and Alpha LoRA improvements on top of prior submission