PR #1792
openRecord: Polar Express NS + MIN_LR + GatedAttn + Alpha LoRA — val_bpb 1.07006 (3-seed mean)
by renqianluo · View on GitHub
val_bpb
1.0701
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.98MB
Training Techniques
Optimizer
Muon
weight_decay: 1
momentum: null
other_params: {"backend_steps":5}
Architecture
Gated Attention
Adds gated attention to the model stack.
parameters: null
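The record does not describe where the gate sits; a common form, sketched below under that assumption, applies a learned sigmoid gate elementwise to the attention output before the final projection. Module and parameter names here are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedAttention(nn.Module):
    """Self-attention whose output is modulated by a learned sigmoid gate.

    A minimal sketch: the gate is computed from the same input as the
    queries and applied elementwise to the attention output before the
    output projection. Head bookkeeping is simplified.
    """
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.gate = nn.Linear(dim, dim, bias=False)   # per-element gate logits
        self.proj = nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        shape = (B, T, self.n_heads, D // self.n_heads)
        q, k, v = (t.view(shape).transpose(1, 2) for t in (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).reshape(B, T, D)
        # Sigmoid gate: lets the model cheaply suppress the attention
        # output per position and channel.
        return self.proj(torch.sigmoid(self.gate(x)) * out)
```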
weight tying
Not explicitly mentioned in this PR body.
parameters: null
Test-Time Training
LoRA TTT
parameters: {"rank":128,"alpha":144}
Quantization
int8
bits: 8
scope: per-row gate
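A sketch of symmetric per-row int8 quantization, assuming "per-row gate" means one scale per output row applied to the gate weights; the helper names are hypothetical.

```python
import torch

def quantize_int8_per_row(w: torch.Tensor):
    """Symmetric per-row int8 quantization of a 2D weight.

    One scale per output row, so each row uses the full [-127, 127] range.
    """
    scale = w.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / 127.0
    q = torch.round(w / scale).clamp_(-127, 127).to(torch.int8)
    return q, scale

def dequantize_int8_per_row(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale
```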
LR Schedule
warmdown
parameters: {"min_lr":0.1}
Other
other
Polar Express Newton-Schulz coefficients: replaces the fixed Newton-Schulz coefficient tuple with per-iteration minimax-optimal tuples for a better polar-factor approximation.
parameters: {"iterations":5}
other
Tight-budget polish: reduces the GPTQ reserve time and disables periodic validation-loss evaluation to reclaim training time.
parameters: {"gptq_reserve_seconds":0.5,"val_loss_every":0}
other
Runs test-time training in multiple sequential phases.
parameters: {"num_phases":3}
Novel Contributions
- Polar Express Newton-Schulz coefficients with per-iteration minimax-optimal tuples
- MIN_LR=0.10 warmdown floor
- Tight budget polish via reduced GPTQ reserve and disabled periodic validation loss
- Stacking Polar Express NS, Gated Attention, and Alpha LoRA improvements on top of prior submission