PR #1907

open

Newton-Muon × PR #1874's document-packed loader: a controlled negative result

by GodlyDonuts
val_bpb
1.1071
Architecture
Transformer
Optimizer
Muon
Artifact Size
16 MB each

Training Techniques

Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"newton_schulz_preconditioning":true,"hook_based_trigger":true}
Quantization
GPTQ
bits: 6
scope: all
Evaluation
sliding window eval
parameters: null
Test-Time Training
TTT
parameters: {"phased":true}
Other
Document-packed loader with variable cu_seqlens per step
parameters: {"document_packing":true,"variable_cu_seqlens":true}
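
As a rough illustration of what "variable cu_seqlens per step" means, the sketch below packs whole documents into one flat token buffer and records the cumulative document boundaries. It is a minimal sketch in plain Python, not the loader from PR #1874: the function name `pack_documents`, the `seq_len` budget, and the stop-at-first-misfit policy are all assumptions made for illustration.

```python
from itertools import accumulate


def pack_documents(docs, seq_len):
    """Pack whole documents into a flat buffer of at most seq_len tokens.

    Returns (tokens, cu_seqlens). Document i occupies
    tokens[cu_seqlens[i]:cu_seqlens[i + 1]].
    Hypothetical policy: stop at the first document that does not fit.
    """
    tokens, lengths = [], []
    for doc in docs:
        if len(tokens) + len(doc) > seq_len:
            break
        tokens.extend(doc)
        lengths.append(len(doc))
    # Cumulative boundaries, starting at 0, one entry per document edge.
    cu_seqlens = [0] + list(accumulate(lengths))
    return tokens, cu_seqlens


# Two steps with the same token budget but different document mixes.
step_a = pack_documents([[1, 2, 3], [4, 5], [6, 7, 8, 9]], seq_len=8)
step_b = pack_documents([[1, 2, 3, 4, 5, 6, 7, 8]], seq_len=8)

print(step_a[1])  # boundaries for two short documents
print(step_b[1])  # boundaries for one long document
```

The two steps produce boundary lists of different lengths, so any code that depends on the number of documents per step sees a different shape from one step to the next.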

Novel Contributions

  • Controlled negative result showing Newton-Muon regresses on PR #1874's document-packed loader
  • Root-cause analysis of torch.compile recompilation caused by hook-based integer state in nn.Module
  • Demonstration that forward-pre-hook Newton-Schulz preconditioning conflicts with variable-length document packing and fullgraph compilation
  • Shipped trained artifacts and logs for reproducible verification of the regression
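
The recompilation issue named in the contributions can be pictured with a toy model. The sketch below is plain Python and does not call torch.compile. It only mimics a compiler that keeps one specialization per distinct value of the state it guards on. `ToyCompiler`, `EVERY`, and `STEPS` are hypothetical names, and real compilers add recompile limits and other mechanisms that are not modeled here.

```python
class ToyCompiler:
    """Toy stand-in for a compiler that specializes on guarded values.

    Each previously unseen guarded value counts as one (re)compilation.
    """

    def __init__(self):
        self.seen = set()
        self.compiles = 0

    def call(self, fn, x, guarded):
        if guarded not in self.seen:
            self.seen.add(guarded)
            self.compiles += 1  # cache miss: simulate a recompilation
        return fn(x)


def layer(x):
    # Placeholder for the compiled forward pass.
    return 2 * x


EVERY = 4   # hypothetical: run the preconditioning every 4 steps
STEPS = 12

# Variant A: the raw integer step counter is part of the guarded state,
# as it would be if a hook incremented an int stored on the module.
counter_guarded = ToyCompiler()
for step in range(STEPS):
    counter_guarded.call(layer, 1.0, guarded=step)

# Variant B: the counter lives outside the guarded state and only the
# boolean "trigger now?" decision is visible to the compiler.
flag_guarded = ToyCompiler()
for step in range(STEPS):
    flag_guarded.call(layer, 1.0, guarded=(step % EVERY == 0))

print(counter_guarded.compiles, flag_guarded.compiles)
```

In this toy setting the counter-guarded variant specializes once per step, while the flag-guarded variant settles after two specializations. Whether the same restructuring would remove the regression reported in this PR is not something the sketch can show.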