PR #1907
open
Newton-Muon × PR #1874's document-packed loader: a controlled negative result
by GodlyDonuts
val_bpb
1.1071
Architecture
Transformer
Optimizer
Muon
Artifact Size
16 MB each
Training Techniques
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"newton_schulz_preconditioning":true,"hook_based_trigger":true}
Quantization
GPTQ
bits: 6
scope: all
Evaluation
sliding window eval
parameters: null
Test-Time Training
TTT
parameters: {"phased":true}
Other
other
Document-packed loader with variable cu_seqlens per step
parameters: {"document_packing":true,"variable_cu_seqlens":true}
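To illustrate what "variable cu_seqlens per step" means, here is a minimal sketch. It assumes the loader follows the common variable-length attention convention, where cu_seqlens holds the cumulative token offsets of the documents packed into one step. The helper name build_cu_seqlens and the document lengths are hypothetical and are not taken from PR #1874.

```python
from itertools import accumulate


def build_cu_seqlens(doc_lengths):
    """Return cumulative boundaries for documents packed back to back.

    Hypothetical helper: the real loader in PR #1874 may differ.
    """
    if any(n <= 0 for n in doc_lengths):
        raise ValueError("document lengths must be positive")
    # The leading 0 marks the start of the first document; each later entry
    # is the exclusive end offset of one document in the packed token stream.
    return [0] + list(accumulate(doc_lengths))


# Two steps with the same 16-token budget but different document counts.
step_a = build_cu_seqlens([5, 3, 8])
step_b = build_cu_seqlens([4, 4, 2, 6])
print(step_a, step_b)
```

Both steps pack 16 tokens, yet the two lists have different lengths, so any code that depends on the shape of cu_seqlens sees a different input from one step to the next.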
Novel Contributions
- Controlled negative result showing Newton-Muon regresses on PR #1874's document-packed loader
- Root-cause analysis of torch.compile recompilation caused by hook-based integer state in nn.Module
- Demonstration that forward-pre-hook Newton-Schulz preconditioning conflicts with variable-length document packing and fullgraph compilation
- Shipped trained artifacts and logs for reproducible verification of the regression
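
For readers unfamiliar with the Newton-Schulz step named above, the sketch below shows the textbook cubic iteration X ← 1.5·X − 0.5·(X·Xᵀ)·X, which drives a full-rank matrix toward its nearest orthogonal factor. This is an assumption-laden illustration in pure Python: the exact preconditioning used by Newton-Muon, its coefficients, and its hook wiring are not stated in this summary, and the function names here are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def newton_schulz_orthogonalize(g, steps=10):
    """Textbook cubic Newton-Schulz iteration (not the PR's exact variant)."""
    fro = math.sqrt(sum(v * v for row in g for v in row))
    if fro == 0.0:
        raise ValueError("cannot orthogonalize a zero matrix")
    # Scaling by the Frobenius norm bounds every singular value by 1,
    # which keeps the iteration inside its convergence region.
    x = [[v / fro for v in row] for row in g]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)  # (X X^T) X
        x = [[1.5 * a - 0.5 * b for a, b in zip(ra, rb)]
             for ra, rb in zip(x, xxtx)]
    return x


g = [[3.0, 1.0], [1.0, 2.0]]  # illustrative full-rank input
q = newton_schulz_orthogonalize(g)
gram = matmul(q, transpose(q))  # close to the identity once converged
```

The iteration uses only matrix products, which is why this family of methods is attractive inside an optimizer step. Inputs with very small singular values need more steps than the default used here.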