PR #1124

open

Record: 1.1194 BPB — v9 Batched Muon + Full GPTQ Random Calib + JEPA Research

by NewyorkDevView on GitHub
val_bpb
1.1194
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.90 MB

Training Techniques

Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"batched_newton_schulz_orthogonalization":true,"torch_bmm":true,"shape_matched_batches":4,"weight_matrices_grouped":66}
Quantization
GPTQ
bits: null
scope: full model
Other
other
Random token calibration for GPTQ to collect Hessians without training data access
parameters: null
Architecture
XSA
Uses XSA across all layers
parameters: {"last_n":11}
Evaluation
sliding window eval
parameters: {"stride":64}
Regularization
label smoothing
parameters: {"value":0}
Test-Time Training
score-first TTT
parameters: null
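
The sliding window eval entry above (stride 64) can be illustrated with a small sketch of how scoring spans are usually laid out. This is not the PR's evaluation code: the function name `sliding_window_spans` and the `context_len` value are hypothetical, and the PR does not state its context length. The sketch assumes the common scheme where each window advances by the stride and only tokens not yet scored contribute to the loss.

```python
def sliding_window_spans(n_tokens, context_len, stride=64):
    """Return (begin, end, score_from) triples for sliding-window evaluation.

    The model sees tokens [begin, end); only tokens [score_from, end) are
    scored, so every token is counted exactly once.
    """
    if not 0 < stride <= context_len:
        raise ValueError("stride must be in (0, context_len]")
    spans = []
    prev_end = 0  # first token index that has not been scored yet
    for begin in range(0, n_tokens, stride):
        end = min(begin + context_len, n_tokens)
        spans.append((begin, end, prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

With this layout, every token after the first window is scored with at least `context_len - stride` tokens of preceding context. A smaller stride gives more context per scored token at the cost of more forward passes.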

Novel Contributions

  • Batched Newton-Schulz orthogonalization via torch.bmm for Muon speedup
  • Full-model GPTQ calibrated on random tokens, collecting Hessians without training data access
  • JEPA/STP ablations showing that auxiliary losses hurt at this scale
  • Discovery and fix of a label smoothing evaluation bug
  • Finding that score-first TTT is ineffective when XSA covers all layers
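
The first contribution, batched Newton-Schulz orthogonalization via torch.bmm, can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's implementation: the function names are hypothetical, and the quintic coefficients and step count are taken from the publicly available Muon reference code rather than from this PR. Gradients of the same shape are stacked so that one `torch.bmm` call replaces one matmul per weight matrix.

```python
from collections import defaultdict

import torch


def newton_schulz_batched(G, steps=5, eps=1e-7):
    """Approximately orthogonalize every matrix in a (B, m, n) stack."""
    # Assumed coefficients: the quintic iteration used by the public Muon code.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.float()
    transposed = X.size(-2) > X.size(-1)
    if transposed:  # iterate on the wide orientation so X @ X^T is the small product
        X = X.transpose(-2, -1)
    # Scale each matrix by its Frobenius norm so singular values are at most 1.
    X = X / (torch.linalg.matrix_norm(X, keepdim=True) + eps)
    for _ in range(steps):
        A = torch.bmm(X, X.transpose(-2, -1))
        B = b * A + c * torch.bmm(A, A)
        X = a * X + torch.bmm(B, X)
    if transposed:
        X = X.transpose(-2, -1)
    return X


def orthogonalize_by_shape(grads):
    """Group 2-D gradients by shape and orthogonalize each group in one batch."""
    groups = defaultdict(list)
    for i, g in enumerate(grads):
        groups[tuple(g.shape)].append(i)
    out = [None] * len(grads)
    for idxs in groups.values():
        stacked = torch.stack([grads[i] for i in idxs])
        for i, x in zip(idxs, newton_schulz_batched(stacked)):
            out[i] = x
    return out
```

Calling `orthogonalize_by_shape` on a list of gradient matrices returns results in the original order. The number of distinct shapes determines the number of batched calls, which is consistent with the card's description of 66 weight matrices grouped into 4 shape-matched batches.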