PR #1418 (open)
Non-record: Autoresearch-Guided Optimization — 100+ Experiments + Negative Results
by Park-Tae-Hwan
val_bpb: 1.4192
Architecture: Transformer
Optimizer: Muon
Artifact Size: —
Training Techniques
Optimizer
Muon
  weight_decay: null
  momentum: 0.99
  other_params: {"parallel_variant": "Parallel Muon", "matrix_lr": 0.04, "muon_wd": 0.08}
Parallel Muon
  weight_decay: null
  momentum: null
  other_params: {"row_normalization": true}
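To make the Muon configuration above concrete, here is a minimal sketch of a Muon-style update step: momentum accumulation followed by approximate orthogonalization of the 2-D gradient via a quintic Newton–Schulz iteration. This follows the publicly described Muon algorithm, not this PR's code; how `muon_wd` is applied (decoupled decay is assumed here) and the function names are illustrative assumptions, while lr=0.04, momentum=0.99, and wd=0.08 are taken from the card.

```python
import numpy as np

def newton_schulz(G, steps=5):
    """Approximately orthogonalize G with a quintic Newton-Schulz iteration
    (coefficients from the publicly described Muon optimizer)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)   # spectral norm <= Frobenius norm <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

def muon_step(weight, grad, buf, lr=0.04, momentum=0.99, weight_decay=0.08):
    """One Muon-style update for a 2-D weight matrix (illustrative sketch;
    matrix_lr=0.04, momentum=0.99, muon_wd=0.08 from the card above)."""
    buf *= momentum
    buf += grad
    update = newton_schulz(buf)
    weight *= 1.0 - lr * weight_decay    # decoupled weight decay (assumption)
    weight -= lr * update
    return weight
```

The Parallel Muon variant additionally shards this work across devices; its `row_normalization` flag is the setting the contributions section reports as interacting badly with banked optimizer state.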
Regularization
logit softcap
  parameters: {"softcap": 15}
LR Schedule
warmdown
  parameters: {"warmdown_steps": 4000}
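A warmdown schedule with warmdown_steps=4000 can be sketched as below. The constant-then-linear (trapezoidal) shape is the usual meaning of "warmdown" in these speedrun-style runs but is an assumption about this PR's exact schedule; only the 4000-step figure comes from the card.

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_steps=4000):
    """Hold the learning rate constant, then decay it linearly to zero
    over the final `warmdown_steps` steps (assumed trapezoidal shape)."""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    frac = (total_steps - step) / warmdown_steps  # 1.0 -> 0.0 during warmdown
    return base_lr * frac
```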
Quantization
GPTQ
  bits: 4
  scope: model weights
QAT
  bits: 4
  scope: model weights
int6
  bits: 6
  scope: model weights
Architecture
ReLU²
  Uses the squared-ReLU (ReLU²) activation in the MLP; reported as the best of the activations tested.
  parameters: null
GELU
  Alternative MLP activation; tested and found worse than ReLU².
  parameters: null
SwiGLU
  Alternative MLP activation; tested and found worse than ReLU².
  parameters: null
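The three activations compared above can be sketched as follows. These are the standard definitions (GELU shown in its tanh approximation; SwiGLU shown with its gate/up projections, whose shapes are illustrative), not code from this PR.

```python
import numpy as np

def relu_squared(x):
    """ReLU^2: square of ReLU. Reported best in the experiments above."""
    return np.maximum(x, 0.0) ** 2

def gelu(x):
    """GELU, tanh approximation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def swiglu(x, w_gate, w_up):
    """SwiGLU MLP input transform: SiLU(x @ w_gate) * (x @ w_up).
    Unlike the other two, SwiGLU is gated and needs two projections."""
    gate = x @ w_gate
    return (gate / (1.0 + np.exp(-gate))) * (x @ w_up)
```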
Hadamard rotation
  Applied a Hadamard rotation before GPTQ quantization (HadaGPTQ / PolarQuant-style).
  parameters: null
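The idea behind the Hadamard rotation above is to multiply weights by an orthogonal Hadamard matrix so outlier values get spread across many coordinates before quantization; because the rotation is orthogonal it can be undone exactly (or fused into adjacent layers). The sketch below illustrates that idea only; the PR's HadaGPTQ details may differ.

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def rotate_for_quantization(W):
    """Right-multiply W by a normalized Hadamard matrix H/sqrt(n).
    Since H/sqrt(n) is orthogonal, (W @ Hn) @ Hn.T recovers W exactly,
    so quantization error is the only information lost downstream."""
    n = W.shape[1]
    Hn = hadamard(n) / np.sqrt(n)
    return W @ Hn, Hn
```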
Initialization
init_std
  Tuned the initialization standard deviation; helped with AdamW but did not transfer to Muon.
Novel Contributions
- Autoresearch-guided hyperparameter exploration with 100+ automated experiments
- Validated the QK-Gain=3.0 improvement on H100 with the real Muon optimizer
- Documented negative results for MuonEq-R with Parallel Muon, int4 quantization, Hadamard rotation, and several hyperparameter settings
- Identified that row-normalizing momentum interacts badly with banked/sharded optimizer state in distributed training
- Compared Mac-based AdamW sweeps with H100 Muon sweeps to assess transferability of findings