PR #1418 (open)
Non-record: Autoresearch-Guided Optimization — 100+ Experiments + Negative Results
by Park-Tae-Hwan
val_bpb: 1.4192
Architecture: Transformer
Optimizer: Muon
Artifact Size: —
Training Techniques
Optimizer
Muon
  weight_decay: null
  momentum: 0.99
  other_params: {"parallel_variant": "Parallel Muon", "matrix_lr": 0.04, "muon_wd": 0.08}
Parallel Muon
  weight_decay: null
  momentum: null
  other_params: {"row_normalization": true}
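To make the Muon configuration above concrete, here is a minimal sketch of a Muon-style update step: momentum accumulation followed by approximate orthogonalization of the 2-D gradient via a quintic Newton–Schulz iteration. This follows the publicly described Muon algorithm, not this PR's code; how `muon_wd` is applied (decoupled decay is assumed here) and the function names are illustrative assumptions, while lr=0.04, momentum=0.99, and wd=0.08 are taken from the card.

```python
import numpy as np

def newton_schulz(G, steps=5):
    """Approximately orthogonalize G with a quintic Newton-Schulz iteration
    (coefficients from the publicly described Muon optimizer)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)   # spectral norm <= Frobenius norm <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

def muon_step(weight, grad, buf, lr=0.04, momentum=0.99, weight_decay=0.08):
    """One Muon-style update for a 2-D weight matrix (illustrative sketch;
    matrix_lr=0.04, momentum=0.99, muon_wd=0.08 from the card above)."""
    buf *= momentum
    buf += grad
    update = newton_schulz(buf)
    weight *= 1.0 - lr * weight_decay    # decoupled weight decay (assumption)
    weight -= lr * update
    return weight
```

The Parallel Muon variant additionally shards this work across devices; its `row_normalization` flag is the setting the contributions section reports as interacting badly with banked optimizer state.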
Regularization
logit softcap
  parameters: {"softcap": 15}
LR Schedule
warmdown
  parameters: {"warmdown_steps": 4000}
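A warmdown schedule with warmdown_steps=4000 can be sketched as below. The constant-then-linear (trapezoidal) shape is the usual meaning of "warmdown" in these speedrun-style runs but is an assumption about this PR's exact schedule; only the 4000-step figure comes from the card.

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_steps=4000):
    """Hold the learning rate constant, then decay it linearly to zero
    over the final `warmdown_steps` steps (assumed trapezoidal shape)."""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    frac = (total_steps - step) / warmdown_steps  # 1.0 -> 0.0 during warmdown
    return base_lr * frac
```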
Quantization
GPTQ
  bits: 4
  scope: model weights
QAT
  bits: 4
  scope: model weights
int6
  bits: 6
  scope: model weights
Architecture
ReLU²
  Uses the squared-ReLU (ReLU²) activation in the MLP; reported as the best of the activations tested.
  parameters: null
GELU
  Alternative MLP activation; tested and found worse than ReLU².
  parameters: null
SwiGLU
  Alternative MLP activation; tested and found worse than ReLU².
  parameters: null
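The three activations compared above can be sketched as follows. These are the standard definitions (GELU shown in its tanh approximation; SwiGLU shown with its gate/up projections, whose shapes are illustrative), not code from this PR.

```python
import numpy as np

def relu_squared(x):
    """ReLU^2: square of ReLU. Reported best in the experiments above."""
    return np.maximum(x, 0.0) ** 2

def gelu(x):
    """GELU, tanh approximation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def swiglu(x, w_gate, w_up):
    """SwiGLU MLP input transform: SiLU(x @ w_gate) * (x @ w_up).
    Unlike the other two, SwiGLU is gated and needs two projections."""
    gate = x @ w_gate
    return (gate / (1.0 + np.exp(-gate))) * (x @ w_up)
```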
Hadamard rotation
  Applied a Hadamard rotation before GPTQ quantization (HadaGPTQ / PolarQuant-style).
  parameters: null
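The idea behind the Hadamard rotation above is to multiply weights by an orthogonal Hadamard matrix so outlier values get spread across many coordinates before quantization; because the rotation is orthogonal it can be undone exactly (or fused into adjacent layers). The sketch below illustrates that idea only; the PR's HadaGPTQ details may differ.

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def rotate_for_quantization(W):
    """Right-multiply W by a normalized Hadamard matrix H/sqrt(n).
    Since H/sqrt(n) is orthogonal, (W @ Hn) @ Hn.T recovers W exactly,
    so quantization error is the only information lost downstream."""
    n = W.shape[1]
    Hn = hadamard(n) / np.sqrt(n)
    return W @ Hn, Hn
```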
Initialization
init_std
  Tuned the initialization standard deviation; helped with AdamW but did not transfer to Muon.
Novel Contributions
- Autoresearch-guided hyperparameter exploration with 100+ automated experiments
- Validated the QK-Gain=3.0 improvement on H100 with the real Muon optimizer
- Documented negative results for MuonEq-R with Parallel Muon, int4 quantization, Hadamard rotation, and several hyperparameter settings
- Identified that row-normalizing momentum interacts badly with banked/sharded optimizer state in distributed training
- Compared Mac-based AdamW sweeps with H100 Muon sweeps to assess transferability of findings