PR #1036

open

Non-record: AutoResearch Batch Optimization — 1.1974 bpb (1× RTX 4090)

by ivanontech
val_bpb
1.1974
Architecture
Transformer
Optimizer
Muon
Artifact Size

Training Techniques

Architecture
weight tying
Tied input and output embeddings.
parameters: null
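As an illustrative sketch (not the submission's code), weight tying means the output projection reuses the input embedding matrix, so one parameter tensor serves both roles:

```python
# Minimal sketch of weight tying: the unembedding (output projection)
# reuses the input embedding matrix W, so the two layers share
# parameters. All names here are illustrative.

def embed(tokens, W):
    """Look up a row of W for each token id."""
    return [W[t] for t in tokens]

def logits(hidden, W):
    """Project hidden states back to vocabulary space with the SAME W."""
    return [[sum(h_i * w_i for h_i, w_i in zip(h, row)) for row in W]
            for h in hidden]

# W has shape (vocab_size=4, d_model=2); it serves both roles.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
h = embed([2], W)   # -> [[1.0, 1.0]]
out = logits(h, W)  # -> [[1.0, 1.0, 2.0, 2.0]]
```

Because the matrix is shared, a gradient update through either path moves the same parameters, halving the parameter count of the two largest layers.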
Gated Attention
Learned per-token value embeddings fused into the attention values via a learned gate, applied on alternating layers.
parameters: {"layers":12}
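A minimal sketch of what gated fusion could look like, assuming a sigmoid gate blending the standard attention value with the per-token value embedding (the exact gating form used in the run is not specified here):

```python
import math

def gated_value_fusion(v_attn, v_embed, gate_logit):
    """Illustrative gated fusion: blend the attention value with a
    learned per-token value embedding via a sigmoid gate."""
    g = 1.0 / (1.0 + math.exp(-gate_logit))  # gate in (0, 1)
    return [g * a + (1.0 - g) * b for a, b in zip(v_attn, v_embed)]

def apply_per_layer(layer_idx, v_attn, v_embed, gate_logit):
    # Fusion only on alternating layers (here: even indices), matching
    # the "alternating layers" description for the 12-layer model.
    if layer_idx % 2 == 0:
        return gated_value_fusion(v_attn, v_embed, gate_logit)
    return v_attn

# gate_logit = 0 gives g = 0.5, an even blend
print(apply_per_layer(0, [2.0, 4.0], [0.0, 0.0], 0.0))  # [1.0, 2.0]
print(apply_per_layer(1, [2.0, 4.0], [0.0, 0.0], 0.0))  # [2.0, 4.0]
```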
KV head count
Used 8 attention heads and 8 KV heads with full attention.
parameters: {"heads":8,"kv_heads":8}
Optimizer
Muon
weight_decay: 0.2
momentum: null
other_params: {"matrix_lr":0.1,"adamw_for":"embeddings/scalars"}
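The split described in other_params (Muon for matrix parameters at matrix_lr=0.1, AdamW for embeddings/scalars) amounts to partitioning parameters by shape and role. A hedged sketch with made-up parameter names:

```python
def partition_params(named_shapes):
    """Illustrative optimizer split: 2-D matrix parameters go to Muon;
    embeddings and 1-D scalar-like parameters go to AdamW, as listed
    under other_params. Classification rule here is an assumption."""
    muon, adamw = [], []
    for name, shape in named_shapes.items():
        is_matrix = len(shape) == 2 and "embed" not in name
        (muon if is_matrix else adamw).append(name)
    return muon, adamw

params = {
    "embed.weight": (50257, 768),    # embedding -> AdamW
    "attn.qkv.weight": (2304, 768),  # matrix -> Muon (matrix_lr=0.1)
    "ln.scale": (768,),              # 1-D scalar-like -> AdamW
}
muon, adamw = partition_params(params)
print(muon)   # ['attn.qkv.weight']
print(adamw)  # ['embed.weight', 'ln.scale']
```

Muon's orthogonalized updates only make sense for matrices, which is why embeddings and scalars fall back to AdamW.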
LR Schedule
warmdown
parameters: {"schedule":"cosine warmdown","warmup_ratio":0}
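With warmup_ratio=0, the schedule skips warmup entirely: the first step already runs at the peak learning rate, and a half-cosine decays it to zero. A sketch, assuming a standard half-cosine shape (peak LR value is illustrative):

```python
import math

def lr_at(step, total_steps, peak_lr):
    """Cosine warmdown with warmup_ratio=0: no warmup phase, LR starts
    at peak and follows a half-cosine down to 0."""
    t = step / total_steps                       # progress in [0, 1]
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * t))

# First step already uses the full LR; the last step reaches ~0.
print(lr_at(0, 404, 0.1))    # 0.1
print(lr_at(404, 404, 0.1))  # 0.0
```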
Regularization
weight decay
parameters: {"value":0.2}
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Other
other
Aggressive reduction of the total batch size from 2^19 to 2^16 tokens, yielding 8× more optimization steps within the fixed 5-minute wallclock budget.
parameters: {"total_batch_size":65536,"baseline_batch_size":524288,"steps":404}
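The arithmetic behind this trade-off follows directly from the parameters above (assuming throughput stays token-bound, so tokens processed per unit time is roughly constant):

```python
# Batch-size trade-off reported in the parameters: shrinking the total
# batch from 2**19 to 2**16 tokens gives 8x more optimizer steps for
# the same number of tokens processed in the fixed wallclock budget.

baseline_batch = 2**19   # 524288 tokens/step (baseline_batch_size)
reduced_batch = 2**16    # 65536 tokens/step (total_batch_size)
steps = 404              # steps reported for the run

step_multiplier = baseline_batch // reduced_batch
tokens_seen = reduced_batch * steps

print(step_multiplier)  # 8
print(tokens_seen)      # 26476544 tokens in the 5-minute run
```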

Novel Contributions

  • Reduced total batch size from 2^19 to 2^16 tokens, giving roughly 8× more training steps within the same time budget
  • Automated hyperparameter search across three rounds of experiments
  • Value embeddings with gated fusion on alternating layers
  • Use of the Muon optimizer for matrix parameters, with AdamW for embeddings and scalars
  • Demonstrated competitive performance on a single RTX 4090