PR #1036

open

Non-record: AutoResearch Batch Optimization — 1.1974 bpb (1× RTX 4090)

by ivanontech
val_bpb
1.1974
Architecture
Transformer
Optimizer
Muon
Artifact Size

Training Techniques

Architecture
weight tying
Tied input and output embeddings.
parameters: null
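As an illustrative sketch (not the submission's code), weight tying means the output projection reuses the input embedding matrix, so one parameter tensor serves both roles:

```python
# Minimal sketch of weight tying: the unembedding (output projection)
# reuses the input embedding matrix W, so the two layers share
# parameters. All names here are illustrative.

def embed(tokens, W):
    """Look up a row of W for each token id."""
    return [W[t] for t in tokens]

def logits(hidden, W):
    """Project hidden states back to vocabulary space with the SAME W."""
    return [[sum(h_i * w_i for h_i, w_i in zip(h, row)) for row in W]
            for h in hidden]

# W has shape (vocab_size=4, d_model=2); it serves both roles.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
h = embed([2], W)   # -> [[1.0, 1.0]]
out = logits(h, W)  # -> [[1.0, 1.0, 2.0, 2.0]]
```

Because the matrix is shared, a gradient update through either path moves the same parameters, halving the parameter count of the two largest layers.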
Gated Attention
Learned per-token value embeddings fused into the attention values via a learned gate, applied on alternating layers.
parameters: {"layers":12}
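A minimal sketch of what gated fusion could look like, assuming a sigmoid gate blending the standard attention value with the per-token value embedding (the exact gating form used in the run is not specified here):

```python
import math

def gated_value_fusion(v_attn, v_embed, gate_logit):
    """Illustrative gated fusion: blend the attention value with a
    learned per-token value embedding via a sigmoid gate."""
    g = 1.0 / (1.0 + math.exp(-gate_logit))  # gate in (0, 1)
    return [g * a + (1.0 - g) * b for a, b in zip(v_attn, v_embed)]

def apply_per_layer(layer_idx, v_attn, v_embed, gate_logit):
    # Fusion only on alternating layers (here: even indices), matching
    # the "alternating layers" description for the 12-layer model.
    if layer_idx % 2 == 0:
        return gated_value_fusion(v_attn, v_embed, gate_logit)
    return v_attn

# gate_logit = 0 gives g = 0.5, an even blend
print(apply_per_layer(0, [2.0, 4.0], [0.0, 0.0], 0.0))  # [1.0, 2.0]
print(apply_per_layer(1, [2.0, 4.0], [0.0, 0.0], 0.0))  # [2.0, 4.0]
```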
KV head count
Used 8 attention heads and 8 KV heads with full attention.
parameters: {"heads":8,"kv_heads":8}
Optimizer
Muon
weight_decay: 0.2
momentum: null
other_params: {"matrix_lr":0.1,"adamw_for":"embeddings/scalars"}
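The split described in other_params (Muon for matrix parameters at matrix_lr=0.1, AdamW for embeddings/scalars) amounts to partitioning parameters by shape and role. A hedged sketch with made-up parameter names:

```python
def partition_params(named_shapes):
    """Illustrative optimizer split: 2-D matrix parameters go to Muon;
    embeddings and 1-D scalar-like parameters go to AdamW, as listed
    under other_params. Classification rule here is an assumption."""
    muon, adamw = [], []
    for name, shape in named_shapes.items():
        is_matrix = len(shape) == 2 and "embed" not in name
        (muon if is_matrix else adamw).append(name)
    return muon, adamw

params = {
    "embed.weight": (50257, 768),    # embedding -> AdamW
    "attn.qkv.weight": (2304, 768),  # matrix -> Muon (matrix_lr=0.1)
    "ln.scale": (768,),              # 1-D scalar-like -> AdamW
}
muon, adamw = partition_params(params)
print(muon)   # ['attn.qkv.weight']
print(adamw)  # ['embed.weight', 'ln.scale']
```

Muon's orthogonalized updates only make sense for matrices, which is why embeddings and scalars fall back to AdamW.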
LR Schedule
warmdown
parameters: {"schedule":"cosine warmdown","warmup_ratio":0}
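With warmup_ratio=0, the schedule skips warmup entirely: the first step already runs at the peak learning rate, and a half-cosine decays it to zero. A sketch, assuming a standard half-cosine shape (peak LR value is illustrative):

```python
import math

def lr_at(step, total_steps, peak_lr):
    """Cosine warmdown with warmup_ratio=0: no warmup phase, LR starts
    at peak and follows a half-cosine down to 0."""
    t = step / total_steps                       # progress in [0, 1]
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * t))

# First step already uses the full LR; the last step reaches ~0.
print(lr_at(0, 404, 0.1))    # 0.1
print(lr_at(404, 404, 0.1))  # 0.0
```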
Regularization
weight decay
parameters: {"value":0.2}
Sequence Length
sequence_length
train_length: 2048
eval_length: null
Other
other
Aggressive reduction of the total batch size from 2^19 to 2^16 tokens, yielding 8× more optimization steps within the fixed 5-minute wallclock budget.
parameters: {"total_batch_size":65536,"baseline_batch_size":524288,"steps":404}
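The arithmetic behind this trade-off follows directly from the parameters above (assuming throughput stays token-bound, so tokens processed per unit time is roughly constant):

```python
# Batch-size trade-off reported in the parameters: shrinking the total
# batch from 2**19 to 2**16 tokens gives 8x more optimizer steps for
# the same number of tokens processed in the fixed wallclock budget.

baseline_batch = 2**19   # 524288 tokens/step (baseline_batch_size)
reduced_batch = 2**16    # 65536 tokens/step (total_batch_size)
steps = 404              # steps reported for the run

step_multiplier = baseline_batch // reduced_batch
tokens_seen = reduced_batch * steps

print(step_multiplier)  # 8
print(tokens_seen)      # 26476544 tokens in the 5-minute run
```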

Novel Contributions

  • Reduced total batch size from 2^19 to 2^16 tokens, giving roughly 8× more training steps within the same time budget
  • Automated hyperparameter search across three rounds of experiments
  • Value embeddings with gated fusion on alternating layers
  • Use of the Muon optimizer for matrix parameters, with AdamW for embeddings and scalars
  • Demonstrated competitive performance on a single RTX 4090