PR #1139

open

Non-record: AutoResearch Value Embeddings + MLP3x, 1.1801 bpb (1x RTX 4090)

by ivanontech
val_bpb
1.1801
Architecture
Transformer
Optimizer
Muon
Artifact Size

Training Techniques

Architecture
Value Residual
Learned value embeddings with gating, alternating across layers.
parameters: {"params":"31.5M"}
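The value-embedding entry above is cited as the main gain; a minimal sketch of one plausible gated form (the class name, scalar gate, and additive mixing are assumptions, not taken from the PR):

```python
import torch
import torch.nn as nn

class GatedValueEmbedding(nn.Module):
    """Learned per-token value embedding added to the attention values
    through a learned gate (illustrative sketch, not the PR's code)."""
    def __init__(self, vocab_size: int, dim: int):
        super().__init__()
        self.value_emb = nn.Embedding(vocab_size, dim)
        # Gate starts at 0 so the embedding initially contributes nothing.
        self.gate = nn.Parameter(torch.zeros(()))

    def forward(self, v: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
        # v: (batch, seq, dim) attention values; input_ids: (batch, seq)
        return v + torch.tanh(self.gate) * self.value_emb(input_ids)
```

Alternating this module across layers (per the description) would mean only every other block carries its own value embedding.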
MLP3x
Reduced MLP expansion to 3x hidden size.
parameters: {"hidden_dim":1920}
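A 3x expansion with hidden_dim 1920 implies a model width of 640 (3 × 640 = 1920); the width is inferred, not stated. A minimal sketch of such a block:

```python
import torch
import torch.nn as nn

class MLP3x(nn.Module):
    """Feed-forward block with a 3x expansion instead of the usual 4x."""
    def __init__(self, dim: int):
        super().__init__()
        self.up = nn.Linear(dim, 3 * dim, bias=False)
        self.down = nn.Linear(3 * dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # ReLU^2 activation is common in this speedrun lineage; assumed here.
        return self.down(torch.relu(self.up(x)).square())
```

The narrower hidden layer cuts FLOPs per step, which is what buys the extra training steps mentioned under Novel Contributions.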
RoPE
Rotary positional encoding.
parameters: null
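Standard RoPE rotates pairs of channels by position-dependent angles; a self-contained sketch (base = 10000 is the conventional default, assumed here since the PR lists no parameters):

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary positional encoding to x of shape (..., seq, dim)."""
    seq, dim = x.shape[-2], x.shape[-1]
    half = dim // 2
    # Per-channel-pair frequencies, decaying geometrically with index.
    freqs = base ** (-torch.arange(half, dtype=x.dtype) / half)
    angles = torch.arange(seq, dtype=x.dtype)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate each (x1_i, x2_i) pair by its position-dependent angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```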
KV head count
5 attention heads and 5 KV heads with MHA.
parameters: {"heads":5,"kv_heads":5}
U-Net skip connections
U-Net-style skip connections from the initial embedding (x0) combined with learned residual lambdas.
parameters: null
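One common realization of x0 skips with residual lambdas mixes the current hidden state with the initial embedding via two learned scalars; the init values below are assumptions:

```python
import torch
import torch.nn as nn

class ResidualLambdaSkip(nn.Module):
    """Mix the current hidden state with the initial embedding x0 using
    learned scalar lambdas (a sketch of the technique named above)."""
    def __init__(self):
        super().__init__()
        # Init to (1, 0): identity mixing, so training starts unperturbed.
        self.lambdas = nn.Parameter(torch.tensor([1.0, 0.0]))

    def forward(self, x: torch.Tensor, x0: torch.Tensor) -> torch.Tensor:
        return self.lambdas[0] * x + self.lambdas[1] * x0
```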
SSSL
Sliding-window attention pattern: 3 short-window layers and 1 long-window layer in every group of 4 layers.
parameters: null
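The SSSL layer schedule reduces to a simple per-layer window assignment; the window sizes below are illustrative assumptions, since the PR only states the 3-short/1-long pattern:

```python
def sssl_windows(n_layers: int, short: int = 512, long: int = 2048) -> list:
    """Per group of 4 layers: 3 short-window layers, then 1 long-window
    layer (illustrative window sizes)."""
    return [long if i % 4 == 3 else short for i in range(n_layers)]
```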
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
Optimizer
Muon
weight_decay: null
momentum: 0.95
other_params: {"lr":0.1}
Adam
weight_decay: null
momentum: null
other_params: {"lr":0.6,"scope":"embeddings and scalars"}
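The Muon/Adam split above is the now-standard speedrun recipe: Muon handles 2D matrix parameters, Adam the embeddings and scalars. Muon's core is a Newton-Schulz iteration that pushes a gradient matrix's singular values toward 1; the coefficients below follow the public Muon reference implementation, while how the PR wires this into its optimizer loop is not shown:

```python
import torch

@torch.no_grad()
def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a gradient matrix via a quintic
    Newton-Schulz iteration (the core of the Muon update)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (g.norm() + 1e-7)  # normalize so singular values are <= 1
    transposed = x.shape[0] > x.shape[1]
    if transposed:  # iterate on the wide orientation for efficiency
        x = x.T
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * s @ s) @ x
    return x.T if transposed else x
```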
Weight Averaging
EMA
parameters: {"decay":"0.995-0.998"}
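The EMA entry reduces to a per-step linear interpolation of a shadow copy toward the live weights, with decay swept in the stated 0.995-0.998 range:

```python
import torch

@torch.no_grad()
def ema_update(ema_params, params, decay: float = 0.995):
    """One EMA step: ema <- decay * ema + (1 - decay) * current."""
    for pe, p in zip(ema_params, params):
        pe.lerp_(p, 1.0 - decay)
```

Evaluation then runs on the EMA copy rather than the raw training weights.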
Compression
zlib
level: null
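The compression step is plain zlib over the serialized artifact; level 6 below is zlib's default, since the PR leaves the level unspecified:

```python
import zlib

def compress_artifact(raw: bytes, level: int = 6) -> bytes:
    """Compress serialized weights with zlib (level 6 = zlib default)."""
    return zlib.compress(raw, level)
```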
Other
other
Automated ablation framework that iteratively tested architecture and hyperparameter configurations across multiple sweep rounds.
parameters: {"configs_tested":50,"sweep_rounds":5}

Novel Contributions

  • Value embeddings with gating as the main performance improvement
  • MLP 3x chosen over 4x to allow more training steps within the wallclock budget
  • Automated ablation framework (autoresearch) for iterative architecture and hyperparameter search
  • SSSL sliding window attention pattern
  • Muon optimizer for matrix parameters with Adam for embeddings and scalars