PR #1139
Non-record: AutoResearch Value Embeddings + MLP3x, 1.1801 bpb (1x RTX 4090)
by ivanontech
val_bpb: 1.1801
Architecture: Transformer
Optimizer: Muon
Artifact Size: —
Training Techniques
Architecture
Value Residual
Learned value embeddings with gating, alternating across layers.
parameters: {"params":"31.5M"}
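A minimal sketch of the gated value-embedding idea, assuming standard mechanics (the function and gate names are illustrative, not the PR's actual code): each token looks up a learned per-token value embedding that is blended into the attention values through a learned sigmoid gate, and the PR alternates this across layers.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mix_value(attn_value, value_embed, gate_logit):
    """Blend an attention value vector with the token's learned value
    embedding via a learned scalar gate (placement per layer is assumed)."""
    g = sigmoid(gate_logit)
    return [(1 - g) * v + g * e for v, e in zip(attn_value, value_embed)]

# gate_logit = 0 gives g = 0.5, an even blend of the two sources.
out = mix_value([1.0, 2.0], [3.0, 4.0], 0.0)  # -> [2.0, 3.0]
```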
MLP3x
Reduced MLP expansion to 3x hidden size.
parameters: {"hidden_dim":1920}
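A back-of-envelope check of the 3x expansion (the model width is inferred, not stated in the PR): hidden_dim = 1920 with an exact 3x factor implies a model width of 640, and roughly 25% fewer MLP weights per block than a classic 4x MLP, which is what frees wallclock budget for extra training steps.

```python
# model_dim is an inference from hidden_dim = 1920 and the 3x factor.
model_dim = 1920 // 3                       # 640 (assumed model width)
mlp3x_params = 2 * model_dim * 1920         # up- plus down-projection weights
mlp4x_params = 2 * model_dim * (4 * model_dim)
savings = 1 - mlp3x_params / mlp4x_params   # fraction of MLP weights saved
```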
RoPE
Rotary positional encoding.
parameters: null
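The standard RoPE formulation, sketched in plain Python (this is the textbook version, not necessarily the PR's exact implementation): each channel pair is rotated by an angle that grows linearly with position, at a per-pair frequency base^(-i/d).

```python
import math

def rope(x, pos, base=10000.0):
    """Rotate each (x[2i], x[2i+1]) pair by pos * base**(-2i/d)."""
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out.extend([x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c])
    return out

# Position 0 is a zero-angle rotation, so the vector is unchanged;
# rotations at any position preserve the vector's norm.
assert rope([1.0, 0.0, 0.5, 0.5], pos=0) == [1.0, 0.0, 0.5, 0.5]
```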
KV head count
5 attention heads and 5 KV heads, i.e. standard multi-head attention (MHA) rather than grouped-query attention.
parameters: {"heads":5,"kv_heads":5}
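With kv_heads equal to heads, the grouped-query "group size" is 1, which is just plain MHA: every query head gets its own K/V head.

```python
heads, kv_heads = 5, 5
group_size = heads // kv_heads   # query heads sharing one K/V head
is_mha = group_size == 1         # group size 1 means no K/V sharing at all
```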
U-Net skip connections
Residual skip connections back to the initial embedding x0, weighted by learned residual lambdas.
parameters: null
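A sketch of the x0 skip with learned residual lambdas (modded-nanoGPT-style mechanics are assumed here; the PR's exact placement may differ): a layer's input is a learned-weighted mix of the running residual stream and the initial embedding x0.

```python
def skip_mix(h, x0, lambdas=(1.0, 0.0)):
    """lambdas are learned scalars; (1, 0) recovers a plain residual stream."""
    l_h, l_x0 = lambdas
    return [l_h * a + l_x0 * b for a, b in zip(h, x0)]

# An even mix of the residual stream and the initial embedding.
out = skip_mix([2.0, 4.0], [1.0, 1.0], lambdas=(0.5, 0.5))  # -> [1.5, 2.5]
```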
SSSL
Sliding-window attention in a short-short-short-long pattern: three short-window layers followed by one long-window layer in each group of 4 layers.
parameters: null
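The SSSL schedule expressed as a layer-index rule (the window sizes below are illustrative; the PR only states the 3-short / 1-long ratio): within each group of four layers, the first three use a short sliding window and the fourth a long one.

```python
def window_for_layer(layer_idx, short=512, long=2048):
    """Every 4th layer (index 3 mod 4) attends over the long window."""
    return long if layer_idx % 4 == 3 else short

pattern = [window_for_layer(i) for i in range(8)]
# -> [512, 512, 512, 2048, 512, 512, 512, 2048]
```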
Sequence Length
train_length: 2048
eval_length: 2048
Optimizer
Muon
weight_decay: null
momentum: 0.95
other_params: {"lr":0.1}
Adam
weight_decay: null
momentum: null
other_params: {"lr":0.6,"scope":"embeddings and scalars"}
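A sketch of the optimizer split (a common Muon convention, assumed here rather than taken from the PR's code): Muon handles 2-D matrix parameters, while Adam handles embeddings and scalar parameters. The parameter names and shapes are illustrative stand-ins.

```python
def route(name, shape):
    """Route a parameter to Muon (matrices) or Adam (embeddings, scalars)."""
    if "embed" in name or len(shape) < 2:
        return "adam"    # lr 0.6 per the PR
    return "muon"        # lr 0.1, momentum 0.95 per the PR

params = {
    "token_embed.weight": (50257, 640),  # embedding table -> Adam
    "attn.qkv.weight": (1920, 640),      # weight matrix   -> Muon
    "ln.scale": (640,),                  # 1-D scalar-ish  -> Adam
}
routing = {n: route(n, s) for n, s in params.items()}
```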
Weight Averaging
EMA
parameters: {"decay":"0.995-0.998"}
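A minimal EMA sketch: a shadow copy of the weights tracks the training weights with a decay in the PR's stated 0.995-0.998 range, and evaluation uses the shadow copy rather than the raw weights.

```python
def ema_update(shadow, params, decay=0.995):
    """One EMA step: shadow <- decay * shadow + (1 - decay) * params."""
    return [decay * s + (1 - decay) * p for s, p in zip(shadow, params)]

shadow = [0.0]
for step in range(1000):
    shadow = ema_update(shadow, [1.0], decay=0.995)
# After many steps the shadow converges toward the (constant) parameter.
```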
Compression
zlib
level: null
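The compression step with the stdlib `zlib` module (the level is not recorded in the PR; `level=6` is zlib's default trade-off and is used here only as an illustration for compressing a checkpoint blob).

```python
import zlib

blob = b"model-weights " * 1000          # stand-in for serialized weights
packed = zlib.compress(blob, level=6)    # assumed level; PR leaves it null
restored = zlib.decompress(packed)       # round-trips losslessly
```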
Other
Automated ablation framework that iteratively tested architecture and hyperparameter configurations across multiple sweep rounds.
parameters: {"configs_tested":50,"sweep_rounds":5}
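A toy sketch of the iterative sweep loop; the real autoresearch framework's interface is unknown, so everything here is illustrative except the rough counts (~50 configs over 5 sweep rounds, matching the PR's parameters), and the stand-in `evaluate` plays the role of a full training run returning val_bpb.

```python
import random

random.seed(0)

def evaluate(config):
    # Stand-in for a training run; lower val_bpb is better.
    return 1.2 - 0.01 * config["mlp_x"] + random.uniform(0, 0.02)

best = None
for round_idx in range(5):                    # sweep_rounds: 5
    for _ in range(10):                       # ~50 configs tested in total
        cfg = {"mlp_x": random.choice([3, 4])}
        bpb = evaluate(cfg)
        if best is None or bpb < best[0]:
            best = (bpb, cfg)                 # keep the best config so far
```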
Novel Contributions
- Value embeddings with gating as the main performance improvement
- MLP 3x chosen over 4x to allow more training steps within the wallclock budget
- Automated ablation framework (autoresearch) for iterative architecture and hyperparameter search
- SSSL sliding window attention pattern
- Muon optimizer for matrix parameters with Adam for embeddings and scalars