PR #1139
Non-record: AutoResearch Value Embeddings + MLP3x, 1.1801 bpb (1x RTX 4090)
by ivanontech
val_bpb: 1.1801
Architecture: Transformer
Optimizer: Muon
Artifact Size: —
Training Techniques
Architecture
Value Residual
Learned value embeddings with gating, alternating across layers.
parameters: {"params":"31.5M"}
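A minimal sketch of the gated value-embedding idea, assuming standard mechanics (the function and gate names are illustrative, not the PR's actual code): each token looks up a learned per-token value embedding that is blended into the attention values through a learned sigmoid gate, and the PR alternates this across layers.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mix_value(attn_value, value_embed, gate_logit):
    """Blend an attention value vector with the token's learned value
    embedding via a learned scalar gate (placement per layer is assumed)."""
    g = sigmoid(gate_logit)
    return [(1 - g) * v + g * e for v, e in zip(attn_value, value_embed)]

# gate_logit = 0 gives g = 0.5, an even blend of the two sources.
out = mix_value([1.0, 2.0], [3.0, 4.0], 0.0)  # -> [2.0, 3.0]
```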
MLP3x
Reduced MLP expansion to 3x hidden size.
parameters: {"hidden_dim":1920}
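A back-of-envelope check of the 3x expansion (the model width is inferred, not stated in the PR): hidden_dim = 1920 with an exact 3x factor implies a model width of 640, and roughly 25% fewer MLP weights per block than a classic 4x MLP, which is what frees wallclock budget for extra training steps.

```python
# model_dim is an inference from hidden_dim = 1920 and the 3x factor.
model_dim = 1920 // 3                       # 640 (assumed model width)
mlp3x_params = 2 * model_dim * 1920         # up- plus down-projection weights
mlp4x_params = 2 * model_dim * (4 * model_dim)
savings = 1 - mlp3x_params / mlp4x_params   # fraction of MLP weights saved
```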
RoPE
Rotary positional encoding.
parameters: null
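The standard RoPE formulation, sketched in plain Python (this is the textbook version, not necessarily the PR's exact implementation): each channel pair is rotated by an angle that grows linearly with position, at a per-pair frequency base^(-i/d).

```python
import math

def rope(x, pos, base=10000.0):
    """Rotate each (x[2i], x[2i+1]) pair by pos * base**(-2i/d)."""
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out.extend([x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c])
    return out

# Position 0 is a zero-angle rotation, so the vector is unchanged;
# rotations at any position preserve the vector's norm.
assert rope([1.0, 0.0, 0.5, 0.5], pos=0) == [1.0, 0.0, 0.5, 0.5]
```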
KV head count
5 attention heads and 5 KV heads, i.e. standard multi-head attention (MHA) rather than grouped-query attention.
parameters: {"heads":5,"kv_heads":5}
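With kv_heads equal to heads, the grouped-query "group size" is 1, which is just plain MHA: every query head gets its own K/V head.

```python
heads, kv_heads = 5, 5
group_size = heads // kv_heads   # query heads sharing one K/V head
is_mha = group_size == 1         # group size 1 means no K/V sharing at all
```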
U-Net skip connections
Residual skip connections back to the initial embedding x0, weighted by learned residual lambdas.
parameters: null
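A sketch of the x0 skip with learned residual lambdas (modded-nanoGPT-style mechanics are assumed here; the PR's exact placement may differ): a layer's input is a learned-weighted mix of the running residual stream and the initial embedding x0.

```python
def skip_mix(h, x0, lambdas=(1.0, 0.0)):
    """lambdas are learned scalars; (1, 0) recovers a plain residual stream."""
    l_h, l_x0 = lambdas
    return [l_h * a + l_x0 * b for a, b in zip(h, x0)]

# An even mix of the residual stream and the initial embedding.
out = skip_mix([2.0, 4.0], [1.0, 1.0], lambdas=(0.5, 0.5))  # -> [1.5, 2.5]
```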
SSSL
Sliding-window attention in a short-short-short-long pattern: three short-window layers followed by one long-window layer in each group of 4 layers.
parameters: null
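The SSSL schedule expressed as a layer-index rule (the window sizes below are illustrative; the PR only states the 3-short / 1-long ratio): within each group of four layers, the first three use a short sliding window and the fourth a long one.

```python
def window_for_layer(layer_idx, short=512, long=2048):
    """Every 4th layer (index 3 mod 4) attends over the long window."""
    return long if layer_idx % 4 == 3 else short

pattern = [window_for_layer(i) for i in range(8)]
# -> [512, 512, 512, 2048, 512, 512, 512, 2048]
```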
Sequence Length
train_length: 2048
eval_length: 2048
Optimizer
Muon
weight_decay: null
momentum: 0.95
other_params: {"lr":0.1}
Adam
weight_decay: null
momentum: null
other_params: {"lr":0.6,"scope":"embeddings and scalars"}
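A sketch of the optimizer split (a common Muon convention, assumed here rather than taken from the PR's code): Muon handles 2-D matrix parameters, while Adam handles embeddings and scalar parameters. The parameter names and shapes are illustrative stand-ins.

```python
def route(name, shape):
    """Route a parameter to Muon (matrices) or Adam (embeddings, scalars)."""
    if "embed" in name or len(shape) < 2:
        return "adam"    # lr 0.6 per the PR
    return "muon"        # lr 0.1, momentum 0.95 per the PR

params = {
    "token_embed.weight": (50257, 640),  # embedding table -> Adam
    "attn.qkv.weight": (1920, 640),      # weight matrix   -> Muon
    "ln.scale": (640,),                  # 1-D scalar-ish  -> Adam
}
routing = {n: route(n, s) for n, s in params.items()}
```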
Weight Averaging
EMA
parameters: {"decay":"0.995-0.998"}
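A minimal EMA sketch: a shadow copy of the weights tracks the training weights with a decay in the PR's stated 0.995-0.998 range, and evaluation uses the shadow copy rather than the raw weights.

```python
def ema_update(shadow, params, decay=0.995):
    """One EMA step: shadow <- decay * shadow + (1 - decay) * params."""
    return [decay * s + (1 - decay) * p for s, p in zip(shadow, params)]

shadow = [0.0]
for step in range(1000):
    shadow = ema_update(shadow, [1.0], decay=0.995)
# After many steps the shadow converges toward the (constant) parameter.
```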
Compression
zlib
level: null
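The compression step with the stdlib `zlib` module (the level is not recorded in the PR; `level=6` is zlib's default trade-off and is used here only as an illustration for compressing a checkpoint blob).

```python
import zlib

blob = b"model-weights " * 1000          # stand-in for serialized weights
packed = zlib.compress(blob, level=6)    # assumed level; PR leaves it null
restored = zlib.decompress(packed)       # round-trips losslessly
```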
Other
Automated ablation framework that iteratively tested architecture and hyperparameter configurations across multiple sweep rounds.
parameters: {"configs_tested":50,"sweep_rounds":5}
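A toy sketch of the iterative sweep loop; the real autoresearch framework's interface is unknown, so everything here is illustrative except the rough counts (~50 configs over 5 sweep rounds, matching the PR's parameters), and the stand-in `evaluate` plays the role of a full training run returning val_bpb.

```python
import random

random.seed(0)

def evaluate(config):
    # Stand-in for a training run; lower val_bpb is better.
    return 1.2 - 0.01 * config["mlp_x"] + random.uniform(0, 0.02)

best = None
for round_idx in range(5):                    # sweep_rounds: 5
    for _ in range(10):                       # ~50 configs tested in total
        cfg = {"mlp_x": random.choice([3, 4])}
        bpb = evaluate(cfg)
        if best is None or bpb < best[0]:
            best = (bpb, cfg)                 # keep the best config so far
```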
Novel Contributions
- Value embeddings with gating as the main performance improvement
- MLP 3x chosen over 4x to allow more training steps within the wallclock budget
- Automated ablation framework (autoresearch) for iterative architecture and hyperparameter search
- SSSL sliding window attention pattern
- Muon optimizer for matrix parameters with Adam for embeddings and scalars