PR #1141

open

Non-record: AutoResearch Value Embeddings + MLP3x, 1.1801 bpb (1x RTX 4090)

by ivanontech
val_bpb: 1.1801
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.35 MB

Training Techniques

Architecture
  • MLP3x: widens the feedforward layers with a 3x MLP multiplier. parameters: {"mlp_multiplier":3}
  • Value Residual: adds value-embedding-style residual features to improve performance. parameters: {"parameters":31500000}
  • GQA: grouped-query attention with fewer KV heads than query heads. parameters: {"heads":8,"kv_heads":4}
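A minimal NumPy sketch of two of the architecture choices above, grouped-query attention with 8 query / 4 KV heads and a 3x-wide feedforward (function names and shapes are illustrative assumptions, not this PR's actual code):

```python
import numpy as np

def gqa_attention(x, wq, wk, wv, n_heads=8, n_kv_heads=4):
    # Grouped-query attention: n_heads query heads share n_kv_heads
    # key/value heads, shrinking the KV projections and KV cache.
    T, D = x.shape
    hd = D // n_heads                        # per-head dimension
    q = (x @ wq).reshape(T, n_heads, hd)
    k = (x @ wk).reshape(T, n_kv_heads, hd)  # wk/wv project to n_kv_heads * hd
    v = (x @ wv).reshape(T, n_kv_heads, hd)
    group = n_heads // n_kv_heads
    k = np.repeat(k, group, axis=1)          # each KV head serves `group` query heads
    v = np.repeat(v, group, axis=1)
    att = np.einsum('thd,shd->hts', q, k) / np.sqrt(hd)
    att = np.exp(att - att.max(axis=-1, keepdims=True))
    att /= att.sum(axis=-1, keepdims=True)   # softmax over source positions
    return np.einsum('hts,shd->thd', att, v).reshape(T, D)

def mlp3x(x, w_in, w_out):
    # Feedforward with mlp_multiplier=3: hidden width is 3 * d_model.
    return np.maximum(x @ w_in, 0.0) @ w_out  # w_in: (D, 3D), w_out: (3D, D)
```

With heads=8 and kv_heads=4, the K/V projections are half the size of the Q projection, which is where GQA saves parameters and cache memory.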
Compression
  • zlib (level: null)
Quantization
  • int8 (bits: 8, scope: all)
Sequence Length
  • train_length: null, eval_length: 8192
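A sketch of how the int8 + zlib artifact packing listed above might work, assuming symmetric per-tensor quantization (the function names and the per-tensor scale choice are assumptions, not the submission's verified pipeline):

```python
import zlib
import numpy as np

def quantize_int8(w):
    # Symmetric per-tensor int8 quantization ("bits: 8, scope: all"):
    # map the float tensor onto [-127, 127] with a single scale.
    scale = float(np.abs(w).max()) / 127.0 or 1.0  # avoid zero scale
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def pack_artifact(tensors):
    # Quantize each tensor, then zlib-compress the raw int8 bytes;
    # dequantization is q.astype(np.float32) * scale.
    packed = []
    for w in tensors:
        q, scale = quantize_int8(w)
        packed.append((zlib.compress(q.tobytes()), scale, q.shape))
    return packed
```

Quantizing to int8 alone quarters a float32 artifact; zlib then squeezes out remaining redundancy in the byte stream, consistent with the 15.35 MB artifact size reported above.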

Novel Contributions

  • Automated ablation framework with 50+ configurations across 5 sweep rounds
  • Finding that value embeddings provide an approximately 0.19 bpb improvement over the baseline
  • Demonstration that MLP3x outperforms deeper models at this parameter scale
  • Competitive non-record submission trained on a single RTX 4090
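The value-embedding idea credited with the ~0.19 bpb gain can be sketched roughly as follows, assuming modded-nanogpt-style learned per-token embeddings mixed into the attention values (the fixed gate and all names here are illustrative assumptions, not this PR's exact formulation):

```python
import numpy as np

class ValueEmbedding:
    # A learned per-token embedding added to the attention values,
    # giving each input token a direct residual path into the
    # attention output at every layer that uses it.
    def __init__(self, vocab_size, d_model, gate=0.5, seed=0):
        rng = np.random.default_rng(seed)
        self.table = rng.standard_normal((vocab_size, d_model)) * 0.02
        self.gate = gate  # hypothetical fixed mixing weight

    def __call__(self, v, token_ids):
        # v: (T, d_model) attention values; token_ids: (T,) input ids
        return v + self.gate * self.table[token_ids]
```

In training, `table` and (in some variants) `gate` would be learned parameters; the {"parameters":31500000} figure above presumably counts this added embedding capacity.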