PR #1141

open

Non-record: AutoResearch Value Embeddings + MLP3x, 1.1801 bpb (1x RTX 4090)

by ivanontech
val_bpb: 1.1801
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.35 MB

Training Techniques

Architecture
  • MLP3x: widens the feedforward layers with a 3x MLP multiplier. parameters: {"mlp_multiplier":3}
  • Value Residual: adds value-embedding-style residual features to improve performance. parameters: {"parameters":31500000}
  • GQA: grouped-query attention with fewer KV heads than query heads. parameters: {"heads":8,"kv_heads":4}
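A minimal NumPy sketch of two of the architecture choices above, grouped-query attention with 8 query / 4 KV heads and a 3x-wide feedforward (function names and shapes are illustrative assumptions, not this PR's actual code):

```python
import numpy as np

def gqa_attention(x, wq, wk, wv, n_heads=8, n_kv_heads=4):
    # Grouped-query attention: n_heads query heads share n_kv_heads
    # key/value heads, shrinking the KV projections and KV cache.
    T, D = x.shape
    hd = D // n_heads                        # per-head dimension
    q = (x @ wq).reshape(T, n_heads, hd)
    k = (x @ wk).reshape(T, n_kv_heads, hd)  # wk/wv project to n_kv_heads * hd
    v = (x @ wv).reshape(T, n_kv_heads, hd)
    group = n_heads // n_kv_heads
    k = np.repeat(k, group, axis=1)          # each KV head serves `group` query heads
    v = np.repeat(v, group, axis=1)
    att = np.einsum('thd,shd->hts', q, k) / np.sqrt(hd)
    att = np.exp(att - att.max(axis=-1, keepdims=True))
    att /= att.sum(axis=-1, keepdims=True)   # softmax over source positions
    return np.einsum('hts,shd->thd', att, v).reshape(T, D)

def mlp3x(x, w_in, w_out):
    # Feedforward with mlp_multiplier=3: hidden width is 3 * d_model.
    return np.maximum(x @ w_in, 0.0) @ w_out  # w_in: (D, 3D), w_out: (3D, D)
```

With heads=8 and kv_heads=4, the K/V projections are half the size of the Q projection, which is where GQA saves parameters and cache memory.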
Compression
  • zlib (level: null)
Quantization
  • int8 (bits: 8, scope: all)
Sequence Length
  • train_length: null, eval_length: 8192
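A sketch of how the int8 + zlib artifact packing listed above might work, assuming symmetric per-tensor quantization (the function names and the per-tensor scale choice are assumptions, not the submission's verified pipeline):

```python
import zlib
import numpy as np

def quantize_int8(w):
    # Symmetric per-tensor int8 quantization ("bits: 8, scope: all"):
    # map the float tensor onto [-127, 127] with a single scale.
    scale = float(np.abs(w).max()) / 127.0 or 1.0  # avoid zero scale
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def pack_artifact(tensors):
    # Quantize each tensor, then zlib-compress the raw int8 bytes;
    # dequantization is q.astype(np.float32) * scale.
    packed = []
    for w in tensors:
        q, scale = quantize_int8(w)
        packed.append((zlib.compress(q.tobytes()), scale, q.shape))
    return packed
```

Quantizing to int8 alone quarters a float32 artifact; zlib then squeezes out remaining redundancy in the byte stream, consistent with the 15.35 MB artifact size reported above.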

Novel Contributions

  • Automated ablation framework with 50+ configurations across 5 sweep rounds
  • Finding that value embeddings provide an approximately 0.19 bpb improvement over the baseline
  • Demonstration that MLP3x outperforms deeper models at this parameter scale
  • Competitive non-record submission trained on a single RTX 4090
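The value-embedding idea credited with the ~0.19 bpb gain can be sketched roughly as follows, assuming modded-nanogpt-style learned per-token embeddings mixed into the attention values (the fixed gate and all names here are illustrative assumptions, not this PR's exact formulation):

```python
import numpy as np

class ValueEmbedding:
    # A learned per-token embedding added to the attention values,
    # giving each input token a direct residual path into the
    # attention output at every layer that uses it.
    def __init__(self, vocab_size, d_model, gate=0.5, seed=0):
        rng = np.random.default_rng(seed)
        self.table = rng.standard_normal((vocab_size, d_model)) * 0.02
        self.gate = gate  # hypothetical fixed mixing weight

    def __call__(self, v, token_ids):
        # v: (T, d_model) attention values; token_ids: (T,) input ids
        return v + self.gate * self.table[token_ids]
```

In training, `table` and (in some variants) `gate` would be learned parameters; the {"parameters":31500000} figure above presumably counts this added embedding capacity.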