PR #1141 (open)
Non-record: AutoResearch Value Embeddings + MLP3x, 1.1801 bpb (1x RTX 4090)
by ivanontech
val_bpb: 1.1801
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.35 MB
Training Techniques
Architecture: MLP3x
Uses a 3x MLP multiplier to widen the feedforward layers.
parameters: {"mlp_multiplier":3}
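A minimal sketch of what a 3x-wide feedforward block looks like, assuming a standard two-matrix MLP; the class name, weight initialization, and ReLU activation are illustrative stand-ins (the PR does not specify the activation):

```python
import numpy as np

class MLP3x:
    # Feedforward block whose hidden width is mlp_multiplier * d_model,
    # i.e. 3 * d_model here instead of the common 4 * d_model.
    def __init__(self, d_model, mlp_multiplier=3, seed=0):
        rng = np.random.default_rng(seed)
        hidden = mlp_multiplier * d_model
        self.w_in = rng.standard_normal((d_model, hidden)) * 0.02
        self.w_out = rng.standard_normal((hidden, d_model)) * 0.02

    def __call__(self, x):
        # x: (T, d_model) -> (T, d_model), widened in the middle
        a = np.maximum(x @ self.w_in, 0.0)  # ReLU stand-in; actual activation unknown
        return a @ self.w_out
```

The multiplier trades depth for width at fixed parameter count, which is what the "MLP3x outperforms deeper models" contribution below refers to.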
Value Residual
Adds value-embedding-style residual features to improve performance.
parameters: {"parameters":31500000}
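One common form of value embeddings adds a learned per-token embedding into the attention value path; the sketch below is a hypothetical illustration of that idea (class name, gating scalar, and shapes are assumptions, not the PR's exact implementation):

```python
import numpy as np

class ValueEmbedding:
    # Hypothetical sketch: a learned per-token embedding table whose rows
    # are added to the attention value projections as a residual feature.
    def __init__(self, vocab_size, d_model, seed=0):
        rng = np.random.default_rng(seed)
        self.table = rng.standard_normal((vocab_size, d_model)) * 0.02

    def __call__(self, v, token_ids):
        # v: (T, d_model) value projections; token_ids: (T,) input token ids
        return v + self.table[token_ids]
```

The extra table is what accounts for the technique's added parameter budget.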
GQA
Uses grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
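With 8 query heads and 4 KV heads, each KV head is shared by 2 query heads. A minimal numpy sketch of grouped query attention under those parameters (function names and shapes are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gqa(q, k, v, n_heads=8, n_kv_heads=4):
    # q: (n_heads, T, d); k, v: (n_kv_heads, T, d).
    # Each KV head serves n_heads // n_kv_heads query heads.
    groups = n_heads // n_kv_heads
    k = np.repeat(k, groups, axis=0)  # broadcast KV heads to match query heads
    v = np.repeat(v, groups, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v
```

Halving the KV heads halves the KV cache and the K/V projection parameters while keeping 8-way query diversity.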
Compression: zlib
level: null
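A sketch of how a serialized artifact can be zlib-compressed for size reporting; since the PR leaves the level unspecified (null), the sketch falls back to zlib's default level (-1). Function names are illustrative:

```python
import zlib

def compress_artifact(raw: bytes, level: int = -1) -> bytes:
    # level -1 selects zlib's default compression level,
    # matching the unspecified level in this submission.
    return zlib.compress(raw, level)

def decompress_artifact(blob: bytes) -> bytes:
    return zlib.decompress(blob)
```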
Quantization: int8
bits: 8
scope: all
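With scope "all", every weight tensor is stored in 8 bits. A minimal sketch of symmetric per-tensor int8 quantization, one standard way to do this (the PR does not state its exact scheme, so the scale computation here is an assumption):

```python
import numpy as np

def quantize_int8(w):
    # Symmetric per-tensor quantization: scale so that max |w| maps to 127.
    scale = np.abs(w).max() / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover an approximation of the original float weights.
    return q.astype(np.float32) * scale
```

Storing int8 values plus one float scale per tensor is roughly a 4x size reduction versus float32, consistent with the small 15.35 MB artifact.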
Sequence Length
train_length: null
eval_length: 8192
Novel Contributions
- Automated ablation framework with 50+ configurations across 5 sweep rounds
- Finding that value embeddings provide about a 0.19 bpb improvement over baseline
- Demonstration that MLP3x outperforms deeper models at this parameter scale
- Competitive non-record submission trained on a single RTX 4090