val_bpb: 1.4530
Architecture: Standard GPT / Transformer
Optimizer: Muon
Artifact Size: 15.35 MB
Training Techniques
Architecture
MLP3x
Increased the MLP hidden width to 3x the model dimension, giving a hidden dimension of 1536 instead of the default 1024 (a 2x multiplier).
parameters: {"layers":9,"model_dim":512,"heads":8,"kv_heads":4,"ffn_hidden_dim":1536}
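The listed parameters can be captured in a minimal config sketch. This is illustrative only: the `GPTConfig` class and `mlp_mult` field are assumed names, not the project's actual code.

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    # values taken from the reported parameters
    layers: int = 9
    model_dim: int = 512
    heads: int = 8
    kv_heads: int = 4
    mlp_mult: int = 3  # 3x MLP multiplier instead of the default 2x

    @property
    def ffn_hidden_dim(self) -> int:
        # hidden width is derived from the model dimension
        return self.mlp_mult * self.model_dim

cfg = GPTConfig()  # cfg.ffn_hidden_dim == 1536
```

Deriving `ffn_hidden_dim` from `model_dim` keeps the sweep to a single knob (`mlp_mult`) rather than two coupled ones.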
KV head count
Used grouped-query attention with fewer KV heads (4) than query heads (8), so each KV head is shared by two query heads.
parameters: {"heads":8,"kv_heads":4}
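With 8 query heads and 4 KV heads, each KV head serves a group of two query heads. A minimal numpy sketch of that sharing, assuming per-head tensors of shape `(heads, seq, head_dim)`; the function name and layout are assumptions, not the project's API:

```python
import numpy as np

def grouped_query_attention(q, k, v):
    # q: (heads, seq, d); k, v: (kv_heads, seq, d) with kv_heads dividing heads
    heads, kv_heads = q.shape[0], k.shape[0]
    group = heads // kv_heads          # query heads per KV head (2 here)
    k = np.repeat(k, group, axis=0)    # broadcast KV heads to match query heads
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))  # stable softmax
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v
```

The KV cache shrinks in proportion to `kv_heads / heads` (here 2x), which is the usual motivation for GQA at small memory budgets.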
Quantization
int8
bits: 8
scope: all
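The exact quantizer is not specified; a per-tensor symmetric scheme is one common way to realize `bits: 8, scope: all`, and the sketch below assumes that scheme:

```python
import numpy as np

def quantize_int8(w):
    # symmetric per-tensor int8: a single float scale maps [-max|w|, max|w|] to [-127, 127]
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # recover an approximation of the original weights
    return q.astype(np.float32) * scale
```

With `scope: all`, every tensor in the artifact would pass through this, which is what brings the serialized size down before compression.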
Compression
zlib
level: null
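`level: null` presumably means zlib's library default. A hedged sketch of compressing serialized artifact bytes that way (the helper name is an assumption):

```python
import zlib

def compress_artifact(raw: bytes, level=None) -> bytes:
    # level=None falls back to Z_DEFAULT_COMPRESSION (zlib's built-in default, -1)
    return zlib.compress(raw, zlib.Z_DEFAULT_COMPRESSION if level is None else level)
```

Applied after int8 quantization, zlib exploits the remaining redundancy in the weight bytes; the two stages together account for the 15.35 MB artifact size.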
Optimizer
Muon
weight_decay: null
momentum: null
other_params: null
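All Muon hyperparameters are reported as null, so the sketch below only illustrates the general shape of a Muon update: momentum SGD whose 2D update matrix is approximately orthogonalized by a Newton-Schulz iteration. The quintic coefficients follow the commonly published values; the learning rate, momentum, and function names are assumptions.

```python
import numpy as np

def newton_schulz(G, steps=5):
    # approximately orthogonalize G: push all singular values toward 1
    a, b, c = 3.4445, -4.7750, 2.0315   # published quintic coefficients
    X = G / (np.linalg.norm(G) + 1e-7)  # normalize so singular values are < 1
    transposed = G.shape[0] > G.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

def muon_step(param, grad, buf, lr=0.02, momentum=0.95):
    # momentum SGD with an orthogonalized update (Nesterov variant omitted for brevity)
    buf = momentum * buf + grad
    return param - lr * newton_schulz(buf), buf
```

Orthogonalizing the update equalizes its singular values, which is the core idea Muon applies to the 2D weight matrices (embeddings and scalars are typically handled by a separate optimizer).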
Novel Contributions
- Automated sweep across 11 configurations on a single RTX 4090.
- Found that wider MLPs (3x) outperformed deeper stacking (12 layers) at this parameter budget.
- Demonstrated competitive, though not record-setting, performance on consumer hardware.
- Used a 3x MLP multiplier with a 9-layer, 512-dimensional GPT-style model.