PR #845
Status: open
12 layers GPT | MLP_MULT reduction | VE and BIGRAM modifications
by rubenbalbastre
val_bpb
1.1407
Architecture
GPT
Optimizer
Parallel Muon
Artifact Size
16MB
Training Techniques
Architecture
GPT depth increase
Increased the model from 11 to 12 layers while reducing parameters elsewhere to stay near the 16MB limit.
parameters: {"layers":12}
MLP_MULT reduction
Reduced MLP width multiplier to free parameters for the extra layer.
parameters: {"mlp_mult":2.6}
Bigram embedding modification
Adjusted bigram vocabulary size and bigram embedding dimension to trade off capacity and parameter count.
parameters: {"bigram_vocab_size":2048,"bigram_dim":256}
Token embedding / VE dimension reduction
Reduced VE dimension to save parameters for the deeper model.
parameters: {"ve_dim":64}
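The four architecture changes above can be collected into a single configuration. The dict below only mirrors the `parameters` entries listed in this PR; any surrounding model code would need the actual field names from the repository, so treat this as an illustrative sketch rather than the PR's real config object:

```python
# Hyperparameters as listed in the PR; key names follow its "parameters" entries.
config = {
    "layers": 12,                # up from 11 in the baseline
    "mlp_mult": 2.6,             # reduced MLP width multiplier to fund the extra layer
    "bigram_vocab_size": 2048,   # adjusted bigram vocabulary
    "bigram_dim": 256,           # adjusted bigram embedding dimension
    "ve_dim": 64,                # reduced VE dimension
}
```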
Weight Averaging
EMA
parameters: null
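The PR lists EMA weight averaging with no parameters, so the decay value below is illustrative rather than taken from the PR. A minimal sketch of the technique: keep a shadow copy of the weights and blend the current weights into it after each optimizer step.

```python
def ema_update(ema_params, model_params, decay=0.9):
    """Blend current weights into the running average, in place.

    decay close to 1.0 means the average changes slowly; 0.9 here is
    an illustrative value, not the PR's setting.
    """
    for name, value in model_params.items():
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * value
    return ema_params

# Usage: update the shadow copy after each training step, then
# evaluate with ema_params instead of the raw weights.
ema = {"w": 0.0}
for step_weight in (1.0, 1.0):
    ema_update(ema, {"w": step_weight})
```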
Test-Time Training
LegalTTT
parameters: null
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: null
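Parallel Muon shards the Muon update across devices; the PR records no optimizer hyperparameters, so the following is only a minimal single-device sketch of the core Muon step (momentum accumulation followed by Newton-Schulz orthogonalization of the update), using NumPy. The quintic coefficients are the commonly used constants; function names, learning rate, and momentum value are illustrative assumptions, not taken from this PR.

```python
import numpy as np

def newton_schulz(G, steps=5, eps=1e-7):
    """Approximately orthogonalize a 2-D matrix via a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315   # standard quintic coefficients
    X = G / (np.linalg.norm(G) + eps)   # scale so the spectral norm is <= 1
    transpose = X.shape[0] > X.shape[1]
    if transpose:                        # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    return X.T if transpose else X

def muon_step(param, grad, momentum_buf, lr=0.02, beta=0.95):
    """One (non-parallel) Muon update for a 2-D weight matrix; values are illustrative."""
    momentum_buf[...] = beta * momentum_buf + grad
    param -= lr * newton_schulz(momentum_buf)
    return param
```

The orthogonalization is what distinguishes Muon from plain momentum SGD: it equalizes the singular values of the update before it is applied.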
Novel Contributions
- Extended the baseline GPT from 11 to 12 layers.
- Reduced MLP_MULT to reallocate parameters to depth.
- Modified bigram vocabulary size and bigram embedding dimension.
- Reduced token embedding dimension (VE_DIM) to fit within the 16MB budget.
- Reported multiple parameter trade-off experiments and their validation bpb results.