PR #191 (open)

Record: Compression-Funded MLP3x (val_bpb=1.1598)

by chris-buckley
val_bpb: 1.1598
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.9 MB

Training Techniques

Quantization
  • int6 (bits: 6; scope: all large weight matrices)
  • fp16 (bits: 16; scope: tied embeddings and last two c_k weights)
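The PR doesn't spell out the quantization scheme beyond the bit width and scope; a minimal symmetric per-tensor int6 sketch (helper names are hypothetical, pure Python for clarity) might look like:

```python
def quantize_int6(weights):
    # Symmetric per-tensor quantization: map floats onto the 6-bit
    # signed range [-31, 31] with a single fp scale per tensor.
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 31.0 if max_abs > 0 else 1.0
    q = [max(-31, min(31, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int6(q, scale):
    # Recover approximate fp weights; per-weight error is at most scale / 2.
    return [qi * scale for qi in q]

weights = [0.5, -1.2, 0.03, 0.9]
q, scale = quantize_int6(weights)
recovered = dequantize_int6(q, scale)
```

At 6 bits per weight versus 16, the large matrices shrink to roughly 3/8 of their fp16 size, which is the artifact budget the MLP widening spends.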
Architecture
  • MLP3x: widened the MLP from 2x to 3x using the saved artifact budget (parameters: {"mlp_mult": 3})
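Back-of-envelope for what the widening costs in parameters (d_model = 768 is illustrative; the card doesn't state the model dimensions):

```python
def mlp_params(d_model, mlp_mult):
    # Two bias-free projections: d_model -> mlp_mult*d_model -> d_model.
    d_hidden = mlp_mult * d_model
    return d_model * d_hidden + d_hidden * d_model

base = mlp_params(768, 2)  # the 2x baseline
wide = mlp_params(768, 3)  # this PR's MLP3x
```

Going from 2x to 3x grows the MLP by 50%, which is what the int6 savings on the large matrices have to cover.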
  • tied embeddings: input and output embeddings are tied
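Weight tying aliases the output head to the input embedding table, so only one copy of that matrix lands in the artifact. A toy sketch:

```python
class TinyLM:
    def __init__(self, vocab_size, d_model):
        # Input embedding table: one row of d_model floats per token.
        self.wte = [[0.0] * d_model for _ in range(vocab_size)]
        # Output head reuses the same object: no second matrix to store.
        self.lm_head = self.wte

model = TinyLM(vocab_size=4, d_model=2)
model.wte[0][0] = 1.5  # an update to the embedding is visible in the head
```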
  • KV head count: uses fewer KV heads than attention heads (parameters: {"num_heads": 8, "num_kv_heads": 4})
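With num_kv_heads=4 serving num_heads=8, each KV head is shared by a group of two query heads (grouped-query attention), halving the K/V projection weights and cache. A shape-only sketch of the grouping:

```python
def expand_kv_heads(kv_heads, num_heads):
    # Each stored KV head serves num_heads / num_kv_heads query heads.
    group_size = num_heads // len(kv_heads)
    return [h for h in kv_heads for _ in range(group_size)]

kv = ["kv0", "kv1", "kv2", "kv3"]    # 4 stored KV heads
per_query = expand_kv_heads(kv, 8)   # broadcast to 8 query heads
```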
Compression
  • zlib (level: null)
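level: null presumably falls back to zlib's default. Compression of the serialized weight bytes is lossless, so it stacks on top of quantization; a sketch (the buffer contents are illustrative stand-ins for packed quantized weights):

```python
import struct
import zlib

# A regular integer buffer stands in for packed quantized weights.
raw = struct.pack("256i", *range(256))
compressed = zlib.compress(raw)         # default compression level
restored = zlib.decompress(compressed)  # lossless round trip
```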
Evaluation
  • sliding window eval (parameters: {"stride": 256})
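With stride 256 and a 2048-token window, each window after the first re-reads up to 1792 tokens of context but scores only its new tokens, so every validation token is scored exactly once. A sketch of the span bookkeeping (helper name is illustrative):

```python
def sliding_eval_spans(n_tokens, window=2048, stride=256):
    # Returns (start, end, score_from): feed tokens[start:end] to the model,
    # count loss only over tokens[score_from:end].
    spans = []
    start = 0
    while True:
        end = min(start + window, n_tokens)
        # Never re-score tokens covered by the previous window.
        score_from = 0 if start == 0 else max(spans[-1][1], end - stride)
        spans.append((start, end, score_from))
        if end == n_tokens:
            break
        start += stride
    return spans

spans = sliding_eval_spans(2560)
```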
Sequence Length
  • sequence_length (train_length: 2048; eval_length: 2048)
LR Schedule
  • warmup/warmdown (parameters: {"warmup_steps": 1500, "warmdown_iters": 3000})
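A trapezoidal schedule matching these parameters: linear warmup over 1500 steps, a flat top, then a linear warmdown over the final 3000 iterations (total_steps is illustrative; the base LR of 0.02 comes from the Other section):

```python
def lr_at(step, total_steps, base_lr, warmup_steps=1500, warmdown_iters=3000):
    # Linear warmup, flat top, linear warmdown to zero.
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    if step > total_steps - warmdown_iters:
        return base_lr * (total_steps - step) / warmdown_iters
    return base_lr

peak = lr_at(5000, 10000, 0.02)  # mid-run: full learning rate
```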
Regularization
  • gradient clipping (parameters: {"grad_clip_norm": 0.3})
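Global-norm clipping at 0.3: if the combined gradient norm exceeds the threshold, every gradient is scaled down proportionally. A pure-Python sketch over a flat gradient list:

```python
import math

def clip_grad_norm(grads, max_norm=0.3):
    # Rescale all gradients so their global L2 norm is at most max_norm.
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm

clipped, norm = clip_grad_norm([3.0, 4.0])  # global norm 5.0, well over 0.3
```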
Optimizer
  • Muon (momentum: 0.99; weight_decay: null; other_params: {"momentum_warmup_start": 0.92, "momentum_warmup_steps": 1500})
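The momentum warmup ramps Muon's momentum linearly from 0.92 to its final 0.99 over the first 1500 steps; a sketch of that schedule (function name is illustrative):

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    # Linear ramp during warmup, then hold at the final momentum.
    frac = min(step / warmup_steps, 1.0)
    return start + frac * (end - start)

early = muon_momentum(0)     # starts at 0.92
late = muon_momentum(1500)   # holds at 0.99 thereafter
```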
Other
  • Used the seq2048 long-context training recipe with tuned learning rates and scoring on the full validation split (parameters: {"train_batch_tokens": 786432, "matrix_lr": 0.02, "scalar_lr": 0.02, "tied_embed_lr": 0.03})

Novel Contributions

  • Int6 block-weight compression to free artifact budget
  • Widening the MLP from 2x to 3x using the saved bytes
  • Keeping tied embeddings and selected attention weights in fp16 while compressing large matrices
  • Seq2048 long-context training recipe
  • Stride-256 sliding-window evaluation
  • Muon momentum warmup and tuned learning rates