val_bpb: 1.1933
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.77 MB
Training Techniques
Architecture
MLP3x
Uses a 3x MLP expansion width (hidden=1536) instead of the baseline 2x width.
parameters: {"layers":7,"width":512,"hidden":1536,"attention_heads":8,"kv_heads":4}
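As a rough sanity check on the parameter budget, here is a sketch of the non-embedding parameter count for this config. It assumes a plain two-matrix (non-gated) MLP with no biases, and excludes embeddings and norm gains, so it is an estimate rather than the exact count:

```python
# Per-layer parameter count for the 7x512 config (assumptions: plain
# two-matrix MLP, no biases; embeddings and norms excluded).
layers, width, hidden = 7, 512, 1536
heads, kv_heads = 8, 4
head_dim = width // heads                   # 64

q_proj = width * heads * head_dim           # query projection
kv_proj = 2 * width * kv_heads * head_dim   # grouped K and V projections
o_proj = heads * head_dim * width           # attention output projection
mlp = 2 * width * hidden                    # up + down projections

total = layers * (q_proj + kv_proj + o_proj + mlp)
print(total)  # 16515072 non-embedding parameters, ~16.5M
```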
tied embeddings
Input and output embeddings are tied.
parameters: null
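Tying here means the LM head reuses the input embedding matrix, so those parameters are stored (and quantized) only once. A minimal illustrative sketch, with a hypothetical class and toy sizes:

```python
class TiedLM:
    """Toy model skeleton showing input/output embedding tying."""
    def __init__(self, vocab_size, width):
        # One matrix serves as both the input embedding and the LM head.
        self.embed = [[0.0] * width for _ in range(vocab_size)]
        self.lm_head = self.embed  # tied: same underlying object

m = TiedLM(vocab_size=256, width=8)
assert m.lm_head is m.embed
```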
KV head count
Uses grouped-query attention with fewer KV heads than attention heads.
parameters: {"attention_heads":8,"kv_heads":4}
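With 8 query heads over 4 KV heads, each KV head is shared by 2 query heads. The query-to-KV head mapping can be sketched as:

```python
heads, kv_heads = 8, 4
group_size = heads // kv_heads                 # query heads per KV head
kv_of = [q // group_size for q in range(heads)]
print(kv_of)  # [0, 0, 1, 1, 2, 2, 3, 3]
```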
extra RMSNorm
Adds an extra RMSNorm before attention and MLP output projections.
parameters: null
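RMSNorm itself is standard; a minimal pure-Python version (learnable gain omitted), independent of where the extra copies are placed:

```python
import math

def rmsnorm(x, eps=1e-6):
    # Scale a vector to approximately unit RMS.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

y = rmsnorm([3.0, 4.0])
```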
Sequence Length
sequence_length
train_length: 4096
eval_length: null
LR Schedule
warmdown
parameters: {"warmdown_iters":3000}
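A warmdown with `warmdown_iters=3000` presumably means the LR is held constant and then decayed linearly to zero over the final 3000 steps; a sketch under that assumption:

```python
def lr_scale(step, total_iters, warmdown_iters=3000):
    # Constant LR until the warmdown window, then linear decay to zero.
    decay_start = total_iters - warmdown_iters
    if step < decay_start:
        return 1.0
    return max(0.0, (total_iters - step) / warmdown_iters)
```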
Other
other
Uses a lower tied-embedding learning rate to smooth the weight distribution and reduce the quantization gap.
parameters: {"tied_embed_lr":0.01}
other
Uses a lower matrix learning rate for smoother Muon updates.
parameters: {"matrix_lr":0.03}
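Taken together, the two learning-rate tweaks amount to separate optimizer parameter groups. A hypothetical grouping (the group names and the Muon/AdamW split for embeddings are assumptions, not stated above):

```python
param_groups = [
    # Hidden-layer matrices, updated by Muon at the lowered matrix LR.
    {"params": "matrices", "optimizer": "Muon", "lr": 0.03},
    # Tied embedding matrix at a lowered LR, keeping its weights smooth
    # and hence easier to quantize.
    {"params": "tied_embedding", "optimizer": "AdamW", "lr": 0.01},
]
```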
other
Uses a reduced logit softcap to tighten the output distribution and aid quantization.
parameters: {"logit_softcap":15}
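Logit softcapping bounds logits with a scaled tanh; with the reduced cap of 15, small logits pass through almost unchanged while large ones saturate near ±15:

```python
import math

def softcap(logit, cap=15.0):
    # Smoothly squashes a logit into the open interval (-cap, cap).
    return cap * math.tanh(logit / cap)
```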
other
Adjusts the qk gain initialization (set to 1).
parameters: {"qk_gain_init":1}
Quantization
int8
bits: 8
scope: all
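The scope "all" suggests every weight tensor is quantized. A sketch of standard symmetric per-tensor int8 quantization, mapping the max-absolute value to 127:

```python
def quantize_int8(values):
    # Symmetric per-tensor quantization: map max-abs to 127.
    # The `or 1.0` guards against an all-zero tensor (scale would be 0).
    scale = max(abs(v) for v in values) / 127 or 1.0
    return [round(v / scale) for v in values], scale

def dequantize(quants, scale):
    return [q * scale for q in quants]

q, s = quantize_int8([-1.0, 0.0, 0.5, 1.0])
```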
Compression
zlib
level: null
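The int8 weight bytes are then zlib-compressed (the level is unspecified above; level 9 below is just an illustration). Smoother weight distributions compress better, which is the stated motivation for the lowered embedding LR:

```python
import zlib

payload = bytes(range(256)) * 64        # stand-in for int8 weight bytes
blob = zlib.compress(payload, level=9)  # compression level is an assumption
assert zlib.decompress(blob) == payload
assert len(blob) < len(payload)         # repetitive data compresses well
```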
Novel Contributions
- 7-layer, 512-dim transformer with 3x MLP width at roughly the same parameter count as the baseline
- Training at sequence length 4096 for improved per-step quality
- Lower tied embedding learning rate to create smoother weights and dramatically reduce quantization gap
- Carefully tuned learning rates for matrix parameters and embeddings
- Reduced logit softcap to improve both training and quantization
- Longer warmdown schedule for better generalization
- Standard int8 quantization with zlib compression and no QAT