val_bpb: 1.3485
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.63 MB
Training Techniques
Architecture
- weight tying: The baseline GPT's tied input/output embeddings are retained.
- U-Net skip connections: U-Net-style skip connections from the baseline are retained.
- RoPE: Rotary positional embeddings.
- ReLU²: Squared-ReLU activation.
- bidirectional attention: Attention is switched from causal to non-causal for masked diffusion training (parameters: {"is_causal": false}).
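The causal-to-bidirectional switch amounts to dropping the triangular mask from scaled dot-product attention. A minimal single-head NumPy sketch, with illustrative names (the entry only specifies `is_causal: false`, not this implementation):

```python
import numpy as np

def attention(q, k, v, is_causal=True):
    """Scaled dot-product attention over a (T, d) sequence.
    is_causal=False gives the bidirectional attention used here
    for masked diffusion training."""
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    if is_causal:
        # Mask out future positions: each query may only attend
        # to keys at the same or earlier positions.
        allowed = np.tril(np.ones((T, T), dtype=bool))
        scores = np.where(allowed, scores, -np.inf)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))
k = rng.standard_normal((4, 8))
v = rng.standard_normal((4, 8))
causal = attention(q, k, v, is_causal=True)   # position 0 sees only itself
bidir = attention(q, k, v, is_causal=False)   # every position sees all others
```

With the causal flag, the first output row equals `v[0]` exactly, since position 0 can attend only to itself; bidirectionally it mixes all four positions.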
Quantization
- int8 (bits: 8, scope: all)
Compression
- zlib (level: unspecified)
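The artifact-size entry is consistent with int8 quantization of the weights followed by zlib compression. A hypothetical round-trip sketch; the symmetric per-tensor scaling and the compression level are assumptions, since the entry leaves both details unspecified:

```python
import zlib
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: scale so max |w| maps to 127."""
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def pack(w):
    """Quantize to int8, then zlib-compress the raw bytes."""
    q, scale = quantize_int8(w)
    blob = zlib.compress(q.tobytes(), level=9)  # level 9 is an assumption
    return blob, scale

def unpack(blob, scale, shape):
    """Decompress and dequantize back to float32."""
    q = np.frombuffer(zlib.decompress(blob), dtype=np.int8).reshape(shape)
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)
blob, scale = pack(w)
w_hat = unpack(blob, scale, w.shape)
```

Rounding is the only error source here, so the reconstruction error is bounded by half a quantization step (`scale / 2`) per weight.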
Optimizer
- Muon (weight_decay, momentum, other hyperparameters: unspecified)
Evaluation
- Quadrature over mask ratios (parameters: {"points": 8, "range": [0.05, 0.95], "rule": "trapezoidal"})
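The evaluation entry can be read as averaging the loss over 8 mask ratios t in [0.05, 0.95] with composite trapezoidal weights. A sketch under that reading; `loss_at_ratio` is a stand-in for a real evaluation pass at a fixed mask ratio:

```python
import numpy as np

def trapezoid_weights(points=8, lo=0.05, hi=0.95):
    """Nodes and weights for composite trapezoidal quadrature on [lo, hi]:
    interior nodes get weight h, endpoints h/2."""
    t = np.linspace(lo, hi, points)
    h = (hi - lo) / (points - 1)
    w = np.full(points, h)
    w[0] = w[-1] = h / 2
    return t, w

def estimate_loss(loss_at_ratio):
    """Approximate the mean loss over mask ratios by quadrature,
    normalizing by the interval length so a constant loss is returned as-is."""
    t, w = trapezoid_weights()
    vals = np.array([loss_at_ratio(ti) for ti in t])
    return float((w * vals).sum() / (t[-1] - t[0]))
```

The trapezoidal rule is exact for constant and linear integrands, which makes the estimator easy to sanity-check before plugging in a real model.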
Other
- Masked Diffusion Language Model training objective: randomly mask tokens, predict the masked tokens with bidirectional context, and weight the loss by 1/t so that it forms an ELBO upper bound on NLL (parameters: {"mask_token_id": 1024, "eps": 0.1}).
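A minimal sketch of that objective's data path, assuming the mask ratio t is drawn uniformly from [eps, 1) and that 1/t multiplies the mean cross-entropy over masked positions; per-token losses are stubbed, since the entry does not specify the model:

```python
import numpy as np

MASK_TOKEN_ID = 1024  # new token appended after the original 1024-entry vocab
EPS = 0.1             # lower bound on the mask ratio, from the entry

def diffusion_batch(tokens, rng):
    """Sample a mask ratio t, replace a t-fraction of tokens with MASK,
    and return inputs, targets, the mask, and the 1/t ELBO loss weight."""
    t = rng.uniform(EPS, 1.0)
    mask = rng.random(tokens.shape) < t
    x = np.where(mask, MASK_TOKEN_ID, tokens)
    return x, tokens, mask, 1.0 / t

def weighted_loss(token_losses, mask, weight):
    """Mean cross-entropy over masked positions only, scaled by 1/t."""
    return weight * token_losses[mask].mean()

tokens = np.arange(32) % 1024
rng = np.random.default_rng(0)
x, targets, mask, w = diffusion_batch(tokens, rng)
```

Only masked positions contribute to the loss; unmasked tokens pass through unchanged and serve purely as bidirectional context.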
Novel Contributions
- Converts the baseline autoregressive GPT into a Masked Diffusion Language Model
- Uses random token masking with bidirectional attention for discrete diffusion training
- Applies 1/t loss weighting to form an ELBO upper bound on NLL
- Approximates evaluation over mask ratios with 8-point trapezoidal quadrature
- Adds a dedicated MASK token by expanding the vocabulary from 1024 to 1025
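The vocabulary expansion in the last bullet can be sketched as appending one row to the embedding matrix; with the weight tying listed above, the same matrix serves the output head, so a single expansion covers both. The small-random initialization of the new row is an assumption:

```python
import numpy as np

def expand_embedding(emb, n_new=1, rng=None):
    """Append rows for new tokens (e.g. MASK at id 1024) to an embedding
    matrix, leaving all existing rows untouched."""
    rng = rng or np.random.default_rng(0)
    # Small random init for the new row(s) -- an assumption; the entry
    # does not say how the MASK embedding is initialized.
    new_rows = 0.02 * rng.standard_normal((n_new, emb.shape[1]))
    return np.concatenate([emb, new_rows], axis=0)

emb = np.zeros((1024, 64))          # original vocab: ids 0..1023
emb2 = expand_embedding(emb)        # expanded: ids 0..1024, MASK = 1024
```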