PR #1403

open

Non-record: MDLM Masked Diffusion (1.3485 BPB)

by Rhoahndur
val_bpb: 1.3485
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.63 MB

Training Techniques

Architecture
  • weight tying: baseline GPT uses tied embeddings (parameters: null)
  • U-Net skip connections: model retains the U-Net-style skip connections from the baseline (parameters: null)
  • RoPE: uses rotary positional embeddings (parameters: null)
  • ReLU²: uses squared ReLU activation (parameters: null)
  • bidirectional attention: attention is changed from causal to non-causal for masked diffusion training (parameters: {"is_causal":false})

Quantization
  • int8 (bits: 8, scope: all)

Compression
  • zlib (level: null)

Optimizer
  • Muon (weight_decay: null, momentum: null, other_params: null)

Evaluation
  • quadrature over mask ratios (parameters: {"points":8,"range":[0.05,0.95],"rule":"trapezoidal"})

Other
  • Masked Diffusion Language Model training objective: randomly mask tokens, predict the masked tokens with bidirectional context, and weight the loss by 1/t to form an ELBO (parameters: {"mask_token_id":1024,"eps":0.1})
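The masked-diffusion objective listed above can be sketched in a few lines. This is a minimal NumPy illustration, not the PR's code: the function names are hypothetical, and the exact noise schedule and the role of `eps` are assumptions (here `eps` is treated as a floor on the sampled mask ratio). The mask token id of 1024 matches the PR's vocabulary expansion from 1024 to 1025.

```python
import numpy as np

MASK_TOKEN_ID = 1024  # PR expands the vocab 1024 -> 1025 for a dedicated MASK token
EPS = 0.1             # from the PR's parameters; treated here as a mask-ratio floor (assumption)

def mdlm_corrupt(tokens, rng):
    """Sample a mask ratio t, then independently replace tokens with MASK."""
    t = rng.uniform(EPS, 1.0)              # schedule is an assumption
    mask = rng.random(tokens.shape) < t    # each token masked with probability t
    noisy = np.where(mask, MASK_TOKEN_ID, tokens)
    return noisy, mask, t

def mdlm_loss(token_log_probs, mask, t):
    """ELBO-style loss: cross-entropy on masked positions, weighted by 1/t.

    token_log_probs[i] is the model's log-probability of the true token at
    position i, computed with bidirectional (non-causal) attention.
    """
    nll = -token_log_probs[mask].mean()
    return nll / t
```

The 1/t weighting is what turns the per-ratio denoising loss into an upper bound on NLL: positions masked at low t are rare but carry proportionally more weight.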

Novel Contributions

  • Converts the baseline autoregressive GPT into a Masked Diffusion Language Model
  • Uses random token masking with bidirectional attention for discrete diffusion training
  • Applies 1/t loss weighting to form an ELBO upper bound on NLL
  • Approximates evaluation over mask ratios with 8-point trapezoidal quadrature
  • Adds a dedicated MASK token by expanding the vocabulary from 1024 to 1025
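The quadrature evaluation can be sketched as follows. The PR specifies the grid (8 points, range [0.05, 0.95], trapezoidal rule); the integrand and the normalization to an average are assumptions, and `loss_at_ratio` is a hypothetical callable standing in for a full evaluation pass at a fixed mask ratio.

```python
import numpy as np

def average_loss_over_mask_ratios(loss_at_ratio, points=8, lo=0.05, hi=0.95):
    """Approximate the mean per-token loss over mask ratios t in [lo, hi]
    using a fixed-grid trapezoidal rule (grid settings from the PR)."""
    ts = np.linspace(lo, hi, points)
    losses = [loss_at_ratio(t) for t in ts]
    total = 0.0
    for i in range(points - 1):
        total += 0.5 * (losses[i] + losses[i + 1]) * (ts[i + 1] - ts[i])
    return total / (hi - lo)  # normalize the integral to an average
```

Because the trapezoidal rule is exact for linear integrands, 8 points suffice when the loss varies smoothly with the mask ratio, which keeps evaluation cost fixed rather than requiring Monte Carlo sampling over t.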