PR #1119 (open)

Notable Non-Record: Text Diffusion (MDLM) — 1.4584 BPB — Masked Diffusion Language Model

by gowtham0992

val_bpb: 1.4584
Architecture: Transformer
Optimizer:
Artifact Size: 11.55 MB

Training Techniques

Quantization
GPTQ
bits: 6
scope: model
Architecture
LeakyReLU
Uses a squared LeakyReLU MLP activation, LeakyReLU(x; negative_slope=0.5)^2, in the Transformer.
parameters: {"negative_slope":0.5,"squared":true}
BigramHash
Includes BigramHash as part of the model features.
parameters: null
SmearGate
Includes SmearGate as part of the model features.
parameters: null
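The squared-LeakyReLU activation listed above is simple to state in code. A minimal PyTorch sketch (the module name is illustrative, not the submission's code):

```python
import torch
import torch.nn as nn

class SquaredLeakyReLU(nn.Module):
    """LeakyReLU with negative_slope=0.5, followed by squaring,
    used as the MLP activation in the Transformer."""

    def __init__(self, negative_slope: float = 0.5):
        super().__init__()
        self.act = nn.LeakyReLU(negative_slope)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Negative inputs are halved, then everything is squared,
        # so the output is non-negative on both sides.
        return self.act(x) ** 2
```

Note that squaring makes the function non-monotonic: a negative input of -2 maps to (-1)^2 = 1.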
Evaluation
sliding window eval
parameters: null
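Sliding-window evaluation scores a long sequence with overlapping windows so every token is predicted with substantial left context. A hedged sketch, assuming a stride-based scheme in which each token is scored exactly once (`per_token_nll`, `window`, and `stride` are illustrative names, not the submission's API):

```python
import math

def sliding_window_bpb(per_token_nll, tokens, window=1024, stride=512,
                       bytes_per_token=1.0):
    """Score `tokens` with overlapping windows of length `window`,
    advancing by `stride`. Each token is counted once, in the window
    where it has the most left context. `per_token_nll(chunk)` returns
    one natural-log NLL per position in `chunk`. Returns bits per byte."""
    total_nll, n_scored, prev_end = 0.0, 0, 0
    for begin in range(0, len(tokens), stride):
        end = min(begin + window, len(tokens))
        nlls = per_token_nll(tokens[begin:end])
        new = end - prev_end              # positions not yet scored
        total_nll += sum(nlls[-new:])     # score only the new tail
        n_scored += new
        prev_end = end
        if end == len(tokens):
            break
    # nats/token -> bits/byte
    return total_nll / n_scored / bytes_per_token / math.log(2)
```

With a byte-level tokenizer (bytes_per_token=1) and a uniform model over 256 symbols, this returns exactly 8 bits per byte, which is a useful sanity check.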
Other
other
Masked diffusion language model training adapted for causal language modeling using continuous-time NELBO.
parameters: {"diffusion_enabled":true,"diffusion_mix":0.5}
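With diffusion_mix at 0.5, training can be read as a per-step coin flip between the diffusion NELBO objective and standard next-token prediction. A minimal sketch; the two loss functions are placeholders, since the PR metadata does not include the training code:

```python
import random

def training_step(batch, diffusion_loss_fn, ar_loss_fn,
                  diffusion_mix=0.5, rng=random):
    """With probability `diffusion_mix`, take a masked-diffusion NELBO
    step; otherwise take a standard next-token prediction step on the
    same batch. Returns the loss and which branch was taken."""
    if rng.random() < diffusion_mix:
        return diffusion_loss_fn(batch), "diffusion"
    return ar_loss_fn(batch), "autoregressive"
```

Over many steps roughly half the updates come from each objective, matching diffusion_mix = 0.5.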
other
Stratified t sampling for diffusion steps to reduce ELBO variance.
parameters: null
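Stratified t sampling splits the unit interval into one bin per batch element and draws one diffusion time uniformly inside each bin, so every batch covers the full range of noise levels instead of clustering by chance. A sketch (the `eps` floor, which keeps 1/t weights finite, is an assumption):

```python
import torch

def stratified_t(batch_size: int, eps: float = 1e-3) -> torch.Tensor:
    """One diffusion time per example, stratified over [eps, 1): the
    interval is split into `batch_size` equal bins and one t is drawn
    uniformly inside each, reducing the variance of the ELBO estimate
    relative to i.i.d. uniform sampling."""
    u = torch.rand(batch_size)                       # one draw per bin
    t = (torch.arange(batch_size) + u) / batch_size  # stratified in [0, 1)
    return eps + (1.0 - eps) * t                     # keep t away from 0
```

Because the bins are disjoint, the returned times are strictly increasing across the batch.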
other
ELBO-weighted diffusion loss using coefficient 1/t under a linear noise schedule.
parameters: null
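Under a linear schedule alpha_t = 1 - t, the masking probability at time t is t and the continuous-time NELBO weight -alpha'_t / (1 - alpha_t) reduces to 1/t. A sketch of the resulting masked, ELBO-weighted cross-entropy (shapes and per-example normalization are assumptions, not the submission's code):

```python
import torch
import torch.nn.functional as F

def elbo_weighted_loss(logits, targets, mask, t):
    """Masked cross-entropy scaled by 1/t, the continuous-time NELBO
    weight for a linear noise schedule alpha_t = 1 - t.
    logits: (B, L, V); targets: (B, L); mask: (B, L) bool, True where
    the token was replaced by [MASK]; t: (B,) times in (0, 1]."""
    # Per-position NLL; cross_entropy expects (B, V, L).
    nll = F.cross_entropy(logits.transpose(1, 2), targets, reduction="none")
    nll = nll * mask.float()                  # loss only on masked positions
    per_example = nll.sum(dim=1) / mask.float().sum(dim=1).clamp(min=1)
    return ((1.0 / t) * per_example).mean()   # 1/t ELBO weight
```

Small t (few masked tokens) gets a large 1/t weight, which is why the stratified t sampling above matters for variance.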
other
SUBS-style carry-over unmasking: already-unmasked tokens are copied through unchanged, loss is computed only on masked positions, and the predicted distribution assigns zero probability to the mask token.
parameters: null
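The SUBS-style parameterization can be sketched as two tensor operations on the model's output: zero out the mask token's probability, and force already-unmasked positions to keep their token (carry-over). `MASK_ID` is an assumed token id, and this is an illustrative sketch rather than the submission's code:

```python
import torch

MASK_ID = 0  # assumed id of the [MASK] token

def subs_predict(logits, x_t):
    """SUBS-style parameterization of p(x_0 | x_t):
    - zero probability on the mask token (its logit is set to -inf);
    - carry-over unmasking: positions already unmasked in x_t are forced
      to a one-hot on their current token, so only masked positions
      contribute informative loss.
    logits: (B, L, V); x_t: (B, L) partially masked token ids."""
    logits = logits.clone()
    logits[..., MASK_ID] = float("-inf")      # never predict [MASK]
    probs = torch.softmax(logits, dim=-1)
    unmasked = x_t != MASK_ID
    one_hot = torch.nn.functional.one_hot(x_t, probs.size(-1)).float()
    return torch.where(unmasked.unsqueeze(-1), one_hot, probs)
```

Masked positions get a renormalized distribution over real tokens; unmasked positions are deterministic copies of the input.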
Sequence Length
sequence_length
train_length: null
eval_length: null

Novel Contributions

  • First text diffusion submission using a masked diffusion language model (MDLM) for causal LM training
  • Adaptation of continuous-time NELBO masked diffusion training to a causal Transformer
  • Stratified t sampling to reduce ELBO variance
  • ELBO-weighted masked loss with zero probability assigned to the mask token
  • Mixed training with 50% diffusion steps and 50% standard next-token prediction
  • Sliding-window autoregressive evaluation after diffusion-based training