PR #1119 (open)
Notable Non-Record: Text Diffusion (MDLM) — 1.4584 BPB — Masked Diffusion Language Model
by gowtham0992 · View on GitHub
val_bpb: 1.4584
Architecture: Transformer
Optimizer: —
Artifact Size: 11.55 MB
Training Techniques
Quantization
GPTQ
bits: 6
scope: model
Architecture
LeakyReLU
Uses LeakyReLU(0.5)^2 MLP activation in the Transformer.
parameters: {"negative_slope":0.5,"squared":true}
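A minimal sketch of this activation under the stated parameters (negative_slope 0.5, squared output); the function name is illustrative:

```python
def squared_leaky_relu(x: float, negative_slope: float = 0.5) -> float:
    # LeakyReLU(0.5) followed by squaring: squaring makes the output
    # non-negative, so a negative pre-activation x maps to (0.5 * x) ** 2.
    y = x if x >= 0.0 else negative_slope * x
    return y * y
```

In the actual MLP this would be applied elementwise to the hidden tensor; a scalar version is shown only to pin down the math.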
BigramHash
Includes BigramHash as part of the model features.
parameters: null
SmearGate
Includes SmearGate as part of the model features.
parameters: null
Evaluation
sliding window eval
parameters: null
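The card does not give the window or stride, so the following is only a sketch of a generic sliding-window evaluation loop: a fixed context window slides over a long sequence, and each token's NLL is counted exactly once by scoring only the tail of each window. `score_fn`, `window`, and `stride` are illustrative names, not values from the PR:

```python
def sliding_window_nll(score_fn, tokens, window=1024, stride=512):
    # score_fn(chunk, new) is assumed to return the summed NLL of the
    # last `new` tokens of `chunk`, conditioned on the rest of the chunk.
    total = 0.0
    pos = 0
    while pos < len(tokens):
        start = max(0, pos + stride - window)  # context begins here
        chunk = tokens[start:pos + stride]     # at most `window` tokens
        new = min(stride, len(tokens) - pos)   # tokens scored this step
        total += score_fn(chunk, new)
        pos += stride
    return total
```

Dividing `total` by `len(tokens) * ln(2)` would give bits per token for BPB-style reporting.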
Other
other
Masked diffusion language model training adapted for causal language modeling using continuous-time NELBO.
parameters: {"diffusion_enabled":true,"diffusion_mix":0.5}
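Given `diffusion_mix: 0.5` and the stated 50/50 split between diffusion steps and next-token steps, one hypothetical per-step scheduler (the PR does not specify how steps are chosen) looks like:

```python
import random

def pick_objective(rng: random.Random, diffusion_mix: float = 0.5) -> str:
    # Hypothetical scheduler: with probability `diffusion_mix` train this
    # step with the continuous-time masked-diffusion NELBO; otherwise use
    # standard causal next-token prediction.
    return "diffusion" if rng.random() < diffusion_mix else "next_token"
```

A deterministic alternation (even steps diffusion, odd steps next-token) would realize the same 50/50 mix.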
other
Stratified t sampling for diffusion steps to reduce ELBO variance.
parameters: null
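Stratified sampling of the diffusion time t can be sketched as follows: instead of drawing B i.i.d. uniforms, example i draws from the sub-interval [i/B, (i+1)/B), which lowers the variance of the Monte Carlo NELBO estimate (function name illustrative):

```python
import random

def stratified_t(batch_size: int, rng: random.Random) -> list[float]:
    # One t per example, stratified over [0, 1): example i gets a uniform
    # sample from [i/B, (i+1)/B), so the batch always covers the full
    # range of noise levels.
    return [(i + rng.random()) / batch_size for i in range(batch_size)]
```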
other
ELBO-weighted diffusion loss using coefficient 1/t under a linear noise schedule.
parameters: null
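The 1/t coefficient follows from the continuous-time NELBO under a linear schedule: with alpha_t = 1 - t the masking probability at time t is t, and the NELBO weight alpha'_t / (1 - alpha_t) reduces to -1/t, i.e. a positive loss weight of 1/t. A sketch (the clamp value is an assumption, not from the PR):

```python
def diffusion_loss_weight(t: float) -> float:
    # Linear noise schedule alpha_t = 1 - t: the NELBO coefficient
    # alpha'_t / (1 - alpha_t) = -1/t, applied as a 1/t loss weight.
    eps = 1e-4  # hypothetical clamp to keep the weight finite near t = 0
    return 1.0 / max(t, eps)
```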
other
SUBS-style carry-over unmasking with loss computed only on masked positions and zero masking probabilities for the mask token.
parameters: null
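The two SUBS simplifications can be sketched as follows; `MASK_ID` is a hypothetical vocabulary id, not the PR's actual value. "Zero masking probabilities" means the model may never predict the mask token itself, so its logit is forced to -inf; "carry-over unmasking" means unmasked tokens pass through unchanged, so the NELBO reduces to cross-entropy on the masked positions only:

```python
import math

MASK_ID = 3  # hypothetical id of the [MASK] token

def subs_logits(logits: list[float]) -> list[float]:
    # Zero masking probabilities: force p(x0 = [MASK]) = 0 by setting the
    # mask token's logit to -inf before the softmax.
    out = list(logits)
    out[MASK_ID] = -math.inf
    return out

def loss_positions(xt: list[int]) -> list[bool]:
    # Carry-over unmasking: unmasked tokens in xt are copied through by
    # the reverse process, so loss is computed only where xt is [MASK].
    return [tok == MASK_ID for tok in xt]
```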
Sequence Length
sequence_length
train_length: null
eval_length: null
Novel Contributions
- First text diffusion submission using a masked diffusion language model (MDLM) for causal LM training
- Adaptation of continuous-time NELBO masked diffusion training to a causal Transformer
- Stratified t sampling to reduce ELBO variance
- ELBO-weighted masked loss with zero mask probabilities
- Mixed training with 50% diffusion steps and 50% standard next-token prediction
- Sliding-window autoregressive evaluation after diffusion-based training