PR #1119 (open)

Notable Non-Record: Text Diffusion (MDLM) — 1.4584 BPB — Masked Diffusion Language Model

by gowtham0992

val_bpb: 1.4584
Architecture: Transformer
Optimizer:
Artifact Size: 11.55 MB

Training Techniques

Quantization
GPTQ
bits: 6
scope: model
Architecture
LeakyReLU
Uses a squared LeakyReLU MLP activation, LeakyReLU(x; negative_slope=0.5)^2, in the Transformer.
parameters: {"negative_slope":0.5,"squared":true}
BigramHash
Includes BigramHash as part of the model features.
parameters: null
SmearGate
Includes SmearGate as part of the model features.
parameters: null
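The squared-LeakyReLU activation listed above is simple to state in code. A minimal PyTorch sketch (the module name is illustrative, not the submission's code):

```python
import torch
import torch.nn as nn

class SquaredLeakyReLU(nn.Module):
    """LeakyReLU with negative_slope=0.5, followed by squaring,
    used as the MLP activation in the Transformer."""

    def __init__(self, negative_slope: float = 0.5):
        super().__init__()
        self.act = nn.LeakyReLU(negative_slope)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Negative inputs are halved, then everything is squared,
        # so the output is non-negative on both sides.
        return self.act(x) ** 2
```

Note that squaring makes the function non-monotonic: a negative input of -2 maps to (-1)^2 = 1.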
Evaluation
sliding window eval
parameters: null
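Sliding-window evaluation scores a long sequence with overlapping windows so every token is predicted with substantial left context. A hedged sketch, assuming a stride-based scheme in which each token is scored exactly once (`per_token_nll`, `window`, and `stride` are illustrative names, not the submission's API):

```python
import math

def sliding_window_bpb(per_token_nll, tokens, window=1024, stride=512,
                       bytes_per_token=1.0):
    """Score `tokens` with overlapping windows of length `window`,
    advancing by `stride`. Each token is counted once, in the window
    where it has the most left context. `per_token_nll(chunk)` returns
    one natural-log NLL per position in `chunk`. Returns bits per byte."""
    total_nll, n_scored, prev_end = 0.0, 0, 0
    for begin in range(0, len(tokens), stride):
        end = min(begin + window, len(tokens))
        nlls = per_token_nll(tokens[begin:end])
        new = end - prev_end              # positions not yet scored
        total_nll += sum(nlls[-new:])     # score only the new tail
        n_scored += new
        prev_end = end
        if end == len(tokens):
            break
    # nats/token -> bits/byte
    return total_nll / n_scored / bytes_per_token / math.log(2)
```

With a byte-level tokenizer (bytes_per_token=1) and a uniform model over 256 symbols, this returns exactly 8 bits per byte, which is a useful sanity check.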
Other
other
Masked diffusion language model training adapted for causal language modeling using continuous-time NELBO.
parameters: {"diffusion_enabled":true,"diffusion_mix":0.5}
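With diffusion_mix at 0.5, training can be read as a per-step coin flip between the diffusion NELBO objective and standard next-token prediction. A minimal sketch; the two loss functions are placeholders, since the PR metadata does not include the training code:

```python
import random

def training_step(batch, diffusion_loss_fn, ar_loss_fn,
                  diffusion_mix=0.5, rng=random):
    """With probability `diffusion_mix`, take a masked-diffusion NELBO
    step; otherwise take a standard next-token prediction step on the
    same batch. Returns the loss and which branch was taken."""
    if rng.random() < diffusion_mix:
        return diffusion_loss_fn(batch), "diffusion"
    return ar_loss_fn(batch), "autoregressive"
```

Over many steps roughly half the updates come from each objective, matching diffusion_mix = 0.5.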
other
Stratified t sampling for diffusion steps to reduce ELBO variance.
parameters: null
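Stratified t sampling splits the unit interval into one bin per batch element and draws one diffusion time uniformly inside each bin, so every batch covers the full range of noise levels instead of clustering by chance. A sketch (the `eps` floor, which keeps 1/t weights finite, is an assumption):

```python
import torch

def stratified_t(batch_size: int, eps: float = 1e-3) -> torch.Tensor:
    """One diffusion time per example, stratified over [eps, 1): the
    interval is split into `batch_size` equal bins and one t is drawn
    uniformly inside each, reducing the variance of the ELBO estimate
    relative to i.i.d. uniform sampling."""
    u = torch.rand(batch_size)                       # one draw per bin
    t = (torch.arange(batch_size) + u) / batch_size  # stratified in [0, 1)
    return eps + (1.0 - eps) * t                     # keep t away from 0
```

Because the bins are disjoint, the returned times are strictly increasing across the batch.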
other
ELBO-weighted diffusion loss using coefficient 1/t under a linear noise schedule.
parameters: null
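Under a linear schedule alpha_t = 1 - t, the masking probability at time t is t and the continuous-time NELBO weight -alpha'_t / (1 - alpha_t) reduces to 1/t. A sketch of the resulting masked, ELBO-weighted cross-entropy (shapes and per-example normalization are assumptions, not the submission's code):

```python
import torch
import torch.nn.functional as F

def elbo_weighted_loss(logits, targets, mask, t):
    """Masked cross-entropy scaled by 1/t, the continuous-time NELBO
    weight for a linear noise schedule alpha_t = 1 - t.
    logits: (B, L, V); targets: (B, L); mask: (B, L) bool, True where
    the token was replaced by [MASK]; t: (B,) times in (0, 1]."""
    # Per-position NLL; cross_entropy expects (B, V, L).
    nll = F.cross_entropy(logits.transpose(1, 2), targets, reduction="none")
    nll = nll * mask.float()                  # loss only on masked positions
    per_example = nll.sum(dim=1) / mask.float().sum(dim=1).clamp(min=1)
    return ((1.0 / t) * per_example).mean()   # 1/t ELBO weight
```

Small t (few masked tokens) gets a large 1/t weight, which is why the stratified t sampling above matters for variance.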
other
SUBS-style carry-over unmasking: already-unmasked tokens are copied through unchanged, loss is computed only on masked positions, and the predicted distribution assigns zero probability to the mask token.
parameters: null
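The SUBS-style parameterization can be sketched as two tensor operations on the model's output: zero out the mask token's probability, and force already-unmasked positions to keep their token (carry-over). `MASK_ID` is an assumed token id, and this is an illustrative sketch rather than the submission's code:

```python
import torch

MASK_ID = 0  # assumed id of the [MASK] token

def subs_predict(logits, x_t):
    """SUBS-style parameterization of p(x_0 | x_t):
    - zero probability on the mask token (its logit is set to -inf);
    - carry-over unmasking: positions already unmasked in x_t are forced
      to a one-hot on their current token, so only masked positions
      contribute informative loss.
    logits: (B, L, V); x_t: (B, L) partially masked token ids."""
    logits = logits.clone()
    logits[..., MASK_ID] = float("-inf")      # never predict [MASK]
    probs = torch.softmax(logits, dim=-1)
    unmasked = x_t != MASK_ID
    one_hot = torch.nn.functional.one_hot(x_t, probs.size(-1)).float()
    return torch.where(unmasked.unsqueeze(-1), one_hot, probs)
```

Masked positions get a renormalized distribution over real tokens; unmasked positions are deterministic copies of the input.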
Sequence Length
sequence_length
train_length: null
eval_length: null

Novel Contributions

  • First text diffusion submission using a masked diffusion language model (MDLM) for causal LM training
  • Adaptation of continuous-time NELBO masked diffusion training to a causal Transformer
  • Stratified t sampling to reduce ELBO variance
  • ELBO-weighted masked loss with zero probability assigned to the mask token
  • Mixed training with 50% diffusion steps and 50% standard next-token prediction
  • Sliding-window autoregressive evaluation after diffusion-based training