PR #1053

open

Submission/2026 03 28 masked diffusion

by ikermoel
val_bpb
1.3600
Architecture
Transformer
Optimizer
Artifact Size
~12.9MB

Training Techniques

Architecture
bidirectional attention
Uses bidirectional attention during training so each masked token can attend to every other token in the sequence, rather than only to preceding tokens as in causal attention.
parameters: null
Other
other
Discrete masked diffusion language model (MDLM) training objective with masked token prediction and pseudo-log-likelihood evaluation.
parameters: {"mask_rate_range":[0.15,0.85],"eval_mask_rate":0.5,"eval_passes":8}
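The training objective above can be sketched in plain Python: sample a per-sequence mask rate uniformly from the configured range, corrupt the chosen positions with a mask token, and average cross-entropy only over the masked positions. The `MASK_ID` constant and the function names are hypothetical, not taken from the submission's code.

```python
import math
import random

MASK_ID = 0  # hypothetical mask token id; the actual id is model-specific

def mask_tokens(tokens, lo=0.15, hi=0.85, rng=random):
    """Sample a per-sequence mask rate uniformly in [lo, hi] and replace
    the chosen positions with MASK_ID.

    Returns (corrupted sequence, list of masked position indices).
    """
    rate = rng.uniform(lo, hi)
    corrupted, masked = [], []
    for i, tok in enumerate(tokens):
        if rng.random() < rate:
            corrupted.append(MASK_ID)
            masked.append(i)
        else:
            corrupted.append(tok)
    return corrupted, masked

def masked_ce_loss(logprobs, targets, masked_positions):
    """Cross-entropy averaged over masked positions only; unmasked
    positions contribute nothing to the loss, matching the objective above."""
    if not masked_positions:
        return 0.0
    return -sum(logprobs[i][targets[i]] for i in masked_positions) / len(masked_positions)
```

In a real training loop, `logprobs` would come from a forward pass of the bidirectional transformer over the corrupted sequence; here it is just any per-position distribution over the vocabulary.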

Novel Contributions

  • Discrete masked diffusion language model (MDLM)
  • Bidirectional attention during training
  • Masked token prediction with CE loss only on masked positions
  • Pseudo-log-likelihood evaluation using multiple masked forward passes
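The pseudo-log-likelihood evaluation listed above can be sketched as follows: run several independently masked forward passes (here `eval_mask_rate=0.5`, `eval_passes=8` as in the parameters), score each masked token's true id under the model's output, and average. This is a minimal sketch assuming `model(corrupted)` returns per-position log-probabilities; `MASK_ID` and the function names are hypothetical.

```python
import math
import random

MASK_ID = 0  # hypothetical mask token id

def pseudo_log_likelihood(model, tokens, mask_rate=0.5, passes=8, rng=None):
    """Estimate the average log-likelihood per token by masking a random
    subset of positions on each pass and scoring the true tokens there."""
    rng = rng or random.Random(0)
    total, count = 0.0, 0
    for _ in range(passes):
        corrupted = list(tokens)
        masked = [i for i in range(len(tokens)) if rng.random() < mask_rate]
        if not masked:
            continue  # degenerate pass with nothing masked; skip it
        for i in masked:
            corrupted[i] = MASK_ID
        logprobs = model(corrupted)  # assumed: log-probs per position
        total += sum(logprobs[i][tokens[i]] for i in masked)
        count += len(masked)
    return total / max(count, 1)

def to_bits(avg_logprob):
    """Convert average natural-log likelihood per token to bits per token
    (a bits-per-byte score would further normalize by bytes per token)."""
    return -avg_logprob / math.log(2)
```

Averaging over multiple passes reduces the variance introduced by the random choice of masked positions, at the cost of `eval_passes` forward passes per sequence.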