PR #1100

closed

Non-record: LLaDA-MDLM Diffusion — val_bpb 1.1465 (first diffusion model to beat the AR baseline)

by agalimova
val_bpb: 1.1465
Architecture: Transformer
Optimizer: AdamW

Training Techniques

Architecture

  • bidirectional transformer: 11-layer, 512-dim bidirectional transformer (8 heads) used as a masked diffusion language model (one block is sketched after this list)
  • adaLN: timestep conditioning via adaptive layer norm / sigma embeddings modulating each layer
  • RoPE: rotary positional embeddings
  • ReLU²: squared-ReLU MLP activation
  • frozen visible-token logits: visible-token logits frozen under the substitution parameterization for MDLM
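
A minimal sketch of one block of the architecture above, assuming a standard pre-norm DiT-style wiring: bidirectional attention with RoPE, a squared-ReLU MLP, and adaLN scale/shift/gate conditioning from a sigma/timestep embedding. The class names, the 4x MLP width, and the shape of `sigma_emb` are assumptions, not the submission's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def apply_rope(x: torch.Tensor) -> torch.Tensor:
    """Rotary positional embeddings over the head dimension (standard RoPE)."""
    b, h, t, d = x.shape
    half = d // 2
    freqs = 1.0 / (10000 ** (torch.arange(half, device=x.device) / half))
    angles = torch.arange(t, device=x.device)[:, None] * freqs[None, :]  # (t, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


class DiffusionBlock(nn.Module):
    """One 512-dim, 8-head block: adaLN-modulated attention + squared-ReLU MLP."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.heads = heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)
        self.fc1 = nn.Linear(dim, 4 * dim, bias=False)  # 4x width is an assumption
        self.fc2 = nn.Linear(4 * dim, dim, bias=False)
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        # adaLN: the sigma/timestep embedding produces per-layer scale/shift/gate.
        self.ada = nn.Linear(dim, 6 * dim, bias=True)

    def forward(self, x: torch.Tensor, sigma_emb: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        s1, b1, g1, s2, b2, g2 = self.ada(sigma_emb).chunk(6, dim=-1)
        # Attention sublayer; no causal mask, since the model is bidirectional.
        h = self.norm1(x) * (1 + s1[:, None]) + b1[:, None]
        q, k, v = self.qkv(h).chunk(3, dim=-1)
        q = apply_rope(q.view(b, t, self.heads, -1).transpose(1, 2))
        k = apply_rope(k.view(b, t, self.heads, -1).transpose(1, 2))
        v = v.view(b, t, self.heads, -1).transpose(1, 2)
        a = F.scaled_dot_product_attention(q, k, v)
        x = x + g1[:, None] * self.proj(a.transpose(1, 2).reshape(b, t, d))
        # Squared-ReLU MLP sublayer.
        h = self.norm2(x) * (1 + s2[:, None]) + b2[:, None]
        x = x + g2[:, None] * self.fc2(F.relu(self.fc1(h)) ** 2)
        return x
```

In DiT-style models the adaLN gates (`g1`, `g2`) are often zero-initialized so each block starts as the identity; whether this submission does that is not stated.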
Optimizer

AdamW (lr: 0.0006, warmup_steps: 300)
LR Schedule

warmdown (warmdown_steps: 1500)
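
A plausible reading of the settings above is a trapezoidal schedule: 300 steps of linear warmup, a constant plateau, then a 1500-step linear warmdown to zero. Only the two step counts come from the entry; the plateau shape and `total_steps` value are assumptions.

```python
def lr_multiplier(step: int, total_steps: int,
                  warmup_steps: int = 300, warmdown_steps: int = 1500) -> float:
    """LR scale factor: linear warmup -> constant -> linear warmdown to zero."""
    if step < warmup_steps:                      # linear warmup from 0
        return step / warmup_steps
    if step > total_steps - warmdown_steps:      # linear warmdown to 0
        return max(0.0, (total_steps - step) / warmdown_steps)
    return 1.0                                   # constant plateau

# e.g. with torch.optim.lr_scheduler.LambdaLR (total_steps=5000 is hypothetical):
#   sched = LambdaLR(opt, lambda s: lr_multiplier(s, total_steps=5000))
```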
Sequence Length

train_length: 2048
Regularization

weight decay
Other

  • Masked diffusion language modeling with the MDLM continuous-time ELBO and a log-linear noise schedule (a training-loss sketch follows this list)
  • Antithetic time sampling to reduce the variance of the ELBO estimate
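
The two items above combine into a fairly compact training loss. The sketch below assumes the MDLM continuous-time ELBO with a log-linear schedule alpha_t = 1 - (1 - eps) * t, under which the per-example weight reduces to 1/t and only masked positions contribute (visible-token logits are frozen under the substitution parameterization, so they add nothing to the bound). `MASK_ID`, the `model(z, t)` signature, and the exact role of epsilon are assumptions; if epsilon enters the schedule this way, eps = 0.1 keeps roughly 10% of tokens visible even at t = 1.

```python
import torch
import torch.nn.functional as F

MASK_ID = 0   # assumed id of the absorbing [MASK] token
EPS = 0.1     # masking epsilon; the entry reports 0.1 working best


def sample_times(batch_size: int, device) -> torch.Tensor:
    """Low-variance (antithetic/stratified) times: one uniform draw,
    spread evenly across the batch so t covers (0, 1]."""
    u = torch.rand(1, device=device)
    return ((u + torch.arange(batch_size, device=device)) / batch_size).clamp_min(1e-4)


def mdlm_loss(model, x: torch.Tensor) -> torch.Tensor:
    """Continuous-time MDLM ELBO with log-linear schedule alpha_t = 1 - (1-EPS) t."""
    b, seq = x.shape
    t = sample_times(b, x.device)                      # (b,)
    mask_prob = (1 - EPS) * t                          # = 1 - alpha_t
    masked = torch.rand(b, seq, device=x.device) < mask_prob[:, None]
    z = torch.where(masked, torch.full_like(x, MASK_ID), x)
    logits = model(z, t)                               # (b, seq, vocab); assumed API
    ce = F.cross_entropy(logits.transpose(1, 2), x, reduction="none")  # (b, seq)
    # ELBO weight -alpha_t' / (1 - alpha_t) = 1/t; only masked positions count.
    per_seq = (ce * masked).sum(dim=1) / t
    return per_seq.mean() / seq                        # nats per token
```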
Evaluation

discrete absorbing-mask ELBO (eval_steps: 512)
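
A sketch of what a discrete absorbing-mask ELBO evaluation could look like: sweep a fixed grid of eval_steps = 512 timesteps and accumulate the discrete-time ELBO term at each, rather than Monte Carlo sampling t. The simplified schedule alpha_t = 1 - t (epsilon omitted), `MASK_ID`, and the model signature are assumptions; the masks themselves are still random draws, so in practice one might average over several.

```python
import math
import torch
import torch.nn.functional as F

MASK_ID = 0  # assumed id of the absorbing [MASK] token


@torch.no_grad()
def discrete_elbo_bits_per_token(model, x: torch.Tensor,
                                 eval_steps: int = 512) -> float:
    """Bits per token under a T-step absorbing-mask ELBO with alpha_t = 1 - t."""
    b, seq = x.shape
    total = torch.zeros((), device=x.device)
    for i in range(1, eval_steps + 1):
        t = torch.full((b,), i / eval_steps, device=x.device)
        masked = torch.rand(b, seq, device=x.device) < t[:, None]
        z = torch.where(masked, torch.full_like(x, MASK_ID), x)
        logits = model(z, t)                           # (b, seq, vocab); assumed API
        ce = F.cross_entropy(logits.transpose(1, 2), x, reduction="none")
        # Discrete-time weight (alpha_s - alpha_t) / (1 - alpha_t) = (1/T) / t.
        total += ((ce * masked).sum(dim=1) / (eval_steps * t)).mean()
    return (total / (seq * math.log(2.0))).item()      # nats -> bits
```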

Novel Contributions

  • First discrete diffusion model to beat the AR baseline in parameter-golf
  • Use of MDLM masked diffusion with a log-linear noise schedule
  • Proper discrete absorbing-mask ELBO evaluation instead of Monte Carlo ELBO sampling
  • Finding that a higher masking epsilon (0.1) substantially improves diffusion LM performance
  • Observation that wider architectures outperform deeper ones at a fixed parameter count
  • Demonstration that several AR tricks do not transfer well to diffusion models