PR #1106 (open)

Non-record: MDLM Diffusion — val_var_bpb 1.1465 (first diffusion to beat AR baseline)

by agalimova
val_bpb
1.1465
Architecture
Transformer
Optimizer
AdamW
Artifact Size

Training Techniques

Architecture
ReLU²
Uses squared ReLU activation in the MLP.
parameters: null
RoPE
Uses rotary positional embeddings.
parameters: null
weight tying
Ties token-embedding and output-head weights; visible (unmasked) token logits are frozen in the MDLM setup.
parameters: null
MLP3x
Uses a 3x MLP expansion.
parameters: null
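A minimal sketch of how the ReLU² and MLP3x tags could combine in one MLP block (class name and layout are hypothetical; RoPE and weight tying are handled elsewhere in the model and omitted here):

```python
import torch
import torch.nn as nn

class ReLU2MLP(nn.Module):
    """Illustrative MLP block: 3x hidden expansion (MLP3x) with a
    squared-ReLU activation (ReLU²). Names are assumptions, not the PR's code."""
    def __init__(self, d_model: int):
        super().__init__()
        self.fc_in = nn.Linear(d_model, 3 * d_model)   # MLP3x: 3x expansion
        self.fc_out = nn.Linear(3 * d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc_out(torch.relu(self.fc_in(x)) ** 2)  # ReLU² activation
```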
Other
other
Masked diffusion language model (MDLM) training with log-linear noise schedule and discrete absorbing-mask ELBO evaluation.
parameters: null
other
AdaLN timestep conditioning via sigma embeddings modulating each layer.
parameters: null
other
Antithetic time sampling during training/evaluation.
parameters: null
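The log-linear noise schedule and antithetic time sampling above can be sketched as follows (function names and the exact eps handling are assumptions in MDLM style; AdaLN conditioning is not shown):

```python
import torch

def log_linear_sigma(t: torch.Tensor, eps: float = 1e-3) -> torch.Tensor:
    """Log-linear noise schedule sketch: sigma(t) = -log(1 - (1 - eps) * t),
    so the per-token mask probability 1 - exp(-sigma(t)) equals (1 - eps) * t."""
    return -torch.log1p(-(1 - eps) * t)

def antithetic_times(batch_size: int) -> torch.Tensor:
    """Antithetic (low-discrepancy) time sampling: one shared uniform offset,
    strided evenly across the batch so timesteps cover [0, 1) uniformly."""
    u = torch.rand(())
    return (u + torch.arange(batch_size) / batch_size) % 1.0
```

Striding a single random offset across the batch reduces the variance of the ELBO estimate relative to drawing an independent timestep per example.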
Evaluation
discrete ELBO eval
parameters: {"steps":512}
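One way the fixed-grid ("discrete") ELBO with 512 steps could look, as opposed to a Monte-Carlo estimate over random timesteps. This is a hedged sketch: it assumes a linear masking schedule alpha(t) = 1 - t, and `model_ce` is a hypothetical callable returning mean cross-entropy in nats/token on tokens masked at noise level t:

```python
import math

def discrete_elbo_bpt(model_ce, T: int = 512) -> float:
    """Fixed-grid discrete ELBO sketch for an absorbing-mask diffusion LM.
    Evaluates the model on the deterministic grid t_i = i/T instead of
    sampling timesteps, which removes MC variance from the estimate."""
    nats = 0.0
    for i in range(1, T + 1):
        t = i / T
        # Per-step weight (alpha_s - alpha_t) / (1 - alpha_t) with s = t - 1/T
        # and alpha(t) = 1 - t, which simplifies to (1/T) / t.
        nats += (1.0 / T) / t * model_ce(t)
    return nats / math.log(2.0)  # nats -> bits per token
```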
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmup_steps":300,"warmdown_steps":1500}
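The warmup/warmdown schedule with the reported parameters (warmup_steps=300, warmdown_steps=1500) can be sketched as a trapezoid: linear warmup, constant plateau, then linear decay to zero. The function name and the total-step argument are illustrative:

```python
def trapezoid_lr(step: int, base_lr: float, total_steps: int,
                 warmup_steps: int = 300, warmdown_steps: int = 1500) -> float:
    """Warmup -> constant -> linear warmdown learning-rate schedule (sketch)."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps          # linear warmup
    if step >= total_steps - warmdown_steps:
        return base_lr * (total_steps - step) / warmdown_steps  # linear decay
    return base_lr                                          # constant plateau
```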
Regularization
weight decay
parameters: null

Novel Contributions

  • First discrete diffusion model to beat the AR baseline in parameter-golf.
  • Achieves 1.1465 val_var_bpb with MDLM diffusion.
  • Shows that discrete ELBO evaluation can dramatically outperform MC ELBO on the same model.
  • Identifies masking epsilon 0.1 as a major improvement over 0.001.
  • Demonstrates that some AR-oriented tricks such as LeakyReLU^2 and BigramHash do not transfer well to diffusion LMs.