PR #1106 (open)

Non-record: MDLM Diffusion — val_var_bpb 1.1465 (first diffusion to beat AR baseline)

by agalimova
val_bpb
1.1465
Architecture
Transformer
Optimizer
AdamW
Artifact Size

Training Techniques

Architecture
ReLU²
Uses squared ReLU activation in the MLP.
parameters: null
RoPE
Uses rotary positional embeddings.
parameters: null
weight tying
Ties token-embedding and output-head weights; visible (unmasked) token logits are frozen in the MDLM setup.
parameters: null
MLP3x
Uses a 3x MLP expansion.
parameters: null
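A minimal sketch of how the ReLU² and MLP3x tags could combine in one MLP block (class name and layout are hypothetical; RoPE and weight tying are handled elsewhere in the model and omitted here):

```python
import torch
import torch.nn as nn

class ReLU2MLP(nn.Module):
    """Illustrative MLP block: 3x hidden expansion (MLP3x) with a
    squared-ReLU activation (ReLU²). Names are assumptions, not the PR's code."""
    def __init__(self, d_model: int):
        super().__init__()
        self.fc_in = nn.Linear(d_model, 3 * d_model)   # MLP3x: 3x expansion
        self.fc_out = nn.Linear(3 * d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc_out(torch.relu(self.fc_in(x)) ** 2)  # ReLU² activation
```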
Other
other
Masked diffusion language model (MDLM) training with log-linear noise schedule and discrete absorbing-mask ELBO evaluation.
parameters: null
other
AdaLN timestep conditioning via sigma embeddings modulating each layer.
parameters: null
other
Antithetic time sampling during training/evaluation.
parameters: null
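The log-linear noise schedule and antithetic time sampling above can be sketched as follows (function names and the exact eps handling are assumptions in MDLM style; AdaLN conditioning is not shown):

```python
import torch

def log_linear_sigma(t: torch.Tensor, eps: float = 1e-3) -> torch.Tensor:
    """Log-linear noise schedule sketch: sigma(t) = -log(1 - (1 - eps) * t),
    so the per-token mask probability 1 - exp(-sigma(t)) equals (1 - eps) * t."""
    return -torch.log1p(-(1 - eps) * t)

def antithetic_times(batch_size: int) -> torch.Tensor:
    """Antithetic (low-discrepancy) time sampling: one shared uniform offset,
    strided evenly across the batch so timesteps cover [0, 1) uniformly."""
    u = torch.rand(())
    return (u + torch.arange(batch_size) / batch_size) % 1.0
```

Striding a single random offset across the batch reduces the variance of the ELBO estimate relative to drawing an independent timestep per example.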
Evaluation
discrete ELBO eval
parameters: {"steps":512}
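One way the fixed-grid ("discrete") ELBO with 512 steps could look, as opposed to a Monte-Carlo estimate over random timesteps. This is a hedged sketch: it assumes a linear masking schedule alpha(t) = 1 - t, and `model_ce` is a hypothetical callable returning mean cross-entropy in nats/token on tokens masked at noise level t:

```python
import math

def discrete_elbo_bpt(model_ce, T: int = 512) -> float:
    """Fixed-grid discrete ELBO sketch for an absorbing-mask diffusion LM.
    Evaluates the model on the deterministic grid t_i = i/T instead of
    sampling timesteps, which removes MC variance from the estimate."""
    nats = 0.0
    for i in range(1, T + 1):
        t = i / T
        # Per-step weight (alpha_s - alpha_t) / (1 - alpha_t) with s = t - 1/T
        # and alpha(t) = 1 - t, which simplifies to (1/T) / t.
        nats += (1.0 / T) / t * model_ce(t)
    return nats / math.log(2.0)  # nats -> bits per token
```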
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmup_steps":300,"warmdown_steps":1500}
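The warmup/warmdown schedule with the reported parameters (warmup_steps=300, warmdown_steps=1500) can be sketched as a trapezoid: linear warmup, constant plateau, then linear decay to zero. The function name and the total-step argument are illustrative:

```python
def trapezoid_lr(step: int, base_lr: float, total_steps: int,
                 warmup_steps: int = 300, warmdown_steps: int = 1500) -> float:
    """Warmup -> constant -> linear warmdown learning-rate schedule (sketch)."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps          # linear warmup
    if step >= total_steps - warmdown_steps:
        return base_lr * (total_steps - step) / warmdown_steps  # linear decay
    return base_lr                                          # constant plateau
```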
Regularization
weight decay
parameters: null

Novel Contributions

  • First discrete diffusion model to beat the AR baseline in parameter-golf.
  • Achieves 1.1465 val_var_bpb with MDLM diffusion.
  • Shows that discrete ELBO evaluation can dramatically outperform MC ELBO on the same model.
  • Identifies masking epsilon 0.1 as a major improvement over 0.001.
  • Demonstrates that some AR-oriented tricks such as LeakyReLU^2 and BigramHash do not transfer well to diffusion LMs.