PR #820

open

[non-record] Masked Diffusion Language Model (val_var_bpb=1.625)

by mtybadger
val_bpb: 1.6252
Architecture: Bidirectional Transformer
Optimizer:
Artifact Size: 15,379,114 bytes

Training Techniques

Architecture
bidirectional transformer
Replaces the autoregressive causal next-token model with a masked diffusion language model using bidirectional denoising and iterative sampling.
parameters: {"layers":9,"model_dim":512,"num_heads":8,"num_kv_heads":4,"mlp_mult":2}
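
A minimal configuration sketch matching the listed parameters; the class and field names are illustrative, not taken from the submission's code.

```python
from dataclasses import dataclass

@dataclass
class DenoiserConfig:
    layers: int = 9          # transformer blocks
    model_dim: int = 512     # residual stream width
    num_heads: int = 8       # query heads
    num_kv_heads: int = 4    # shared key/value heads (GQA)
    mlp_mult: int = 2        # MLP hidden dim = mlp_mult * model_dim
    cond_dim: int = 128      # adaLN timestep-embedding width (see below)
    train_length: int = 256  # training sequence length

cfg = DenoiserConfig()
```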
KV head count
Uses GQA-style grouped query attention with fewer KV heads than attention heads.
parameters: {"num_heads":8,"num_kv_heads":4}
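
A minimal sketch of grouped-query attention with 8 query heads sharing 4 key/value heads, run bidirectionally (no causal mask) as the denoiser requires; module and variable names are illustrative, not from the submission.

```python
import torch
import torch.nn.functional as F
from torch import nn

class GroupedQueryAttention(nn.Module):
    def __init__(self, dim=512, num_heads=8, num_kv_heads=4):
        super().__init__()
        self.num_heads, self.num_kv_heads = num_heads, num_kv_heads
        self.head_dim = dim // num_heads
        self.q_proj = nn.Linear(dim, num_heads * self.head_dim, bias=False)
        self.kv_proj = nn.Linear(dim, 2 * num_kv_heads * self.head_dim, bias=False)
        self.out_proj = nn.Linear(dim, dim, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        k, v = self.kv_proj(x).view(b, t, 2, self.num_kv_heads, self.head_dim).permute(2, 0, 3, 1, 4)
        # Each group of num_heads // num_kv_heads query heads shares one KV head.
        rep = self.num_heads // self.num_kv_heads
        k, v = k.repeat_interleave(rep, dim=1), v.repeat_interleave(rep, dim=1)
        # Bidirectional attention: no causal mask, unlike the GPT baseline.
        y = F.scaled_dot_product_attention(q, k, v, is_causal=False)
        return self.out_proj(y.transpose(1, 2).reshape(b, t, -1))

x = torch.randn(2, 256, 512)
print(GroupedQueryAttention()(x).shape)  # torch.Size([2, 256, 512])
```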
adaLN timestep conditioning
Conditions the denoiser on the diffusion timestep via adaLN-style modulation.
parameters: {"cond_dim":128}
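
A sketch of adaLN-style modulation, assuming a cond_dim=128 timestep embedding that produces a per-channel shift, scale, and residual gate for each block (a common DiT-style pattern); the submission's exact wiring may differ.

```python
import torch
from torch import nn

class AdaLN(nn.Module):
    def __init__(self, dim=512, cond_dim=128):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        # Timestep embedding -> per-channel shift, scale, and residual gate.
        self.to_mod = nn.Sequential(nn.SiLU(), nn.Linear(cond_dim, 3 * dim))
        nn.init.zeros_(self.to_mod[-1].weight)  # start as identity modulation
        nn.init.zeros_(self.to_mod[-1].bias)

    def forward(self, x, t_emb):
        shift, scale, gate = self.to_mod(t_emb).unsqueeze(1).chunk(3, dim=-1)
        return self.norm(x) * (1 + scale) + shift, gate

x = torch.randn(2, 256, 512)   # token activations
t_emb = torch.randn(2, 128)    # per-sequence timestep embedding
h, gate = AdaLN()(x, t_emb)
print(h.shape, gate.shape)     # both torch.Size([2, 256, 512])
```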
Sequence Length
sequence_length
train_length: 256
eval_length: null
Regularization
dropout
parameters: {"rate":0}
Quantization
mixed int6/int8
bits: 6
scope: model weights
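
One plausible reading of the mixed int6/int8 scheme, sketched as symmetric per-channel weight quantization at a chosen bit width; how tensors are packed and which layers get 6 vs. 8 bits are not specified here.

```python
import torch

def quantize_symmetric(w: torch.Tensor, bits: int):
    """Quantize a 2D weight matrix per output channel to signed `bits`-bit ints."""
    qmax = 2 ** (bits - 1) - 1                              # 31 for int6, 127 for int8
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor):
    return q.to(torch.float32) * scale

w = torch.randn(512, 512)
q6, s6 = quantize_symmetric(w, bits=6)
err = (dequantize(q6, s6) - w).abs().mean()
print(f"mean abs int6 reconstruction error: {err:.4f}")
```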
Compression
zstd
level: 22
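
A minimal sketch of packing serialized weights with zstd at level 22, assuming the standard zstandard Python bindings; the submission's actual container format is not specified here.

```python
import io
import torch
import zstandard as zstd

state = {"w": torch.randint(-32, 32, (512, 512), dtype=torch.int8)}  # stand-in for quantized weights
buf = io.BytesIO()
torch.save(state, buf)

compressed = zstd.ZstdCompressor(level=22).compress(buf.getvalue())
print(f"{buf.getbuffer().nbytes} -> {len(compressed)} bytes")

restored = torch.load(io.BytesIO(zstd.ZstdDecompressor().decompress(compressed)))
```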
Evaluation
variational bound evaluation with discrete absorbing-mask process
parameters: {"var_eval_steps":32}
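
A hedged sketch of how a 32-step discretized variational bound could be computed for an absorbing-mask diffusion LM with a linear schedule (alpha_t = 1 - t); the denoiser signature, mask-token id, and normalization are assumptions, and the result here is bits per token rather than per byte.

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def variational_bound(denoiser, tokens, mask_id, steps=32):
    """Upper-bound bits per token via a discretized continuous-time ELBO."""
    total_bits, total_tokens = 0.0, tokens.numel()
    for i in range(steps):
        t = (i + 0.5) / steps                      # midpoint of each time bin
        masked = torch.rand_like(tokens, dtype=torch.float) < t
        z_t = torch.where(masked, torch.full_like(tokens, mask_id), tokens)
        logits = denoiser(z_t, torch.full((tokens.size(0),), t))
        ce = F.cross_entropy(logits.flatten(0, 1), tokens.flatten(), reduction="none")
        # Linear schedule alpha_t = 1 - t gives a 1/t weight on masked positions.
        nats = (ce * masked.flatten().float()).sum() / t
        total_bits += nats.item() / math.log(2) / steps
    return total_bits / total_tokens  # bits per token (not yet per byte)
```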
Other
other
Masked diffusion language modeling objective with a continuous-time SUBS denoising loss and iterative DDPM-style sampling using a caching sampler.
parameters: {"sampler":"ddpm_cache","sampling_schedule":"linear","sampling_steps":256}
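
A sketch of a SUBS-style training objective: sample a mask rate t, replace tokens with the absorbing mask state, and weight the cross-entropy on masked positions by 1/t under the linear schedule. This mirrors the published MDLM formulation rather than the submission's exact code.

```python
import torch
import torch.nn.functional as F

def masked_diffusion_loss(denoiser, tokens, mask_id):
    b, seq_len = tokens.shape
    t = torch.rand(b, device=tokens.device).clamp(min=1e-3)     # t ~ U(0, 1)
    # Absorbing forward process: each token is masked independently with prob t.
    masked = torch.rand_like(tokens, dtype=torch.float) < t[:, None]
    z_t = torch.where(masked, torch.full_like(tokens, mask_id), tokens)
    logits = denoiser(z_t, t)                                    # bidirectional denoiser
    ce = F.cross_entropy(logits.transpose(1, 2), tokens, reduction="none")
    # Continuous-time weight for the linear schedule: 1 / t per masked token.
    return ((ce * masked.float()).sum(dim=1) / t).mean() / seq_len
```

At generation time, a DDPM-style sampler with caching (the listed ddpm_cache) typically avoids re-running the denoiser on steps where no position is newly unmasked, which keeps 256 linear sampling steps affordable.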

Novel Contributions

  • Replaces the autoregressive GPT baseline with a masked diffusion language model.
  • Uses a bidirectional masked denoising objective instead of causal next-token prediction.
  • Introduces timestep-conditioned adaLN denoiser conditioning.
  • Reports a variational BPB metric based on a discrete absorbing-mask upper bound.
  • Fits the submission under the 16MB limit using mixed int6/int8 quantization plus zstd-22 compression.