PR #1241

open

MDLM Diffusion — val_bpb 0.9901, EOS learning + full dataset shard rotation, 33M params, 1x AWS A10G

val_bpb
0.9901
Architecture
Transformer
Optimizer
AdamW
Artifact Size

Training Techniques

Architecture
RoPE
Bidirectional transformer with rotary positional embeddings.
parameters: null
ReLU²
MLP uses ReLU squared activation.
parameters: null
weight tying
Embedding table padded to 1088 entries; no explicit weight tying is mentioned beyond the standard architecture details.
parameters: null
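The ReLU² activation listed above can be sketched as follows; the NumPy implementation and weight shapes are illustrative assumptions, not from the PR:

```python
import numpy as np

def relu2(x):
    # ReLU squared: max(x, 0) ** 2, as used in the MLP.
    return np.maximum(x, 0.0) ** 2

def mlp_block(x, w_in, w_out):
    # Feed-forward block with ReLU^2 activation between the two projections.
    return relu2(x @ w_in) @ w_out
```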
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmup_steps":300,"warmdown_steps":1500,"schedule":"cosine decay"}
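The schedule could be sketched as below, assuming a linear warmup and a cosine warmdown to zero; only the step counts and the "cosine decay" label come from the card, the exact curve is an assumption:

```python
import math

WARMUP_STEPS = 300     # from the card's parameters
WARMDOWN_STEPS = 1500  # from the card's parameters

def lr_scale(step, total_steps, warmup=WARMUP_STEPS, warmdown=WARMDOWN_STEPS):
    # Linear warmup, constant middle, cosine warmdown over the final steps.
    if step < warmup:
        return step / warmup
    if step >= total_steps - warmdown:
        t = (step - (total_steps - warmdown)) / warmdown  # 0 -> 1
        return 0.5 * (1.0 + math.cos(math.pi * t))
    return 1.0
```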
Regularization
weight decay
parameters: {"weight_decay":0.1}
Other
other
EOS token learning: token 1 is never masked and acts as a document-boundary anchor; dedicated PAD token 1025 is excluded from loss and separated from MASK token 1024.
parameters: {"eos_id":1,"pad_id":1025,"mask_id":1024}
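The token-role rules above can be sketched as a corruption step; the function and NumPy usage are illustrative, only the token ids come from the card:

```python
import numpy as np

EOS_ID, MASK_ID, PAD_ID = 1, 1024, 1025  # ids from the card's parameters

def corrupt(tokens, mask_prob, rng):
    # Diffusion-style corruption: randomly replace tokens with MASK_ID,
    # but never mask EOS (document-boundary anchor) or PAD positions.
    tokens = np.asarray(tokens)
    maskable = (tokens != EOS_ID) & (tokens != PAD_ID)
    masked = (rng.random(tokens.shape) < mask_prob) & maskable
    corrupted = np.where(masked, MASK_ID, tokens)
    # PAD is excluded from the loss by construction: it is never masked.
    loss_mask = masked
    return corrupted, loss_mask
```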
other
The shard-rotation training loader keeps only a subset of dataset shards in memory at a time and rotates between shard groups, enabling training on the full FineWeb dataset under limited RAM.
parameters: {"shards_in_memory":4,"rotate_shards":true,"max_train_shards":0}
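A loader along these lines could look like the sketch below; only shards_in_memory=4 comes from the parameters, and this illustrative version makes a single pass over the shard groups rather than cycling:

```python
SHARDS_IN_MEMORY = 4  # from the card's parameters

def rotate_shards(shard_paths, load_fn, group_size=SHARDS_IN_MEMORY):
    # Generator that loads `group_size` shards at a time, yields their
    # examples, then drops the group before loading the next one so that
    # only a bounded number of shards is resident in memory.
    for start in range(0, len(shard_paths), group_size):
        group = [load_fn(p) for p in shard_paths[start:start + group_size]]
        for shard in group:
            yield from shard
        del group  # explicit memory freeing before the next rotation
```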
other
An attention head count sweep found validation BPB effectively invariant across 2, 4, 8, 16, and 32 heads at a fixed model dimension.
parameters: {"head_counts":[2,4,8,16,32],"model_dim":512}
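Since the model dimension is held fixed, the sweep only varies per-head width; a trivial sketch of that relationship (illustrative, not the PR's sweep code):

```python
MODEL_DIM = 512                  # from the card's parameters
HEAD_COUNTS = [2, 4, 8, 16, 32]  # from the card's parameters

def head_dims(model_dim=MODEL_DIM, head_counts=HEAD_COUNTS):
    # Per-head width for each sweep point; total attention width is constant.
    return {h: model_dim // h for h in head_counts if model_dim % h == 0}
```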

Novel Contributions

  • EOS learning with document-boundary anchors during diffusion
  • Dedicated PAD token kept separate from the MASK token to avoid collisions between structural padding and diffusion masking
  • Full FineWeb dataset training via shard rotation with explicit memory freeing and one-at-a-time shard loading
  • Finding that attention head count is effectively invariant for bidirectional diffusion LMs at fixed model dimension