PR #1241

open

MDLM Diffusion — val_bpb 0.9901, EOS learning + full dataset shard rotation, 33M params, 1x AWS A10G

val_bpb
0.9901
Architecture
Transformer
Optimizer
AdamW
Artifact Size

Training Techniques

Architecture
RoPE
Bidirectional transformer with rotary positional embeddings.
parameters: null
ReLU²
MLP uses ReLU squared activation.
parameters: null
weight tying
Embedding table padded to 1088 entries; no explicit weight tying is mentioned beyond the standard architecture details.
parameters: null
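The ReLU² activation listed above can be sketched as follows; the NumPy implementation and weight shapes are illustrative assumptions, not from the PR:

```python
import numpy as np

def relu2(x):
    # ReLU squared: max(x, 0) ** 2, as used in the MLP.
    return np.maximum(x, 0.0) ** 2

def mlp_block(x, w_in, w_out):
    # Feed-forward block with ReLU^2 activation between the two projections.
    return relu2(x @ w_in) @ w_out
```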
Sequence Length
sequence_length
train_length: 2048
eval_length: null
LR Schedule
warmdown
parameters: {"warmup_steps":300,"warmdown_steps":1500,"schedule":"cosine decay"}
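The schedule could be sketched as below, assuming a linear warmup and a cosine warmdown to zero; only the step counts and the "cosine decay" label come from the card, the exact curve is an assumption:

```python
import math

WARMUP_STEPS = 300     # from the card's parameters
WARMDOWN_STEPS = 1500  # from the card's parameters

def lr_scale(step, total_steps, warmup=WARMUP_STEPS, warmdown=WARMDOWN_STEPS):
    # Linear warmup, constant middle, cosine warmdown over the final steps.
    if step < warmup:
        return step / warmup
    if step >= total_steps - warmdown:
        t = (step - (total_steps - warmdown)) / warmdown  # 0 -> 1
        return 0.5 * (1.0 + math.cos(math.pi * t))
    return 1.0
```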
Regularization
weight decay
parameters: {"weight_decay":0.1}
Other
other
EOS token learning: token 1 is never masked and acts as a document-boundary anchor; dedicated PAD token 1025 is excluded from loss and separated from MASK token 1024.
parameters: {"eos_id":1,"pad_id":1025,"mask_id":1024}
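The token-role rules above can be sketched as a corruption step; the function and NumPy usage are illustrative, only the token ids come from the card:

```python
import numpy as np

EOS_ID, MASK_ID, PAD_ID = 1, 1024, 1025  # ids from the card's parameters

def corrupt(tokens, mask_prob, rng):
    # Diffusion-style corruption: randomly replace tokens with MASK_ID,
    # but never mask EOS (document-boundary anchor) or PAD positions.
    tokens = np.asarray(tokens)
    maskable = (tokens != EOS_ID) & (tokens != PAD_ID)
    masked = (rng.random(tokens.shape) < mask_prob) & maskable
    corrupted = np.where(masked, MASK_ID, tokens)
    # PAD is excluded from the loss by construction: it is never masked.
    loss_mask = masked
    return corrupted, loss_mask
```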
other
The shard-rotation training loader keeps only a subset of dataset shards in memory at a time and rotates between shard groups, enabling training on the full FineWeb dataset under limited RAM.
parameters: {"shards_in_memory":4,"rotate_shards":true,"max_train_shards":0}
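A loader along these lines could look like the sketch below; only shards_in_memory=4 comes from the parameters, and this illustrative version makes a single pass over the shard groups rather than cycling:

```python
SHARDS_IN_MEMORY = 4  # from the card's parameters

def rotate_shards(shard_paths, load_fn, group_size=SHARDS_IN_MEMORY):
    # Generator that loads `group_size` shards at a time, yields their
    # examples, then drops the group before loading the next one so that
    # only a bounded number of shards is resident in memory.
    for start in range(0, len(shard_paths), group_size):
        group = [load_fn(p) for p in shard_paths[start:start + group_size]]
        for shard in group:
            yield from shard
        del group  # explicit memory freeing before the next rotation
```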
other
An attention head count sweep found validation BPB effectively invariant across 2, 4, 8, 16, and 32 heads at a fixed model dimension.
parameters: {"head_counts":[2,4,8,16,32],"model_dim":512}
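Since the model dimension is held fixed, the sweep only varies per-head width; a trivial sketch of that relationship (illustrative, not the PR's sweep code):

```python
MODEL_DIM = 512                  # from the card's parameters
HEAD_COUNTS = [2, 4, 8, 16, 32]  # from the card's parameters

def head_dims(model_dim=MODEL_DIM, head_counts=HEAD_COUNTS):
    # Per-head width for each sweep point; total attention width is constant.
    return {h: model_dim // h for h in head_counts if model_dim % h == 0}
```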

Novel Contributions

  • EOS learning with document-boundary anchors during diffusion
  • Dedicated PAD token kept separate from the MASK token to avoid collisions between structural padding and diffusion masking
  • Full FineWeb dataset training via shard rotation with explicit memory freeing and one-at-a-time shard loading
  • Finding that attention head count is effectively invariant for bidirectional diffusion LMs at fixed model dimension