val_bpb: 1.3485
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.63 MB
Training Techniques
Architecture
- weight tying: The baseline GPT's tied input/output embeddings are retained.
- U-Net skip connections: U-Net-style skip connections from the baseline are retained.
- RoPE: Rotary positional embeddings.
- ReLU²: Squared-ReLU activation.
- bidirectional attention: Attention is switched from causal to non-causal for masked diffusion training (parameters: {"is_causal": false}).
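The causal-to-bidirectional switch amounts to dropping the triangular mask from scaled dot-product attention. A minimal single-head NumPy sketch, with illustrative names (the entry only specifies `is_causal: false`, not this implementation):

```python
import numpy as np

def attention(q, k, v, is_causal=True):
    """Scaled dot-product attention over a (T, d) sequence.
    is_causal=False gives the bidirectional attention used here
    for masked diffusion training."""
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    if is_causal:
        # Mask out future positions: each query may only attend
        # to keys at the same or earlier positions.
        allowed = np.tril(np.ones((T, T), dtype=bool))
        scores = np.where(allowed, scores, -np.inf)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))
k = rng.standard_normal((4, 8))
v = rng.standard_normal((4, 8))
causal = attention(q, k, v, is_causal=True)   # position 0 sees only itself
bidir = attention(q, k, v, is_causal=False)   # every position sees all others
```

With the causal flag, the first output row equals `v[0]` exactly, since position 0 can attend only to itself; bidirectionally it mixes all four positions.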
Quantization
- int8 (bits: 8, scope: all)
Compression
- zlib (level: unspecified)
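The artifact-size entry is consistent with int8 quantization of the weights followed by zlib compression. A hypothetical round-trip sketch; the symmetric per-tensor scaling and the compression level are assumptions, since the entry leaves both details unspecified:

```python
import zlib
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: scale so max |w| maps to 127."""
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def pack(w):
    """Quantize to int8, then zlib-compress the raw bytes."""
    q, scale = quantize_int8(w)
    blob = zlib.compress(q.tobytes(), level=9)  # level 9 is an assumption
    return blob, scale

def unpack(blob, scale, shape):
    """Decompress and dequantize back to float32."""
    q = np.frombuffer(zlib.decompress(blob), dtype=np.int8).reshape(shape)
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)
blob, scale = pack(w)
w_hat = unpack(blob, scale, w.shape)
```

Rounding is the only error source here, so the reconstruction error is bounded by half a quantization step (`scale / 2`) per weight.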
Optimizer
- Muon (weight_decay, momentum, other hyperparameters: unspecified)
Evaluation
- Quadrature over mask ratios (parameters: {"points": 8, "range": [0.05, 0.95], "rule": "trapezoidal"})
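The evaluation entry can be read as averaging the loss over 8 mask ratios t in [0.05, 0.95] with composite trapezoidal weights. A sketch under that reading; `loss_at_ratio` is a stand-in for a real evaluation pass at a fixed mask ratio:

```python
import numpy as np

def trapezoid_weights(points=8, lo=0.05, hi=0.95):
    """Nodes and weights for composite trapezoidal quadrature on [lo, hi]:
    interior nodes get weight h, endpoints h/2."""
    t = np.linspace(lo, hi, points)
    h = (hi - lo) / (points - 1)
    w = np.full(points, h)
    w[0] = w[-1] = h / 2
    return t, w

def estimate_loss(loss_at_ratio):
    """Approximate the mean loss over mask ratios by quadrature,
    normalizing by the interval length so a constant loss is returned as-is."""
    t, w = trapezoid_weights()
    vals = np.array([loss_at_ratio(ti) for ti in t])
    return float((w * vals).sum() / (t[-1] - t[0]))
```

The trapezoidal rule is exact for constant and linear integrands, which makes the estimator easy to sanity-check before plugging in a real model.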
Other
- Masked Diffusion Language Model training objective: randomly mask tokens, predict the masked tokens with bidirectional context, and weight the loss by 1/t so that it forms an ELBO upper bound on NLL (parameters: {"mask_token_id": 1024, "eps": 0.1}).
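A minimal sketch of that objective's data path, assuming the mask ratio t is drawn uniformly from [eps, 1) and that 1/t multiplies the mean cross-entropy over masked positions; per-token losses are stubbed, since the entry does not specify the model:

```python
import numpy as np

MASK_TOKEN_ID = 1024  # new token appended after the original 1024-entry vocab
EPS = 0.1             # lower bound on the mask ratio, from the entry

def diffusion_batch(tokens, rng):
    """Sample a mask ratio t, replace a t-fraction of tokens with MASK,
    and return inputs, targets, the mask, and the 1/t ELBO loss weight."""
    t = rng.uniform(EPS, 1.0)
    mask = rng.random(tokens.shape) < t
    x = np.where(mask, MASK_TOKEN_ID, tokens)
    return x, tokens, mask, 1.0 / t

def weighted_loss(token_losses, mask, weight):
    """Mean cross-entropy over masked positions only, scaled by 1/t."""
    return weight * token_losses[mask].mean()

tokens = np.arange(32) % 1024
rng = np.random.default_rng(0)
x, targets, mask, w = diffusion_batch(tokens, rng)
```

Only masked positions contribute to the loss; unmasked tokens pass through unchanged and serve purely as bidirectional context.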
Novel Contributions
- Converts the baseline autoregressive GPT into a Masked Diffusion Language Model
- Uses random token masking with bidirectional attention for discrete diffusion training
- Applies 1/t loss weighting to form an ELBO upper bound on NLL
- Approximates evaluation over mask ratios with 8-point trapezoidal quadrature
- Adds a dedicated MASK token by expanding the vocabulary from 1024 to 1025
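The vocabulary expansion in the last bullet can be sketched as appending one row to the embedding matrix; with the weight tying listed above, the same matrix serves the output head, so a single expansion covers both. The small-random initialization of the new row is an assumption:

```python
import numpy as np

def expand_embedding(emb, n_new=1, rng=None):
    """Append rows for new tokens (e.g. MASK at id 1024) to an embedding
    matrix, leaving all existing rows untouched."""
    rng = rng or np.random.default_rng(0)
    # Small random init for the new row(s) -- an assumption; the entry
    # does not say how the MASK embedding is initialized.
    new_rows = 0.02 * rng.standard_normal((n_new, emb.shape[1]))
    return np.concatenate([emb, new_rows], axis=0)

emb = np.zeros((1024, 64))          # original vocab: ids 0..1023
emb2 = expand_embedding(emb)        # expanded: ids 0..1024, MASK = 1024
```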