val_bpb: 1.3600
Architecture: Transformer
Optimizer: —
Artifact Size: ~12.9MB
Training Techniques

Architecture: bidirectional attention
  Uses bidirectional attention during training so each masked token can attend to all other tokens.
  parameters: null

Other: other
  Discrete masked diffusion language model (MDLM) training objective with masked token prediction and pseudo-log-likelihood evaluation.
  parameters: {"mask_rate_range": [0.15, 0.85], "eval_mask_rate": 0.5, "eval_passes": 8}
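The objective above trains on randomly masked inputs with cross-entropy computed only at masked positions, with the mask rate drawn from the card's mask_rate_range of [0.15, 0.85]. A minimal NumPy sketch of one such step, with random logits standing in for the actual model (the vocabulary size, `MASK_ID`, and helper names here are illustrative assumptions, not the card's real configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB = 50        # toy vocabulary size (assumption for illustration)
MASK_ID = VOCAB   # reserve one extra id for the [MASK] token

def mask_tokens(tokens, mask_rate, rng):
    """Replace a random subset of positions with MASK_ID; return masked input and the mask."""
    mask = rng.random(tokens.shape) < mask_rate
    return np.where(mask, MASK_ID, tokens), mask

def masked_ce_loss(logits, targets, mask):
    """Cross-entropy averaged over masked positions only; unmasked positions contribute nothing."""
    logits = logits - logits.max(axis=-1, keepdims=True)          # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    token_ll = np.take_along_axis(log_probs, targets[..., None], axis=-1)[..., 0]
    return -(token_ll * mask).sum() / max(mask.sum(), 1)

# One toy "training step": sample a mask rate from the card's range, mask, score.
tokens = rng.integers(0, VOCAB, size=(4, 16))
mask_rate = rng.uniform(0.15, 0.85)                # mask_rate_range from the card
masked_in, mask = mask_tokens(tokens, mask_rate, rng)
logits = rng.standard_normal((4, 16, VOCAB))      # stand-in for model(masked_in)
loss = masked_ce_loss(logits, tokens, mask)
```

In a real run the random logits would come from the bidirectional Transformer applied to `masked_in`; the key property the sketch shows is that loss gradients flow only through masked positions.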
Novel Contributions
- Discrete masked diffusion language model (MDLM)
- Bidirectional attention during training
- Masked token prediction with cross-entropy (CE) loss computed only on masked positions
- Pseudo-log-likelihood evaluation using multiple masked forward passes
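The pseudo-log-likelihood evaluation in the last bullet can be sketched as follows: run several forward passes, each masking a random eval_mask_rate fraction of tokens (0.5 and 8 passes per the card's parameters), and average the log-likelihood of the true tokens at masked positions. This is a toy NumPy version with a placeholder model (a function returning uniform logits, so the PLL collapses to -ln(vocab)); the function name and shapes are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
VOCAB, MASK_ID = 50, 50   # toy sizes; illustrative only

def pseudo_log_likelihood(tokens, logits_fn, eval_mask_rate=0.5, eval_passes=8, rng=rng):
    """Mean log-likelihood per masked token, averaged over several random maskings."""
    total_ll, total_count = 0.0, 0
    for _ in range(eval_passes):
        mask = rng.random(tokens.shape) < eval_mask_rate
        masked = np.where(mask, MASK_ID, tokens)
        logits = logits_fn(masked)
        logits = logits - logits.max(axis=-1, keepdims=True)
        log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
        token_ll = np.take_along_axis(log_probs, tokens[..., None], axis=-1)[..., 0]
        total_ll += (token_ll * mask).sum()
        total_count += mask.sum()
    return total_ll / max(total_count, 1)

# Placeholder model: all-zero logits, i.e. a uniform distribution over VOCAB tokens,
# so the PLL should come out to exactly -ln(VOCAB).
tokens = rng.integers(0, VOCAB, size=(2, 32))
pll = pseudo_log_likelihood(tokens, lambda x: np.zeros(x.shape + (VOCAB,)))
```

Because each pass masks a different random subset, every position is eventually scored conditioned on (mostly) unmasked context, which is what makes the average a pseudo-log-likelihood rather than an exact one.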