PR #1699 (open)

Non-record: 19.2M MDLM Text Diffusion: fp8 e4m3 + EMA 0.999 + Muon LR 0.02

val_bpb: 1.4831
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.83 MB

Training Techniques

Quantization
fp8 e4m3
bits: 8
scope: all
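For readers unfamiliar with the format: e4m3 spends 4 bits on exponent and 3 on mantissa, so the normal range carries at most one part in 16 of relative rounding error, and values saturate at ±448. A minimal pure-Python sketch of the rounding, not the submission's actual cast (in practice one would use a native fp8 dtype such as `torch.float8_e4m3fn`):

```python
import math

def quantize_e4m3(x: float) -> float:
    """Round x to the nearest fp8 e4m3 value (1 sign, 4 exponent, 3 mantissa bits).

    Illustrative sketch only. e4m3 has exponent bias 7, a maximum finite
    value of 448, and a smallest normal exponent of -6.
    """
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), 448.0)                   # saturate at the e4m3 max
    e = max(math.floor(math.log2(mag)), -6)    # clamp at the subnormal boundary
    step = 2.0 ** (e - 3)                      # 3 mantissa bits -> 8 steps per binade
    return sign * min(round(mag / step) * step, 448.0)
```

For example, 0.3 rounds to 0.3125 (the nearest representable value), and anything above 448 saturates.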
Architecture
GQA
Bidirectional masked diffusion transformer with non-causal, grouped-query attention.
parameters: {"layers":8,"dim":576,"heads":8,"kv_heads":4}
weight tying
The embedding table is tied to the output head; the [MASK] token is excluded from the prediction vocabulary.
parameters: {"vocab_size_plus_mask":1025}
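A sketch of the tied output head, assuming the [MASK] row is the last row of the embedding table (dimensions taken from the parameters above; all names hypothetical):

```python
import numpy as np

# Dimensions from the submission: 1024 real tokens + 1 [MASK] token, dim 576.
VOCAB, D = 1025, 576
MASK_ID = VOCAB - 1  # assumption: [MASK] is the last embedding row

emb = np.random.randn(VOCAB, D).astype(np.float32) * 0.02

def tied_logits(h: np.ndarray) -> np.ndarray:
    """Output head reuses the embedding table (weight tying), but drops the
    [MASK] row so the model can never predict [MASK] itself."""
    return h @ emb[:MASK_ID].T  # (batch, D) @ (D, 1024) -> (batch, 1024)
```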
U-Net skip connections
Transformer uses U-Net style encoder-decoder skip connections with learned skip weights.
parameters: {"encoder_layers":4,"decoder_layers":4}
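The skip pattern above can be sketched as a stack: the 4 encoder blocks push their activations, and the 4 decoder blocks pop and blend them back in, each with its own learned scalar weight (a minimal hypothetical sketch, not the submission's code):

```python
def unet_transformer(x, blocks, skip_w):
    """U-Net-style skips over a stack of transformer blocks: the first half
    (encoder) pushes outputs onto a stack; the second half (decoder) pops
    them and adds each back scaled by a learned scalar skip weight."""
    n = len(blocks)
    stack = []
    for block in blocks[: n // 2]:        # encoder half
        x = block(x)
        stack.append(x)
    for i, block in enumerate(blocks[n // 2 :]):  # decoder half
        x = x + skip_w[i] * stack.pop()   # learned scalar skip weight
        x = block(x)
    return x
```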
ReLU²
MLP uses squared-ReLU (relu(x)²) activation with a 2× hidden expansion.
parameters: {"mlp_hidden_dim":1152}
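A sketch of the MLP with the listed dimensions. Squared ReLU keeps ReLU's sparsity but is smooth at zero and grows faster for positive inputs (weight names are hypothetical):

```python
import numpy as np

D, H = 576, 1152  # model dim and 2x-expanded hidden dim from the submission

def mlp_relu2(x, w_in, w_out):
    """MLP block with squared-ReLU activation: relu(h)^2 instead of relu(h)."""
    h = x @ w_in                      # (..., 576) -> (..., 1152)
    h = np.maximum(h, 0.0) ** 2       # ReLU^2
    return h @ w_out                  # (..., 1152) -> (..., 576)
```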
RoPE
Uses rotary positional embeddings.
parameters: {"base":10000}
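For reference, RoPE rotates consecutive dimension pairs of queries and keys by a position-dependent angle, so dot products depend only on relative position. A minimal sketch with the listed base (not the submission's implementation):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotary positional embedding: rotate each consecutive dim pair of x
    by angle pos * base^(-2i/d). x has shape (..., d) with d even.
    Rotations preserve vector norms."""
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)   # (d/2,) pair frequencies
    theta = pos * inv_freq                         # rotation angles
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin           # 2-D rotation per pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```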
Weight Averaging
EMA
parameters: {"decay":0.999,"start_frac":0.1,"shadow_dtype":"fp32"}
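The fp32 shadow matters at this decay: each update moves the shadow by only 0.1% of the gap to the live weight, while bf16 has roughly 0.4% relative spacing, so a bf16 shadow would often round the step to zero and freeze. A minimal sketch (parameter handling is simplified to a dict of scalars):

```python
class EMA:
    """Exponential moving average of weights kept in an fp32 shadow copy.

    At decay 0.999 each update moves the shadow by only (1 - decay) = 0.1%
    of the gap to the live weight; storing the shadow in bf16 (~0.4%
    relative spacing) would frequently round that step away. Sketch only.
    """

    def __init__(self, params, decay=0.999):
        self.decay = decay
        self.shadow = {k: float(v) for k, v in params.items()}  # fp32 copy

    def update(self, params):
        d = self.decay
        for k, v in params.items():
            self.shadow[k] = d * self.shadow[k] + (1.0 - d) * float(v)
```

The listed `start_frac: 0.1` would mean averaging only begins after the first 10% of training.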
Optimizer
Muon
weight_decay: null
momentum: 0.95
other_params: {"lr":0.02,"newton_schulz_steps":5}
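Muon's core step approximately orthogonalizes each gradient matrix with a quintic Newton-Schulz iteration before applying it as the update direction. A sketch using the coefficients from the public Muon reference implementation (the momentum and weight application are omitted):

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=5):
    """Approximately orthogonalize a gradient matrix via 5 quintic
    Newton-Schulz iterations (coefficients from the Muon reference code).
    Drives the singular values of g toward 1 without an explicit SVD."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)   # normalize so singular values <= 1
    transposed = x.shape[0] > x.shape[1]
    if transposed:                       # iterate on the short side
        x = x.T
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * (s @ s)) @ x
    return x.T if transposed else x
```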
Adam
weight_decay: null
momentum: null
other_params: {"used_for":"embeddings and scalars","lr":0.05,"beta1":0.9,"beta2":0.95}
Regularization
logit softcap
parameters: {"value":30}
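Soft-capping squashes logits into (-cap, cap) with a tanh: near-identity for small logits, saturating smoothly instead of clipping hard, which keeps gradients finite for outliers. With the listed value of 30:

```python
import math

def softcap(logit, cap=30.0):
    """Soft-cap a logit to (-cap, cap): ~identity for |logit| << cap,
    smooth saturation for large magnitudes."""
    return cap * math.tanh(logit / cap)
```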
Compression
zlib
level: 9
lzma
level: 6
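The listing does not say how the two codecs are combined; presumably they were compared for compressing the final artifact. A sketch that simply measures both at the listed settings (zlib level 9, lzma preset 6) using the Python standard library:

```python
import lzma
import zlib

def compressed_sizes(blob: bytes):
    """Compress the same payload with both codecs at the listed settings,
    verify round-trips, and return the two compressed sizes."""
    z = zlib.compress(blob, level=9)
    x = lzma.compress(blob, preset=6)
    assert zlib.decompress(z) == blob and lzma.decompress(x) == blob
    return len(z), len(x)
```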
Sequence Length
sequence_length
train_length: 1024
eval_length: null
Other
other
Low-discrepancy stratified time sampling during training for diffusion masking.
parameters: {"t_range":[0.001,1]}
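One common low-discrepancy scheme (the exact variant used here is an assumption) draws a single shared uniform offset and places one diffusion time per stratum, so every batch covers the whole t range evenly instead of clumping:

```python
import random

def stratified_times(batch_size, t_min=0.001, t_max=1.0, rng=random):
    """Low-discrepancy stratified time sampling: one shared uniform offset u,
    one sample per stratum [i/B, (i+1)/B), rescaled to [t_min, t_max].
    Reduces variance of the diffusion loss versus i.i.d. uniform times."""
    u = rng.random()  # single shared offset for the batch
    return [t_min + (t_max - t_min) * (i + u) / batch_size
            for i in range(batch_size)]
```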
Evaluation
chain-rule eval
parameters: {"variants":["left-to-right","confidence-order"]}
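Because a masked diffusion model predicts any position given any set of revealed positions, an exact likelihood can be computed by the chain rule in any fixed order: left-to-right matches autoregressive evaluation, while confidence order reveals the model's most confident position first. A sketch with the model abstracted to a conditional-probability callback (all names hypothetical):

```python
import math

def chain_rule_bits(tokens, cond_prob, order):
    """Score a sequence with an any-order chain rule: reveal one position at
    a time in `order`, accumulating -log2 p(x_i | x_revealed) at each step.
    cond_prob(tokens, revealed, i) stands in for the masked model's
    conditional probability of the true token at position i."""
    revealed = set()
    bits = 0.0
    for i in order:
        bits += -math.log2(cond_prob(tokens, revealed, i))
        revealed.add(i)
    return bits
```

Here `order` is precomputed; a confidence-order variant would instead pick the next position greedily by model confidence at each step.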

Novel Contributions

  • First text diffusion submission to Parameter Golf
  • Bidirectional masked diffusion LM (MDLM) for text
  • fp8 e4m3 quantization to fit a 19.2M parameter model into ~15.8MB
  • EMA with an fp32 shadow to avoid bf16 shadow weights freezing at high decay
  • Muon learning rate reduction to 0.02 for the MDLM objective
  • Comparison of NELBO, left-to-right chain-rule, and confidence-order chain-rule evaluation
  • Empirical analysis of the diffusion-vs-autoregressive BPB gap across prefix lengths