PR #1699 (open)

Non-record: 19.2M MDLM Text Diffusion: fp8 e4m3 + EMA 0.999 + Muon LR 0.02

val_bpb: 1.4831
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.83 MB

Training Techniques

Quantization
fp8 e4m3
bits: 8
scope: all
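For readers unfamiliar with the format: e4m3 spends 4 bits on exponent and 3 on mantissa, so the normal range carries at most one part in 16 of relative rounding error, and values saturate at ±448. A minimal pure-Python sketch of the rounding, not the submission's actual cast (in practice one would use a native fp8 dtype such as `torch.float8_e4m3fn`):

```python
import math

def quantize_e4m3(x: float) -> float:
    """Round x to the nearest fp8 e4m3 value (1 sign, 4 exponent, 3 mantissa bits).

    Illustrative sketch only. e4m3 has exponent bias 7, a maximum finite
    value of 448, and a smallest normal exponent of -6.
    """
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), 448.0)                   # saturate at the e4m3 max
    e = max(math.floor(math.log2(mag)), -6)    # clamp at the subnormal boundary
    step = 2.0 ** (e - 3)                      # 3 mantissa bits -> 8 steps per binade
    return sign * min(round(mag / step) * step, 448.0)
```

For example, 0.3 rounds to 0.3125 (the nearest representable value), and anything above 448 saturates.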
Architecture
GQA
Bidirectional masked diffusion transformer with non-causal, grouped-query attention.
parameters: {"layers":8,"dim":576,"heads":8,"kv_heads":4}
weight tying
The embedding table is tied to the output head; the [MASK] token is excluded from the prediction vocabulary.
parameters: {"vocab_size_plus_mask":1025}
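A sketch of the tied output head, assuming the [MASK] row is the last row of the embedding table (dimensions taken from the parameters above; all names hypothetical):

```python
import numpy as np

# Dimensions from the submission: 1024 real tokens + 1 [MASK] token, dim 576.
VOCAB, D = 1025, 576
MASK_ID = VOCAB - 1  # assumption: [MASK] is the last embedding row

emb = np.random.randn(VOCAB, D).astype(np.float32) * 0.02

def tied_logits(h: np.ndarray) -> np.ndarray:
    """Output head reuses the embedding table (weight tying), but drops the
    [MASK] row so the model can never predict [MASK] itself."""
    return h @ emb[:MASK_ID].T  # (batch, D) @ (D, 1024) -> (batch, 1024)
```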
U-Net skip connections
Transformer uses U-Net style encoder-decoder skip connections with learned skip weights.
parameters: {"encoder_layers":4,"decoder_layers":4}
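The skip pattern above can be sketched as a stack: the 4 encoder blocks push their activations, and the 4 decoder blocks pop and blend them back in, each with its own learned scalar weight (a minimal hypothetical sketch, not the submission's code):

```python
def unet_transformer(x, blocks, skip_w):
    """U-Net-style skips over a stack of transformer blocks: the first half
    (encoder) pushes outputs onto a stack; the second half (decoder) pops
    them and adds each back scaled by a learned scalar skip weight."""
    n = len(blocks)
    stack = []
    for block in blocks[: n // 2]:        # encoder half
        x = block(x)
        stack.append(x)
    for i, block in enumerate(blocks[n // 2 :]):  # decoder half
        x = x + skip_w[i] * stack.pop()   # learned scalar skip weight
        x = block(x)
    return x
```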
ReLU²
MLP uses squared-ReLU (relu(x)²) activation with a 2× hidden expansion.
parameters: {"mlp_hidden_dim":1152}
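A sketch of the MLP with the listed dimensions. Squared ReLU keeps ReLU's sparsity but is smooth at zero and grows faster for positive inputs (weight names are hypothetical):

```python
import numpy as np

D, H = 576, 1152  # model dim and 2x-expanded hidden dim from the submission

def mlp_relu2(x, w_in, w_out):
    """MLP block with squared-ReLU activation: relu(h)^2 instead of relu(h)."""
    h = x @ w_in                      # (..., 576) -> (..., 1152)
    h = np.maximum(h, 0.0) ** 2       # ReLU^2
    return h @ w_out                  # (..., 1152) -> (..., 576)
```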
RoPE
Uses rotary positional embeddings.
parameters: {"base":10000}
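For reference, RoPE rotates consecutive dimension pairs of queries and keys by a position-dependent angle, so dot products depend only on relative position. A minimal sketch with the listed base (not the submission's implementation):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotary positional embedding: rotate each consecutive dim pair of x
    by angle pos * base^(-2i/d). x has shape (..., d) with d even.
    Rotations preserve vector norms."""
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)   # (d/2,) pair frequencies
    theta = pos * inv_freq                         # rotation angles
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin           # 2-D rotation per pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```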
Weight Averaging
EMA
parameters: {"decay":0.999,"start_frac":0.1,"shadow_dtype":"fp32"}
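The fp32 shadow matters at this decay: each update moves the shadow by only 0.1% of the gap to the live weight, while bf16 has roughly 0.4% relative spacing, so a bf16 shadow would often round the step to zero and freeze. A minimal sketch (parameter handling is simplified to a dict of scalars):

```python
class EMA:
    """Exponential moving average of weights kept in an fp32 shadow copy.

    At decay 0.999 each update moves the shadow by only (1 - decay) = 0.1%
    of the gap to the live weight; storing the shadow in bf16 (~0.4%
    relative spacing) would frequently round that step away. Sketch only.
    """

    def __init__(self, params, decay=0.999):
        self.decay = decay
        self.shadow = {k: float(v) for k, v in params.items()}  # fp32 copy

    def update(self, params):
        d = self.decay
        for k, v in params.items():
            self.shadow[k] = d * self.shadow[k] + (1.0 - d) * float(v)
```

The listed `start_frac: 0.1` would mean averaging only begins after the first 10% of training.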
Optimizer
Muon
weight_decay: null
momentum: 0.95
other_params: {"lr":0.02,"newton_schulz_steps":5}
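Muon's core step approximately orthogonalizes each gradient matrix with a quintic Newton-Schulz iteration before applying it as the update direction. A sketch using the coefficients from the public Muon reference implementation (the momentum and weight application are omitted):

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=5):
    """Approximately orthogonalize a gradient matrix via 5 quintic
    Newton-Schulz iterations (coefficients from the Muon reference code).
    Drives the singular values of g toward 1 without an explicit SVD."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)   # normalize so singular values <= 1
    transposed = x.shape[0] > x.shape[1]
    if transposed:                       # iterate on the short side
        x = x.T
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * (s @ s)) @ x
    return x.T if transposed else x
```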
Adam
weight_decay: null
momentum: null
other_params: {"used_for":"embeddings and scalars","lr":0.05,"beta1":0.9,"beta2":0.95}
Regularization
logit softcap
parameters: {"value":30}
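Soft-capping squashes logits into (-cap, cap) with a tanh: near-identity for small logits, saturating smoothly instead of clipping hard, which keeps gradients finite for outliers. With the listed value of 30:

```python
import math

def softcap(logit, cap=30.0):
    """Soft-cap a logit to (-cap, cap): ~identity for |logit| << cap,
    smooth saturation for large magnitudes."""
    return cap * math.tanh(logit / cap)
```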
Compression
zlib
level: 9
lzma
level: 6
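The listing does not say how the two codecs are combined; presumably they were compared for compressing the final artifact. A sketch that simply measures both at the listed settings (zlib level 9, lzma preset 6) using the Python standard library:

```python
import lzma
import zlib

def compressed_sizes(blob: bytes):
    """Compress the same payload with both codecs at the listed settings,
    verify round-trips, and return the two compressed sizes."""
    z = zlib.compress(blob, level=9)
    x = lzma.compress(blob, preset=6)
    assert zlib.decompress(z) == blob and lzma.decompress(x) == blob
    return len(z), len(x)
```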
Sequence Length
sequence_length
train_length: 1024
eval_length: null
Other
other
Low-discrepancy stratified time sampling during training for diffusion masking.
parameters: {"t_range":[0.001,1]}
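One common low-discrepancy scheme (the exact variant used here is an assumption) draws a single shared uniform offset and places one diffusion time per stratum, so every batch covers the whole t range evenly instead of clumping:

```python
import random

def stratified_times(batch_size, t_min=0.001, t_max=1.0, rng=random):
    """Low-discrepancy stratified time sampling: one shared uniform offset u,
    one sample per stratum [i/B, (i+1)/B), rescaled to [t_min, t_max].
    Reduces variance of the diffusion loss versus i.i.d. uniform times."""
    u = rng.random()  # single shared offset for the batch
    return [t_min + (t_max - t_min) * (i + u) / batch_size
            for i in range(batch_size)]
```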
Evaluation
chain-rule eval
parameters: {"variants":["left-to-right","confidence-order"]}
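Because a masked diffusion model predicts any position given any set of revealed positions, an exact likelihood can be computed by the chain rule in any fixed order: left-to-right matches autoregressive evaluation, while confidence order reveals the model's most confident position first. A sketch with the model abstracted to a conditional-probability callback (all names hypothetical):

```python
import math

def chain_rule_bits(tokens, cond_prob, order):
    """Score a sequence with an any-order chain rule: reveal one position at
    a time in `order`, accumulating -log2 p(x_i | x_revealed) at each step.
    cond_prob(tokens, revealed, i) stands in for the masked model's
    conditional probability of the true token at position i."""
    revealed = set()
    bits = 0.0
    for i in order:
        bits += -math.log2(cond_prob(tokens, revealed, i))
        revealed.add(i)
    return bits
```

Here `order` is precomputed; a confidence-order variant would instead pick the next position greedily by model confidence at each step.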

Novel Contributions

  • First text diffusion submission to Parameter Golf
  • Bidirectional masked diffusion LM (MDLM) for text
  • fp8 e4m3 quantization to fit a 19.2M parameter model into ~15.8MB
  • EMA with an fp32 shadow to avoid bf16 shadow weights freezing at high decay
  • Muon learning rate reduction to 0.02 for the MDLM objective
  • Comparison of NELBO, left-to-right chain-rule, and confidence-order chain-rule evaluation
  • Empirical analysis of the diffusion-vs-autoregressive BPB gap across prefix lengths