PR #1582

open

Non-record: MDLM Masked Diffusion + Depth Recurrence — val_bpb 1.3428 (8×H100, seed=1337)

by He-Wenhao
val_bpb
1.3428
Architecture
Transformer
Optimizer
Muon
Artifact Size
14.73 MB

Training Techniques

Architecture
depth recurrence
Physical layers L1-L3 are looped once extra, raising effective depth from 9 physical to 12 effective layers.
parameters: {"physical_layers":9,"effective_layers":12,"recurrent_layers":3,"extra_loops":1}
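As a hedged sketch (layer internals elided; the block/rest split and function names are illustrative, not the PR's actual code), the recurrence runs the first three layers one extra time so 9 physical layers yield 12 effective layers:

```python
def forward_with_recurrence(layers, x, recurrent_layers=3, extra_loops=1):
    """Run `layers` in order, looping the first `recurrent_layers` of them
    `extra_loops` extra times before the normal pass continues."""
    block, rest = layers[:recurrent_layers], layers[recurrent_layers:]
    depth = 0
    for _ in range(1 + extra_loops):      # recurrent block runs twice here
        for f in block:
            x = f(x)
            depth += 1
    for f in rest:
        x = f(x)
        depth += 1
    return x, depth
```

With 9 layers, `recurrent_layers=3`, and `extra_loops=1`, the block executes 3 + 3 + 6 = 12 layer applications, matching the reported effective depth.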
ReLU²
Uses squared ReLU activation in the MLP.
parameters: {"hidden_dim":1024}
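A minimal sketch of the activation, assuming a standard two-matrix MLP (the weight names are illustrative; only the ReLU² form and the 1024 hidden width come from the PR):

```python
import numpy as np

def relu2(x):
    """Squared ReLU: max(x, 0) ** 2."""
    return np.maximum(x, 0.0) ** 2

def mlp(x, w_in, w_out):
    """Two-layer MLP with ReLU^2 activation; hidden width is 1024 in this PR."""
    return relu2(x @ w_in) @ w_out
```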
U-Net skip connections
Learned encoder-to-decoder skip connections are used.
parameters: null
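One plausible shape of such skips, sketched with learned scalar gates (an assumption; the PR's skips could equally be learned projections):

```python
import numpy as np

def decode_with_skips(dec_layers, skip_weights, enc_acts, x):
    """Each decoder block sees its input plus a learned scalar times the
    matching encoder activation, paired in reverse (U-Net) order.
    Sketch only: the real skip parameterization is not shown in the PR."""
    for f, w, h in zip(dec_layers, skip_weights, reversed(enc_acts)):
        x = f(x + w * h)
    return x
```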
GQA
Grouped query attention with kv_groups=4.
parameters: {"heads":8,"kv_groups":4}
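The core of GQA is that each KV head serves several query heads; with heads=8 and kv_groups=4, KV heads are shared pairwise. A minimal sketch of the KV expansion step (tensor layout assumed):

```python
import numpy as np

def repeat_kv(kv, n_heads):
    """Expand KV heads of shape (kv_groups, seq, head_dim) to n_heads by
    repetition, so each KV head serves n_heads // kv_groups query heads
    (8 // 4 = 2 in this PR)."""
    kv_groups = kv.shape[0]
    assert n_heads % kv_groups == 0
    return np.repeat(kv, n_heads // kv_groups, axis=0)
```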
Quantization
STE QAT
bits: 8
scope: weights
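A sketch of the forward pass of 8-bit symmetric fake quantization, assuming per-tensor scaling (the PR's exact scaling granularity for QAT is not shown); under STE, gradients would flow through this op as if it were the identity:

```python
import numpy as np

def fake_quant(w, bits=8):
    """STE QAT forward: symmetric fake quantization of weights.
    The straight-through estimator treats this as identity in backward."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for 8 bits
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale
```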
GPTQ-lite
bits: 8
scope: per-row weights
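The novel-contributions list describes a percentile clip search with min-MSE selection. A hypothetical reconstruction (the percentile grid and epsilon guard are assumptions, not from the PR):

```python
import numpy as np

def quant_row(row, clip, bits=8):
    """Quantize one weight row to `bits` with a symmetric clip threshold."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(clip, 1e-12) / qmax
    return np.clip(np.round(row / scale), -qmax - 1, qmax) * scale

def percentile_clip_search(w, percentiles=(99.0, 99.5, 99.9, 100.0), bits=8):
    """Per-row clip search: try each percentile of |row| as the clip value
    and keep the one minimizing reconstruction MSE."""
    clips = np.empty(w.shape[0])
    for i, row in enumerate(w):
        cands = [np.percentile(np.abs(row), p) for p in percentiles]
        mses = [np.mean((row - quant_row(row, c, bits)) ** 2) for c in cands]
        clips[i] = cands[int(np.argmin(mses))]
    return clips
```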
Weight Averaging
EMA
parameters: {"decay":0.997}
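The EMA update itself is one line per tensor; a minimal sketch with the PR's decay of 0.997 (applied per training step, with the averaged weights serialized at the end):

```python
def ema_update(avg, params, decay=0.997):
    """One EMA step over weight tensors: avg <- decay*avg + (1-decay)*params."""
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg, params)]
```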
LR Schedule
warmdown
parameters: {"final_lr_ratio":0}
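A sketch of a warmdown schedule, assuming the common hold-then-linear-decay shape (the hold/decay split is an assumption; only final_lr_ratio=0 comes from the PR):

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_steps, final_lr_ratio=0.0):
    """Hold base_lr, then decay linearly to final_lr_ratio * base_lr over
    the last `warmdown_steps` steps (final_lr_ratio = 0 in this PR)."""
    if step < total_steps - warmdown_steps:
        return base_lr
    frac = (total_steps - step) / warmdown_steps   # 1.0 -> 0.0 over warmdown
    return base_lr * (final_lr_ratio + (1.0 - final_lr_ratio) * frac)
```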
Optimizer
Muon
weight_decay: 0.01
momentum: null
other_params: {"adam_weight_decay":0}
Compression
zlib
level: 9
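Compressing the serialized artifact at zlib level 9 is a one-liner in the standard library; sketch with a stand-in payload (the real input would be the serialized weights):

```python
import zlib

payload = bytes(1024)                      # highly compressible stand-in bytes
packed = zlib.compress(payload, 9)         # level 9 = maximum compression
restored = zlib.decompress(packed)
```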
Other
other
Antithetic sampling is used for variance reduction during training.
parameters: null
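For masked-diffusion training, antithetic sampling typically means pairing each sampled diffusion time u with 1 - u. A sketch under that assumption (the PR does not show its exact pairing of masking rates):

```python
import numpy as np

def antithetic_times(batch_size, rng):
    """Antithetic diffusion-time pairs: draw u for half the batch and use
    1 - u for the other half, a standard variance-reduction trick."""
    half = rng.random(batch_size // 2)
    return np.concatenate([half, 1.0 - half])
```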
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024

Novel Contributions

  • Depth recurrence to increase effective depth without increasing physical layers
  • STE QAT applied late in training to reduce quantization loss
  • EMA before serialization to improve post-roundtrip performance
  • GPTQ-lite percentile clip search with min-MSE selection
  • Combination of Muon warmdown, ReLU² MLP, GQA, and U-Net skip connections