PR #1582

open

Non-record: MDLM Masked Diffusion + Depth Recurrence — val_bpb 1.3428 (8×H100, seed=1337)

by He-Wenhao
val_bpb
1.3428
Architecture
Transformer
Optimizer
Muon
Artifact Size
14.73 MB

Training Techniques

Architecture
depth recurrence
Physical layers L1-L3 are looped once extra, raising effective depth from 9 physical to 12 effective layers.
parameters: {"physical_layers":9,"effective_layers":12,"recurrent_layers":3,"extra_loops":1}
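As a hedged sketch (layer internals elided; the block/rest split and function names are illustrative, not the PR's actual code), the recurrence runs the first three layers one extra time so 9 physical layers yield 12 effective layers:

```python
def forward_with_recurrence(layers, x, recurrent_layers=3, extra_loops=1):
    """Run `layers` in order, looping the first `recurrent_layers` of them
    `extra_loops` extra times before the normal pass continues."""
    block, rest = layers[:recurrent_layers], layers[recurrent_layers:]
    depth = 0
    for _ in range(1 + extra_loops):      # recurrent block runs twice here
        for f in block:
            x = f(x)
            depth += 1
    for f in rest:
        x = f(x)
        depth += 1
    return x, depth
```

With 9 layers, `recurrent_layers=3`, and `extra_loops=1`, the block executes 3 + 3 + 6 = 12 layer applications, matching the reported effective depth.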
ReLU²
Uses squared ReLU activation in the MLP.
parameters: {"hidden_dim":1024}
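A minimal sketch of the activation, assuming a standard two-matrix MLP (the weight names are illustrative; only the ReLU² form and the 1024 hidden width come from the PR):

```python
import numpy as np

def relu2(x):
    """Squared ReLU: max(x, 0) ** 2."""
    return np.maximum(x, 0.0) ** 2

def mlp(x, w_in, w_out):
    """Two-layer MLP with ReLU^2 activation; hidden width is 1024 in this PR."""
    return relu2(x @ w_in) @ w_out
```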
U-Net skip connections
Learned encoder-to-decoder skip connections are used.
parameters: null
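One plausible shape of such skips, sketched with learned scalar gates (an assumption; the PR's skips could equally be learned projections):

```python
import numpy as np

def decode_with_skips(dec_layers, skip_weights, enc_acts, x):
    """Each decoder block sees its input plus a learned scalar times the
    matching encoder activation, paired in reverse (U-Net) order.
    Sketch only: the real skip parameterization is not shown in the PR."""
    for f, w, h in zip(dec_layers, skip_weights, reversed(enc_acts)):
        x = f(x + w * h)
    return x
```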
GQA
Grouped query attention with kv_groups=4.
parameters: {"heads":8,"kv_groups":4}
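The core of GQA is that each KV head serves several query heads; with heads=8 and kv_groups=4, KV heads are shared pairwise. A minimal sketch of the KV expansion step (tensor layout assumed):

```python
import numpy as np

def repeat_kv(kv, n_heads):
    """Expand KV heads of shape (kv_groups, seq, head_dim) to n_heads by
    repetition, so each KV head serves n_heads // kv_groups query heads
    (8 // 4 = 2 in this PR)."""
    kv_groups = kv.shape[0]
    assert n_heads % kv_groups == 0
    return np.repeat(kv, n_heads // kv_groups, axis=0)
```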
Quantization
STE QAT
bits: 8
scope: weights
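A sketch of the forward pass of 8-bit symmetric fake quantization, assuming per-tensor scaling (the PR's exact scaling granularity for QAT is not shown); under STE, gradients would flow through this op as if it were the identity:

```python
import numpy as np

def fake_quant(w, bits=8):
    """STE QAT forward: symmetric fake quantization of weights.
    The straight-through estimator treats this as identity in backward."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for 8 bits
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale
```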
GPTQ-lite
bits: 8
scope: per-row weights
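The novel-contributions list describes a percentile clip search with min-MSE selection. A hypothetical reconstruction (the percentile grid and epsilon guard are assumptions, not from the PR):

```python
import numpy as np

def quant_row(row, clip, bits=8):
    """Quantize one weight row to `bits` with a symmetric clip threshold."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(clip, 1e-12) / qmax
    return np.clip(np.round(row / scale), -qmax - 1, qmax) * scale

def percentile_clip_search(w, percentiles=(99.0, 99.5, 99.9, 100.0), bits=8):
    """Per-row clip search: try each percentile of |row| as the clip value
    and keep the one minimizing reconstruction MSE."""
    clips = np.empty(w.shape[0])
    for i, row in enumerate(w):
        cands = [np.percentile(np.abs(row), p) for p in percentiles]
        mses = [np.mean((row - quant_row(row, c, bits)) ** 2) for c in cands]
        clips[i] = cands[int(np.argmin(mses))]
    return clips
```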
Weight Averaging
EMA
parameters: {"decay":0.997}
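The EMA update itself is one line per tensor; a minimal sketch with the PR's decay of 0.997 (applied per training step, with the averaged weights serialized at the end):

```python
def ema_update(avg, params, decay=0.997):
    """One EMA step over weight tensors: avg <- decay*avg + (1-decay)*params."""
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg, params)]
```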
LR Schedule
warmdown
parameters: {"final_lr_ratio":0}
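A sketch of a warmdown schedule, assuming the common hold-then-linear-decay shape (the hold/decay split is an assumption; only final_lr_ratio=0 comes from the PR):

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_steps, final_lr_ratio=0.0):
    """Hold base_lr, then decay linearly to final_lr_ratio * base_lr over
    the last `warmdown_steps` steps (final_lr_ratio = 0 in this PR)."""
    if step < total_steps - warmdown_steps:
        return base_lr
    frac = (total_steps - step) / warmdown_steps   # 1.0 -> 0.0 over warmdown
    return base_lr * (final_lr_ratio + (1.0 - final_lr_ratio) * frac)
```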
Optimizer
Muon
weight_decay: 0.01
momentum: null
other_params: {"adam_weight_decay":0}
Compression
zlib
level: 9
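Compressing the serialized artifact at zlib level 9 is a one-liner in the standard library; sketch with a stand-in payload (the real input would be the serialized weights):

```python
import zlib

payload = bytes(1024)                      # highly compressible stand-in bytes
packed = zlib.compress(payload, 9)         # level 9 = maximum compression
restored = zlib.decompress(packed)
```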
Other
other
Antithetic sampling is used for variance reduction during training.
parameters: null
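For masked-diffusion training, antithetic sampling typically means pairing each sampled diffusion time u with 1 - u. A sketch under that assumption (the PR does not show its exact pairing of masking rates):

```python
import numpy as np

def antithetic_times(batch_size, rng):
    """Antithetic diffusion-time pairs: draw u for half the batch and use
    1 - u for the other half, a standard variance-reduction trick."""
    half = rng.random(batch_size // 2)
    return np.concatenate([half, 1.0 - half])
```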
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024

Novel Contributions

  • Depth recurrence to increase effective depth without increasing physical layers
  • STE QAT applied late in training to reduce quantization loss
  • EMA before serialization to improve post-roundtrip performance
  • GPTQ-lite percentile clip search with min-MSE selection
  • Combination of Muon warmdown, ReLU² MLP, GQA, and U-Net skip connections