PR #905
Non-record: Prefix-Conditioned Suffix Diffusion — True Discrete Diffusion (diffusion_pll_bpb=1.8587)
Status: open · by anthony-maio
val_bpb
1.8587
Architecture
Transformer
Optimizer
—
Artifact Size
1,679,676 bytes
Training Techniques
Architecture
weight tying
Tied input and output embeddings.
parameters: null
GQA
Uses grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":4,"kv_heads":2}
RoPE
GPT-style backbone using rotary position embeddings (RoPE), per the standard positional setup implied by the starter code.
parameters: null
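The GQA entry above (4 query heads sharing 2 KV heads) can be sketched as follows. This is a minimal illustration of grouped query attention, not the PR's actual code; `gqa_attention` and its shapes are hypothetical, and no causal mask is applied (a diffusion denoiser typically attends bidirectionally):

```python
import numpy as np

HEADS, KV_HEADS = 4, 2  # from parameters: {"heads": 4, "kv_heads": 2}

def gqa_attention(q, k, v):
    """Grouped query attention sketch.

    q has HEADS heads, k/v have only KV_HEADS heads; each KV head is
    shared by HEADS // KV_HEADS query heads.
    Shapes: q (H, T, d), k/v (Hkv, T, d).
    """
    H, T, d = q.shape
    group = H // k.shape[0]
    # Broadcast each KV head to its group of query heads.
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v
```

Sharing KV heads halves the KV-cache and KV-projection parameters relative to full multi-head attention, which matters under the artifact-size constraint noted above.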
Sequence Length
sequence_length
train_length: 512
eval_length: null
Other
other
True discrete diffusion training over token sequences with absorbing-mask corruption on the suffix only, conditioned on a clean prefix.
parameters: {"diffusion_steps":8,"min_prefix":16}
other
Learned timestep embeddings and learned role embeddings for prefix versus diffused suffix.
parameters: null
other
Approximate prefix-conditioned diffusion pseudo-log-likelihood evaluation by masking the suffix and scoring the first masked token.
parameters: {"metric":"diffusion_pll_bpb"}
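The core corruption step described above (absorbing-mask noise on the suffix only, clean prefix kept as conditioning, loss restricted to corrupted positions) can be sketched like this. `corrupt_suffix`, `MASK_ID`, and the linear step/DIFFUSION_STEPS masking schedule are assumptions for illustration; only `diffusion_steps=8` and `min_prefix=16` come from the card's parameters:

```python
import random

MASK_ID = 0           # hypothetical id of the absorbing [MASK] token
DIFFUSION_STEPS = 8   # from parameters: diffusion_steps
MIN_PREFIX = 16       # from parameters: min_prefix

def corrupt_suffix(tokens, step, prefix_len, rng=random):
    """Absorbing-mask corruption on the suffix only.

    The first prefix_len tokens stay clean (the conditioning prefix).
    Each suffix token is independently replaced by MASK_ID with
    probability step / DIFFUSION_STEPS (a common absorbing-state
    schedule; the PR's exact schedule is not stated on the card).
    Returns the corrupted sequence and a per-position loss mask that is
    True only where a token was corrupted, so the denoising loss is
    computed only on corrupted suffix positions.
    """
    assert prefix_len >= MIN_PREFIX
    p = step / DIFFUSION_STEPS
    corrupted, loss_mask = list(tokens), [False] * len(tokens)
    for i in range(prefix_len, len(tokens)):
        if rng.random() < p:
            corrupted[i] = MASK_ID
            loss_mask[i] = True
    return corrupted, loss_mask
```

At `step == DIFFUSION_STEPS` the entire suffix is absorbed into the mask state, which is the fully-noised endpoint the denoiser learns to invert.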
Compression
zlib
level: null
Novel Contributions
- True discrete diffusion model for text rather than an autoregressive model with a diffusion-inspired auxiliary loss
- Prefix-conditioned suffix diffusion with absorbing-mask corruption over suffix tokens only
- Denoising loss computed only on corrupted suffix positions
- Approximate prefix-conditioned diffusion PLL scoring metric
- Demonstration that a discrete diffusion submission can fit within the competition's artifact constraints
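The PLL scoring metric listed above ("masking the suffix and scoring the first masked token") can be sketched as follows. `diffusion_pll_bpb` and the `logprob_fn` hook are hypothetical names; the per-position masking-from-`pos`-onward interpretation follows the card's one-line description and may differ in detail from the PR:

```python
import math

def diffusion_pll_bpb(logprob_fn, tokens, token_bytes, prefix_len, mask_id=0):
    """Approximate prefix-conditioned diffusion PLL in bits per byte.

    logprob_fn(masked_seq, pos) is a hypothetical model hook returning
    the log-probability of the true token at pos given the masked
    sequence; token_bytes[i] is the UTF-8 byte length of token i.

    Each suffix position is scored as the FIRST masked token: everything
    from that position onward is replaced by mask_id and the model
    predicts the token at the mask boundary.
    """
    total_nll, total_bytes = 0.0, 0
    for pos in range(prefix_len, len(tokens)):
        masked = tokens[:pos] + [mask_id] * (len(tokens) - pos)
        total_nll += -logprob_fn(masked, pos)
        total_bytes += token_bytes[pos]
    return total_nll / (math.log(2) * total_bytes)
```

Dividing total negative log-likelihood by `ln(2) * bytes` converts nats per token into bits per byte, matching the `diffusion_pll_bpb=1.8587` headline number's units.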