PR #905

open

Non-record: Prefix-Conditioned Suffix Diffusion — True Discrete Diffusion (diffusion_pll_bpb=1.8587)

by anthony-maio
val_bpb
1.8587
Architecture
Transformer
Optimizer
Artifact Size
1,679,676 bytes

Training Techniques

Architecture
weight tying
Tied input and output embeddings.
parameters: null
GQA
Uses grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":4,"kv_heads":2}
RoPE
Rotary position embeddings; GPT-style backbone with the standard positional setup implied by the starter code.
parameters: null
Sequence Length
sequence_length
train_length: 512
eval_length: null
Other
other
True discrete diffusion training over token sequences with absorbing-mask corruption on the suffix only, conditioned on a clean prefix.
parameters: {"diffusion_steps":8,"min_prefix":16}
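The suffix-only absorbing-mask corruption can be sketched as follows. This is a minimal illustration, not the PR's code: `MASK_ID = 0` and the linear masking schedule `step / DIFFUSION_STEPS` are assumptions; only `diffusion_steps: 8` and `min_prefix: 16` come from the listed parameters.

```python
import random

MASK_ID = 0          # hypothetical absorbing-mask token id (assumption)
DIFFUSION_STEPS = 8  # from the PR's parameters
MIN_PREFIX = 16      # from the PR's parameters

def corrupt_suffix(tokens, step, rng):
    """Absorbing-mask corruption applied to the suffix only.

    A clean prefix of at least MIN_PREFIX tokens is kept intact; each
    suffix token is independently replaced by MASK_ID with probability
    step / DIFFUSION_STEPS (a hypothetical linear schedule).
    """
    prefix_len = rng.randint(MIN_PREFIX, len(tokens) - 1)
    p_mask = step / DIFFUSION_STEPS
    corrupted = list(tokens)
    mask_positions = []
    for i in range(prefix_len, len(tokens)):
        if rng.random() < p_mask:
            corrupted[i] = MASK_ID
            mask_positions.append(i)
    return corrupted, prefix_len, mask_positions

rng = random.Random(0)
seq = list(range(1, 65))  # toy 64-token sequence
noisy, plen, masked = corrupt_suffix(seq, step=8, rng=rng)
```

At the final step the masking probability reaches 1, so the entire suffix is absorbed into `MASK_ID` while the prefix stays clean.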
other
Learned timestep embeddings and learned role embeddings for prefix versus diffused suffix.
parameters: null
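One way such learned timestep and role embeddings could combine with token embeddings is an elementwise sum. Everything below is a toy stand-in: the tables are deterministic dummies in place of learned weights, and the additive composition is an assumption, since the PR does not specify how the embeddings are combined.

```python
D = 4  # toy embedding width (assumption)

def embed_table(n, seed):
    # Deterministic toy table standing in for a learned embedding matrix.
    return [[(seed * 31 + r * 7 + c) % 5 - 2 for c in range(D)] for r in range(n)]

tok_emb  = embed_table(100, seed=1)  # toy vocabulary of 100 tokens
step_emb = embed_table(9,   seed=2)  # one row per diffusion step (0..8)
role_emb = embed_table(2,   seed=3)  # 0 = clean prefix, 1 = diffused suffix

def input_embedding(token, step, is_suffix):
    """Sum token, learned-timestep, and learned-role embeddings."""
    role = 1 if is_suffix else 0
    return [t + s + r for t, s, r in
            zip(tok_emb[token], step_emb[step], role_emb[role])]
```

The role embedding lets the denoiser distinguish conditioning (clean prefix) positions from positions it must reconstruct.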
other
Approximate prefix-conditioned diffusion pseudo-log-likelihood evaluation by masking the suffix and scoring the first masked token.
parameters: {"metric":"diffusion_pll_bpb"}
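The scoring loop behind the approximate PLL metric can be sketched like this: walk the suffix left to right, mask everything from the current position onward, and score only the first masked token under the model. The `toy_model` below is a uniform stand-in for the real denoiser, and `MASK_ID = 0` is an assumption.

```python
import math

MASK_ID = 0  # hypothetical absorbing-mask token id (assumption)

def toy_model(tokens):
    """Stand-in denoiser: per-position distributions over a 4-token toy
    vocabulary (uniform here; the real model would be learned)."""
    return [[0.25, 0.25, 0.25, 0.25] for _ in tokens]

def pll_bits(tokens, prefix_len):
    """Approximate prefix-conditioned diffusion PLL, in bits.

    For each suffix position i, mask positions i..end, run the model,
    and score the first masked token against its true value.
    """
    total = 0.0
    for i in range(prefix_len, len(tokens)):
        masked = tokens[:i] + [MASK_ID] * (len(tokens) - i)
        probs = toy_model(masked)
        total += -math.log2(probs[i][tokens[i]])
    return total

seq = [1, 2, 3, 1, 2, 3, 1, 2]
bits = pll_bits(seq, prefix_len=4)
```

Dividing the accumulated bits by the byte length of the scored text would give a bits-per-byte figure of the kind reported as `diffusion_pll_bpb`.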
Compression
zlib
level: null
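`level: null` presumably means zlib's default compression level. A minimal round-trip sketch (the payload is a placeholder, not the actual artifact):

```python
import zlib

payload = b"model weights would go here" * 100  # placeholder artifact bytes
compressed = zlib.compress(payload)  # default level when none is given
restored = zlib.decompress(compressed)
ratio = len(compressed) / len(payload)
```

Highly repetitive data like this toy payload compresses far below its original size; real weight tensors typically compress much less.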

Novel Contributions

  • True discrete diffusion model for text rather than an autoregressive model with a diffusion-inspired auxiliary loss
  • Prefix-conditioned suffix diffusion with absorbing-mask corruption over suffix tokens only
  • Denoising loss computed only on corrupted suffix positions
  • Approximate prefix-conditioned diffusion PLL scoring metric
  • Demonstration that a discrete diffusion submission can fit within the competition's artifact constraints
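The third bullet, a denoising loss computed only on corrupted suffix positions, can be sketched as a cross-entropy that simply skips clean positions. The uniform probabilities and `MASK_ID = 0` below are illustrative assumptions.

```python
import math

MASK_ID = 0  # hypothetical absorbing-mask token id (assumption)

def denoising_loss(probs, clean, corrupted):
    """Cross-entropy averaged over corrupted (masked) positions only.

    Clean prefix and un-masked suffix positions contribute nothing,
    matching a loss restricted to corrupted suffix tokens.
    """
    terms = [-math.log(probs[i][clean[i]])
             for i in range(len(clean)) if corrupted[i] == MASK_ID]
    return sum(terms) / max(len(terms), 1)

clean = [1, 2, 3, 1]            # toy target tokens
corrupted = [1, 2, MASK_ID, MASK_ID]  # last two positions were absorbed
uniform = [[0.25] * 4 for _ in clean]  # stand-in model output
loss = denoising_loss(uniform, clean, corrupted)
```

Under the uniform stand-in distribution each masked position contributes ln 4 nats, so the mean loss equals ln 4 regardless of how many suffix tokens were masked.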