PR #905
Non-record: Prefix-Conditioned Suffix Diffusion — True Discrete Diffusion (diffusion_pll_bpb=1.8587)
Status: open · by anthony-maio
val_bpb
1.8587
Architecture
Transformer
Optimizer
—
Artifact Size
1,679,676 bytes
Training Techniques
Architecture
weight tying
Tied input and output embeddings.
parameters: null
GQA
Uses grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":4,"kv_heads":2}
RoPE
GPT-style backbone using rotary position embeddings (RoPE), per the standard positional setup implied by the starter code.
parameters: null
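The GQA entry above (4 query heads sharing 2 KV heads) can be sketched as follows. This is a minimal illustration of grouped query attention, not the PR's actual code; `gqa_attention` and its shapes are hypothetical, and no causal mask is applied (a diffusion denoiser typically attends bidirectionally):

```python
import numpy as np

HEADS, KV_HEADS = 4, 2  # from parameters: {"heads": 4, "kv_heads": 2}

def gqa_attention(q, k, v):
    """Grouped query attention sketch.

    q has HEADS heads, k/v have only KV_HEADS heads; each KV head is
    shared by HEADS // KV_HEADS query heads.
    Shapes: q (H, T, d), k/v (Hkv, T, d).
    """
    H, T, d = q.shape
    group = H // k.shape[0]
    # Broadcast each KV head to its group of query heads.
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v
```

Sharing KV heads halves the KV-cache and KV-projection parameters relative to full multi-head attention, which matters under the artifact-size constraint noted above.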
Sequence Length
sequence_length
train_length: 512
eval_length: null
Other
other
True discrete diffusion training over token sequences with absorbing-mask corruption on the suffix only, conditioned on a clean prefix.
parameters: {"diffusion_steps":8,"min_prefix":16}
other
Learned timestep embeddings and learned role embeddings for prefix versus diffused suffix.
parameters: null
other
Approximate prefix-conditioned diffusion pseudo-log-likelihood evaluation by masking the suffix and scoring the first masked token.
parameters: {"metric":"diffusion_pll_bpb"}
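The core corruption step described above (absorbing-mask noise on the suffix only, clean prefix kept as conditioning, loss restricted to corrupted positions) can be sketched like this. `corrupt_suffix`, `MASK_ID`, and the linear step/DIFFUSION_STEPS masking schedule are assumptions for illustration; only `diffusion_steps=8` and `min_prefix=16` come from the card's parameters:

```python
import random

MASK_ID = 0           # hypothetical id of the absorbing [MASK] token
DIFFUSION_STEPS = 8   # from parameters: diffusion_steps
MIN_PREFIX = 16       # from parameters: min_prefix

def corrupt_suffix(tokens, step, prefix_len, rng=random):
    """Absorbing-mask corruption on the suffix only.

    The first prefix_len tokens stay clean (the conditioning prefix).
    Each suffix token is independently replaced by MASK_ID with
    probability step / DIFFUSION_STEPS (a common absorbing-state
    schedule; the PR's exact schedule is not stated on the card).
    Returns the corrupted sequence and a per-position loss mask that is
    True only where a token was corrupted, so the denoising loss is
    computed only on corrupted suffix positions.
    """
    assert prefix_len >= MIN_PREFIX
    p = step / DIFFUSION_STEPS
    corrupted, loss_mask = list(tokens), [False] * len(tokens)
    for i in range(prefix_len, len(tokens)):
        if rng.random() < p:
            corrupted[i] = MASK_ID
            loss_mask[i] = True
    return corrupted, loss_mask
```

At `step == DIFFUSION_STEPS` the entire suffix is absorbed into the mask state, which is the fully-noised endpoint the denoiser learns to invert.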
Compression
zlib
level: null
Novel Contributions
- True discrete diffusion model for text rather than an autoregressive model with a diffusion-inspired auxiliary loss
- Prefix-conditioned suffix diffusion with absorbing-mask corruption over suffix tokens only
- Denoising loss computed only on corrupted suffix positions
- Approximate prefix-conditioned diffusion PLL scoring metric
- Demonstration that a discrete diffusion submission can fit within the competition's artifact constraints
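The PLL scoring metric listed above ("masking the suffix and scoring the first masked token") can be sketched as follows. `diffusion_pll_bpb` and the `logprob_fn` hook are hypothetical names; the per-position masking-from-`pos`-onward interpretation follows the card's one-line description and may differ in detail from the PR:

```python
import math

def diffusion_pll_bpb(logprob_fn, tokens, token_bytes, prefix_len, mask_id=0):
    """Approximate prefix-conditioned diffusion PLL in bits per byte.

    logprob_fn(masked_seq, pos) is a hypothetical model hook returning
    the log-probability of the true token at pos given the masked
    sequence; token_bytes[i] is the UTF-8 byte length of token i.

    Each suffix position is scored as the FIRST masked token: everything
    from that position onward is replaced by mask_id and the model
    predicts the token at the mask boundary.
    """
    total_nll, total_bytes = 0.0, 0
    for pos in range(prefix_len, len(tokens)):
        masked = tokens[:pos] + [mask_id] * (len(tokens) - pos)
        total_nll += -logprob_fn(masked, pos)
        total_bytes += token_bytes[pos]
    return total_nll / (math.log(2) * total_bytes)
```

Dividing total negative log-likelihood by `ln(2) * bytes` converts nats per token into bits per byte, matching the `diffusion_pll_bpb=1.8587` headline number's units.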