
A2D: Any-Order, Any-Step Safety Alignment for Diffusion Language Models

Yonsei University (Department of Artificial Intelligence)1,
Yonsei University (Department of Computer Science & Engineering)2

* Equal contribution.   † Corresponding author.

Abstract

Diffusion large language models (dLLMs) enable any-order generation, but this flexibility enlarges the attack surface: harmful spans may appear at arbitrary positions, and template-based prefilling attacks (e.g., DIJA) can bypass response-level refusals. We introduce A2D (Any-Order, Any-Step Defense), a token-level alignment method that trains dLLMs to emit an [EOS] refusal signal whenever harmful content arises. By aligning safety directly at masked token positions under randomized masking, A2D remains robust across any decoding order and any step of diffusion generation. On safety benchmarks, A2D reduces DIJA success rates from >80% to near-zero (e.g., 1.3% on LLaDA-8B-Instruct; 0.0% on Dream-v0-Instruct-7B) and enables early rejection with up to 19.3× faster safe termination.

Disclaimer: This project discusses adversarial prompting and unsafe requests for the purpose of safety research. We do not provide harmful instructions.

  • Key idea — Token-level [EOS]: replace targets in harmful spans with [EOS] to suppress unsafe continuations at any position.
  • Robustness — Any-order / any-step: safety holds under multiple decoding strategies and late-stage injections.
  • Practical benefit — Early rejection: use P([EOS]) as a real-time safety signal and fast refusal trigger.

Why A2D?

A2D overview figure placeholder

Figure: A2D overview. Token-level alignment replaces harmful masked targets with [EOS] so the model can refuse even under template-based prefilling.

Diffusion decoding differs from autoregressive decoding: instead of generating strictly left-to-right, dLLMs fill masked tokens over multiple steps, and the next token position can be chosen flexibly at runtime. This flexibility is a strength—but it also creates new safety failure modes.

  • Any-order vulnerability: harmful spans can be generated at arbitrary positions because token order is not fixed.
  • Any-step vulnerability: even if a model refuses early, safety signals can decay over steps (“shallow alignment”), enabling late-stage injections.
  • Template-based prefilling (DIJA): attackers interleave harmful text and [MASK] slots to force the model to “complete the blanks” after refusals fade.
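To make these failure modes concrete, here is a minimal sketch of any-order masked decoding. The `predict` helper and confidence-based position selection are illustrative assumptions, not any specific dLLM's API:

```python
MASK = "[MASK]"  # illustrative mask token

def any_order_decode(seq, predict, steps):
    """Fill masked positions over `steps` rounds, choosing the
    highest-confidence masked position each round (one common
    dLLM decoding strategy among several)."""
    seq = list(seq)
    for _ in range(steps):
        masked = [i for i, t in enumerate(seq) if t == MASK]
        if not masked:
            break
        # predict(seq, i) -> (token, confidence) at position i (assumed helper)
        scored = [(i, *predict(seq, i)) for i in masked]
        i, tok, _ = max(scored, key=lambda x: x[2])
        seq[i] = tok
    return seq
```

Because the fill order depends on runtime confidences, a refusal emitted at one position guarantees nothing about positions filled later, which is exactly what template-based prefilling exploits.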
Vulnerability figure placeholder

Figure: Vulnerability illustration / shallow alignment

Method: Token-Level Alignment with [EOS]

A2D modifies the standard masked diffusion training objective with a single, targeted change: for harmful completions, masked tokens inside harmful spans are supervised to output [EOS] instead of reconstructing the original tokens. For retain data (safe completions and safe answers to harmful-looking prompts), masked tokens are trained to reconstruct the original tokens as usual.

This makes [EOS] a universal “suppression token” that can appear at any position during generation, so refusals are no longer tied to the start of the response. Inference can also monitor P([EOS]) across steps to detect unsafe drift and terminate generation mid-flight.

  • Training data: Harmful set (unsafe prompt + unsafe response) and Retain set (safe responses, including benign responses to suspicious prompts).
  • Random masking: sample a timestep (equivalently, a mask ratio) so the model sees both early and late decoding stages during training.
  • Target rule: if a masked token belongs to a harmful span → target [EOS]; otherwise → target original token.
  • Outcome: robust suppression under diverse partial contexts and decoding strategies.
A2D algorithm placeholder

Figure: A2D training algorithm. Same diffusion training “shape”, different targets for harmful spans.
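The target rule above can be sketched in a few lines. This is a hypothetical reconstruction, not the authors' released code; `MASK_ID`, `EOS_ID`, and the `-100` ignore-index convention for unmasked positions are assumptions:

```python
import random

MASK_ID = -1   # placeholder id for the diffusion [MASK] token (assumed)
EOS_ID = 2     # placeholder id for [EOS] (assumed)

def a2d_targets(tokens, harmful_span, mask_ratio=None):
    """Build (inputs, targets) for one training example.

    tokens:       list of token ids for the full response
    harmful_span: set of positions inside a harmful span
                  (empty for retain data)
    """
    if mask_ratio is None:
        # Random timestep: the model sees both early and late decoding stages.
        mask_ratio = random.random()
    inputs, targets = [], []
    for pos, tok in enumerate(tokens):
        if random.random() < mask_ratio:
            inputs.append(MASK_ID)
            # Target rule: harmful masked positions learn [EOS];
            # all other masked positions reconstruct the original token.
            targets.append(EOS_ID if pos in harmful_span else tok)
        else:
            inputs.append(tok)
            targets.append(-100)  # unmasked positions contribute no loss
    return inputs, targets
```

Note the training loop keeps the usual masked-diffusion shape; only the targets inside harmful spans change.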

Experimental Results

Safety Robustness

We evaluate jailbreak robustness across both black-box attacks (Zeroshot, PAIR, ReNeLLM) and white-box attacks (Prefilling, DIJA). A2D consistently achieves the lowest average attack success rate (ASR) across multiple instruction-tuned dLLMs, while preserving general capability (general knowledge, math, coding).

Notably, A2D drives DIJA ASR to near-zero across models (e.g., 1.3% on LLaDA, 3.8% on LLaDA-1.5, and 0.0% on Dream), showing that token-level suppression remains effective even when attacks manipulate intermediate masked slots.

Main results table placeholder

Figure: Comprehensive results (capability vs. harmfulness).

Decoding strategy robustness placeholder

Figure: Any-order robustness across decoding strategies (left-to-right, right-to-left, confidence, entropy, random).


Efficiency / Early Rejection

A2D provides a real-time safety signal via the model’s probability mass on [EOS]. In practice, we can use P([EOS]) at the leftmost masked token in the first decoding step as an early rejection indicator: if it exceeds a threshold τ, the model halts without generating a full response.

This yields large speedups for refusing harmful prompts (up to 19.3× faster termination) while maintaining high compliance on benign prompts.
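A minimal sketch of the thresholding step, assuming access to the logits at the leftmost masked position of the first decoding step (`EOS_ID` and the plain-Python softmax are illustrative):

```python
import math

EOS_ID = 2  # illustrative id for [EOS]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def early_reject(first_step_logits, tau=0.5):
    """Return True if P([EOS]) at the leftmost masked token of the
    first decoding step exceeds the threshold tau, halting generation
    before a full response is produced."""
    p_eos = softmax(first_step_logits)[EOS_ID]
    return p_eos > tau
```

The choice of τ trades off refusal speed against benign compliance, which is the trade-off the figure below sweeps.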

Early rejection results placeholder

Figure: Early rejection trade-off: speedup vs. benign compliance as a function of threshold τ.

BibTeX

@article{jeung2025a2d,
  title={{A2D}: Any-Order, Any-Step Safety Alignment for Diffusion Language Models},
  author={Jeung, Wonje and Yoon, Sangyeon and Cho, Yoonjun and Jeon, Dongjae and Shin, Sangwoo and Hong, Hyesoo and No, Albert},
  journal={arXiv preprint arXiv:2509.23286},
  year={2025}
}