* Equal contribution. † Corresponding author.
Diffusion large language models (dLLMs) enable any-order generation, but this flexibility enlarges the attack surface:
harmful spans may appear at arbitrary positions, and template-based prefilling attacks (e.g., DIJA) can bypass
response-level refusals. We introduce A2D (Any-Order, Any-Step Defense), a token-level alignment method that
trains dLLMs to emit an [EOS] refusal signal whenever harmful content arises. By aligning safety directly at
masked token positions under randomized masking, A2D remains robust across any decoding order and any step
of diffusion generation. On safety benchmarks, A2D reduces DIJA success rates from >80% to near-zero
(e.g., 1.3% on LLaDA-8B-Instruct; 0.0% on Dream-v0-Instruct-7B) and enables early rejection with up to
19.3× faster safe termination.
Disclaimer: This project discusses adversarial prompting and unsafe requests for the purpose of safety research. We do not provide harmful instructions.
[EOS][EOS] to suppress unsafe continuations at any position.P([EOS]) as a real-time safety signal and fast refusal trigger.
Figure: A2D overview. Token-level alignment replaces harmful masked targets with [EOS] so the model can refuse even under template-based prefilling.
Diffusion decoding differs from autoregressive decoding: instead of generating strictly left-to-right, dLLMs fill masked tokens over multiple steps, and the next token position can be chosen flexibly at runtime. This flexibility is a strength—but it also creates new safety failure modes.
[MASK] slots to force the model to “complete the blanks” after refusals fade.
Figure: Vulnerability illustration / shallow alignment
[EOS]
A2D modifies the standard masked diffusion training objective with a single, targeted change:
for harmful completions, masked tokens inside harmful spans are supervised to output [EOS]
instead of reconstructing the original tokens. For retain data (safe completions and safe answers to harmful-looking prompts),
masked tokens are trained to reconstruct the original tokens as usual.
This makes [EOS] a universal “suppression token” that can appear at any position during generation,
so refusals are no longer tied to the start of the response. Inference can also monitor P([EOS])
across steps to detect unsafe drifts and terminate mid-flight.
[EOS]; otherwise → target original token.
Figure: A2D training algorithm. Same diffusion training “shape”, different targets for harmful spans.
We evaluate jailbreak robustness across both black-box attacks (Zeroshot, PAIR, ReNeLLM) and white-box attacks (Prefilling, DIJA). A2D consistently achieves the lowest average attack success rate (ASR) across multiple instruction-tuned dLLMs, while preserving general capability (general knowledge, math, coding).
Notably, A2D drives DIJA ASR to near-zero across models (e.g., 1.3% on LLaDA, 3.8% on LLaDA-1.5, and 0.0% on Dream), showing that token-level suppression remains effective even when attacks manipulate intermediate masked slots.
Figure: Comprehensive results (capability vs. harmfulness).
Figure: Any-order robustness across decoding strategies (left-to-right, right-to-left, confidence, entropy, random).
A2D provides a real-time safety signal via the model’s probability mass on [EOS].
In practice, we can use P([EOS]) at the leftmost masked token in the first decoding step as an early rejection indicator:
if it exceeds a threshold τ, the model halts without generating a full response.
This yields large speedups for refusing harmful prompts (up to 19.3× faster termination) while maintaining high compliance on benign prompts.
Figure: Early rejection trade-off: speedup vs. benign compliance as a function of threshold τ.
@article{jeung2025a2d,
title={A2d: Any-order, any-step safety alignment for diffusion language models},
author={Jeung, Wonje and Yoon, Sangyeon and Cho, Yoonjun and Jeon, Dongjae and Shin, Sangwoo and Hong, Hyesoo and No, Albert},
journal={arXiv preprint arXiv:2509.23286},
year={2025}
}