Discrete diffusion large language models (dLLMs) have recently emerged as a promising alternative to traditional autoregressive LLMs. Unlike autoregressive models, which generate strictly left-to-right, dLLMs allow tokens to be generated in any order.
However, current instruction-tuned dLLMs suffer from a critical reliability issue. When users allocate longer generation budgets (max_length), these models often produce shorter responses: terminating early or degenerating into streams of <eos> tokens. We refer to this failure mode as <eos> overflow.
To address this failure, we introduce Rainbow Padding, a simple yet effective modification to the padding scheme. Rather than repeating <eos> throughout the tail, we reserve a single <eos> to mark the true end of the sequence and fill the remainder with a cyclic palette of distinct padding tokens.
Unlike autoregressive models, dLLMs operate on a fixed sequence length at each decoding step, requiring users to specify this length (max_length) before generation. While setting max_length too short can truncate responses, one would expect that allocating sufficient tokens should yield high-quality outputs. Surprisingly, Figure 1 reveals the opposite: LLaDA-Instruct's performance drops substantially as max_length increases!
The performance collapse is particularly severe on MATH, dropping from 17.1% to 2.9% when max_length increases from 128 to 256.
This is puzzling: 256 tokens is modest by modern LLM standards, yet performance degrades dramatically.
Why does this happen? Does longer max_length cause the model to generate meaningless tokens? We found the opposite: models actually produce shorter responses as max_length increases (Figure 2 (a)).
The root cause is the use of <eos> tokens for both termination and padding during instruction tuning. Figure 2 (b) shows how current dLLMs pad shorter sequences with <eos> tokens during training. This causes the model to learn inflated <eos> probabilities at later positions.
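To make the conventional scheme concrete, here is a minimal sketch of how a short response is padded out to max_length with repeated <eos> tokens during instruction tuning. The token strings and lengths are illustrative, not the actual preprocessing code of LLaDA or Dream:

```python
# Illustrative sketch of conventional <eos> padding (token strings are made up;
# real pipelines operate on token ids from the model's tokenizer).
EOS = "<eos>"

def pad_with_eos(response_tokens: list[str], max_length: int) -> list[str]:
    """Pad a tokenized response to max_length by repeating <eos>.

    Every position past the true end is supervised to predict <eos>, so the
    model learns very high <eos> probabilities at late positions.
    """
    padded = response_tokens + [EOS]               # first <eos> marks the true end
    padded += [EOS] * (max_length - len(padded))   # tail is <eos>, <eos>, <eos>, ...
    return padded

# A 5-token answer in a 12-token training window:
print(pad_with_eos(["The", "answer", "is", "42", "."], max_length=12))
# ['The', 'answer', 'is', '42', '.', '<eos>', '<eos>', '<eos>', '<eos>', '<eos>', '<eos>', '<eos>']
```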
Figure 2 (c) reveals the cascade effect: <eos> confidence approaches 1.0 after ~400 tokens. Under adaptive decoding strategies (confidence, margin, entropy), these high probabilities trigger early <eos> sampling at later positions, which then biases earlier positions (even 10+ tokens back) toward <eos> as well. This creates a feedback loop: inflated <eos> probabilities → adaptive sampling selects <eos> early → backward propagation → premature termination. We call this phenomenon <eos> overflow.
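The selection step of this loop can be illustrated with a toy version of confidence-based decoding. The sketch below is not the LLaDA or Dream sampler: it freezes the per-position distributions (a real dLLM recomputes them after every step) and only shows that positions with inflated <eos> probability are the most confident ones and therefore get committed first:

```python
import numpy as np

# Toy illustration of confidence-based unmasking (not the actual LLaDA sampler).
# Distributions are frozen for clarity; a real dLLM re-predicts after each step.
rng = np.random.default_rng(0)
VOCAB, EOS_ID, LENGTH, PER_STEP = 50, 0, 16, 4

# Fake per-position distributions: positions 10+ put 0.99 mass on <eos>,
# mimicking the inflated tail probabilities learned from <eos> padding.
probs = rng.dirichlet(np.ones(VOCAB), size=LENGTH)
probs[10:] = 0.01 / (VOCAB - 1)
probs[10:, EOS_ID] = 0.99

masked = np.ones(LENGTH, dtype=bool)   # all positions start masked
tokens = np.full(LENGTH, -1)

for step in range(2):
    # Confidence = probability of the argmax token; commit the most confident
    # masked positions first, as confidence-based samplers do.
    confidence = np.where(masked, probs.max(axis=-1), -np.inf)
    chosen = np.argsort(confidence)[-PER_STEP:]
    tokens[chosen] = probs[chosen].argmax(axis=-1)
    masked[chosen] = False
    print(f"step {step}: committed positions {sorted(chosen.tolist())}")

# The <eos>-inflated tail (positions 10-15) is committed first; in a real model,
# those committed <eos> tokens then bias earlier positions toward <eos> as well.
```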
To prevent probability mass from concentrating on a single token, we propose Rainbow Padding. The true response end uses a single <eos> token, while the remaining positions are filled with a cyclic sequence of K distinct padding tokens.
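A minimal sketch of the scheme follows; the pad-token names and the choice of K below are illustrative, and the paper's exact tokens and K may differ:

```python
# Minimal sketch of Rainbow Padding (pad-token names and K are illustrative).
EOS = "<eos>"
K = 4
PALETTE = [f"<pad_{i}>" for i in range(K)]   # K distinct padding tokens

def rainbow_pad(response_tokens: list[str], max_length: int) -> list[str]:
    """Terminate with a single <eos>, then cycle through K distinct pad tokens."""
    padded = response_tokens + [EOS]                  # one <eos> marks the true end
    tail = max_length - len(padded)
    padded += [PALETTE[i % K] for i in range(tail)]   # cyclic "rainbow" tail
    return padded

# Same 5-token answer as above, now rainbow-padded to 12 tokens:
print(rainbow_pad(["The", "answer", "is", "42", "."], max_length=12))
# ['The', 'answer', 'is', '42', '.', '<eos>', '<pad_0>', '<pad_1>', '<pad_2>',
#  '<pad_3>', '<pad_0>', '<pad_1>']
```

Because no single token dominates the supervision signal at late positions, the probability mass that previously piled up on <eos> is spread across the palette, and <eos> is reserved for genuine termination.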
We instruction-tuned pretrained models (LLaDA-Base, Dream-Base) using both <eos> padding and Rainbow Padding. Models trained with Rainbow Padding maintain stable performance across different max_length values, consistently producing appropriate response lengths and high accuracy.
Notably, LLaDA achieves strong performance without requiring heuristic block-wise decoding strategies.
Rainbow Padding also adapts efficiently to existing instruction-tuned models. We fine-tuned LLaDA-Instruct using LoRA for just one epoch on 0.5M public examples. As shown at right, the model quickly adapts to Rainbow Padding and resolves the overflow issue with minimal retraining, despite being originally trained with <eos> padding.
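For reference, this adaptation step can be approximated with a standard PEFT LoRA setup. The sketch below uses assumed hyperparameters, assumed target-module names, and an assumed public checkpoint id; it is not the paper's exact recipe:

```python
# Hedged sketch of LoRA adaptation for an existing instruct model.
# Rank, alpha, dropout, and target_modules are assumptions, not the paper's values.
from transformers import AutoModel
from peft import LoraConfig, get_peft_model

base = AutoModel.from_pretrained(
    "GSAI-ML/LLaDA-8B-Instruct",   # assumed public checkpoint id
    trust_remote_code=True,
)

lora_config = LoraConfig(
    r=16,                          # low-rank dimension (assumed)
    lora_alpha=32,                 # scaling factor (assumed)
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections (assumed names)
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()

# The LoRA-wrapped model is then fine-tuned for one epoch on rainbow-padded
# instruction data before merging or serving with the adapters attached.
```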
For details, refer to our paper.