Rainbow Padding: Mitigating Early Termination
in Instruction-Tuned Diffusion LLMs

Yonsei University

TL;DR: We introduce Rainbow Padding, a cyclic multi-token padding scheme
that eliminates early termination and restores length robustness in instruction-tuned diffusion LLMs.

Rainbow Padding

Prompt: [4-shot: 4 example problems with solutions are provided above]
Problem: Mr. Madoff invests 1000 dollars in a fund that compounds annually at a constant interest rate. After three years, his investment has grown to 1225 dollars. What is the annual interest rate, as a percentage?
Solution:
Let $r$ be the annual interest rate, expressed as a decimal. After three years, Mr. Madoff's investment has grown to $1000(1+r)^3=1225$. Dividing both sides by 1000, we get $(1+r)^3=1.225$. Taking the cube root of both sides, we get $1+r=\sqrt[3]{1.225}\approx 1.071$. Subtracting 1 from both sides, we get $r\approx 0.071$. Converting to a percentage, we get $r\approx 7.1\%$.
Final Answer: The final answer is $7.1$. I hope it is correct.

LLaDA Instruct

Prompt: [4-shot: 4 example problems with solutions are provided above]
Problem: Mr. Madoff invests 1000 dollars in a fund that compounds annually at a constant interest rate. After three years, his investment has grown to 1225 dollars. What is the annual interest rate, as a percentage?
Solution:
To find the annual interest rate, we can use the formula for compound interest.

Introduction

Discrete diffusion large language models (dLLMs) have recently emerged as a promising alternative to traditional autoregressive LLMs. Unlike autoregressive models, which generate strictly left to right, dLLMs allow tokens to be generated in any order. However, current instruction-tuned dLLMs suffer from a critical reliability issue: when users allocate a longer generation budget (max_length), these models often produce shorter responses, terminating early or degenerating into streams of <eos> tokens. We refer to this failure mode as <eos> overflow. To address it, we introduce Rainbow Padding, a simple yet effective modification to the padding scheme. Rather than repeating <eos> throughout the tail, we reserve a single <eos> to mark the true end of the sequence and fill the remainder with a cyclic palette of distinct padding tokens.

What is the problem we deal with?

Unlike autoregressive models, dLLMs operate on a fixed sequence length at every decoding step, so users must specify this length (max_length) before generation. Setting max_length too short can truncate responses, so one would expect that allocating a sufficiently large budget simply gives the model room to produce high-quality outputs.

Surprisingly, Figure 1 reveals the opposite: LLaDA-Instruct's performance drops substantially as max_length increases! The collapse is particularly severe on MATH, where accuracy falls from 17.1% to 2.9% as max_length grows from 128 to 256. This is puzzling: 256 tokens is modest by modern LLM standards, yet performance degrades dramatically. Why does this happen? Does a longer max_length cause the model to generate meaningless tokens?

[Figure: Overview results, Panel A]

We found the opposite: models actually produce shorter responses as max_length increases (Figure 2 (a)). The root cause is using <eos> tokens for both termination and padding during instruction tuning.

Figure 2 (b) shows how current dLLMs pad shorter sequences with <eos> tokens in training. This causes the model to learn inflated <eos> probabilities at later positions. Figure 2 (c) reveals the cascade effect: <eos> confidence approaches 1.0 after ~400 tokens. Under adaptive decoding strategies (confidence, margin, entropy), these high probabilities trigger early <eos> sampling at later positions, which then biases earlier positions (even 10+ tokens back) toward <eos> as well.

This creates a feedback loop: inflated <eos> probabilities → adaptive sampling commits <eos> early at late positions → the <eos> bias propagates backward to earlier positions → premature termination. We call this phenomenon <eos> overflow.
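
For intuition, the sketch below shows a single confidence-based unmasking step in the style of adaptive dLLM decoders. It is a simplified illustration rather than the official LLaDA sampler; the model interface, masking bookkeeping, and variable names are assumptions. The point is that positions where <eos> is near-certain win the confidence race and get committed first.

```python
import torch

def unmask_most_confident(model, tokens, masked, k=1):
    """One step of confidence-based adaptive decoding (illustrative sketch).

    `masked` is the set of still-masked positions. Positions are revealed in
    order of their top-1 probability, so if <eos> is near-certain at late
    positions (Figure 2 (c)), those positions are committed to <eos> first and
    pull the rest of the sequence toward premature termination.
    """
    with torch.no_grad():
        logits = model(tokens).logits                 # (1, L, vocab); HF-style output assumed
    conf, pred = torch.softmax(logits, dim=-1).max(dim=-1)   # both (1, L)
    conf = conf[0].clone()
    still_masked = torch.tensor([i in masked for i in range(conf.numel())])
    conf[~still_masked] = -1.0                        # never re-reveal committed positions
    for pos in conf.topk(k).indices.tolist():
        tokens[0, pos] = pred[0, pos]                 # often <eos> at late positions
        masked.discard(pos)
    return tokens, masked
```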

A Simple Remedy: Rainbow Padding

[Figure: the Rainbow Padding scheme]

To prevent probability mass from concentrating on a single token, we propose Rainbow Padding. The true end of the response is marked by a single <eos> token, while the remaining positions are filled with a cyclic sequence of K distinct padding tokens:

$$ \mathcal{P} = \{\langle \text{pad}_0\rangle,\ \langle \text{pad}_1\rangle,\ \ldots,\ \langle \text{pad}_{K-1}\rangle\}. $$
This distributes probability mass evenly across multiple padding tokens, so no individual token becomes overconfident. Under adaptive decoding strategies, no single padding token reaches a high enough probability to be selected early, which effectively mitigates early termination.
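
The sketch below shows one possible way to construct such a padded training sequence; the token ids, K = 4, and the helper name rainbow_pad are illustrative assumptions, not our exact implementation.

```python
EOS_ID = 2                                   # illustrative <eos> id
PAD_IDS = [32000, 32001, 32002, 32003]       # illustrative <pad_0> ... <pad_3>, K = 4

def rainbow_pad(response_ids, max_length):
    """Mark the true end with one <eos>, then cycle through the pad palette."""
    padded = list(response_ids) + [EOS_ID]
    i = 0
    while len(padded) < max_length:
        padded.append(PAD_IDS[i % len(PAD_IDS)])
        i += 1
    return padded[:max_length]

# rainbow_pad([11, 12, 13], 10)
# -> [11, 12, 13, <eos>, <pad_0>, <pad_1>, <pad_2>, <pad_3>, <pad_0>, <pad_1>]
```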

Experimental Results

[Table: LLaDA results]
[Table: Dream results]

We instruction-tuned pretrained models (LLaDA-Base, Dream-Base) with both standard <eos> padding and Rainbow Padding. Models trained with Rainbow Padding maintain stable performance across max_length values, consistently producing responses of appropriate length with high accuracy. Notably, LLaDA achieves strong performance without requiring heuristic block-wise decoding strategies.

Rainbow Padding also adapts efficiently to existing instruction-tuned models. We fine-tuned LLaDA-Instruct with LoRA for just one epoch on 0.5M public examples. As shown at right, the model quickly adapts to Rainbow Padding and resolves the overflow issue with minimal retraining, despite having originally been trained with <eos> padding. For details, refer to our paper.
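
A rough sketch of such an adaptation with Hugging Face transformers and peft is shown below; the checkpoint id, number of padding tokens, LoRA hyperparameters, and target module names are illustrative assumptions rather than our exact recipe.

```python
from transformers import AutoModel, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "GSAI-ML/LLaDA-8B-Instruct"       # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)

# Register K = 4 rainbow padding tokens and grow the embedding table to match.
new_pads = [f"<pad_{i}>" for i in range(4)]
tokenizer.add_special_tokens({"additional_special_tokens": new_pads})
model.resize_token_embeddings(len(tokenizer))

# Attach LoRA adapters; the model is then fine-tuned for one epoch on
# instruction data whose tails are re-padded with the rainbow scheme
# (training loop omitted here).
lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora)
```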


BibTeX