Rainbow Padding: Mitigating Early Termination
in Instruction-Tuned Diffusion LLMs

Yonsei University

TL;DR: We introduce Rainbow Padding, a cyclic multi-token padding scheme
that eliminates early termination and restores length robustness in instruction-tuned diffusion LLMs.

Rainbow Padding

Prompt: [4-shot: 4 example problems with solutions are provided above]
Problem: Mr. Madoff invests 1000 dollars in a fund that compounds annually at a constant interest rate. After three years, his investment has grown to 1225 dollars. What is the annual interest rate, as a percentage?
Solution:
Let $r$ be the annual interest rate, expressed as a decimal. After three years, Mr. Madoff's investment has grown to $1000(1+r)^3=1225$. Dividing both sides by 1000, we get $(1+r)^3=1.225$. Taking the cube root of both sides, we get $1+r=\sqrt[3]{1.225}\approx 1.071$. Subtracting 1 from both sides, we get $r\approx 0.071$. Converting to a percentage, we get $r\approx 7.1\%$.
Final Answer: The final answer is $7.1$. I hope it is correct.

LLaDA Instruct

Prompt: [4-shot: 4 example problems with solutions are provided above]
Problem: Mr. Madoff invests 1000 dollars in a fund that compounds annually at a constant interest rate. After three years, his investment has grown to 1225 dollars. What is the annual interest rate, as a percentage?
Solution:
To find the annual interest rate, we can use the formula for compound interest.

Introduction

Discrete diffusion large language models (dLLMs) have recently emerged as a promising alternative to traditional autoregressive LLMs. Unlike autoregressive models, which generate strictly left to right, dLLMs allow tokens to be generated in any order. However, current instruction-tuned dLLMs suffer from a critical reliability issue: when users allocate a longer generation budget (max_length), these models often produce shorter responses, terminating early or degenerating into streams of <eos> tokens. We refer to this failure mode as <eos> overflow. To address it, we introduce Rainbow Padding, a simple yet effective modification to the padding scheme. Rather than repeating <eos> throughout the tail, we reserve a single <eos> to mark the true end of the sequence and fill the remainder with a cyclic palette of distinct padding tokens.

What is the problem we deal with?

Unlike autoregressive models, dLLMs operate on a fixed sequence length at every decoding step, so users must specify this length (max_length) before generation. Setting max_length too short can truncate responses, so one would expect that allocating a sufficiently large budget simply gives the model room to produce high-quality outputs.

Surprisingly, Figure 1 reveals the opposite: LLaDA-Instruct's performance drops substantially as max_length increases! The collapse is particularly severe on MATH, where accuracy falls from 17.1% to 2.9% as max_length grows from 128 to 256. This is puzzling: 256 tokens is modest by modern LLM standards, yet performance degrades dramatically. Why does this happen? Does a longer max_length cause the model to generate meaningless tokens?

[Figure: Overview results, Panel A]

We found the opposite: models actually produce shorter responses as max_length increases (Figure 2 (a)). The root cause is using <eos> tokens for both termination and padding during instruction tuning.

Figure 2 (b) shows how current dLLMs pad shorter sequences with <eos> tokens in training. This causes the model to learn inflated <eos> probabilities at later positions. Figure 2 (c) reveals the cascade effect: <eos> confidence approaches 1.0 after ~400 tokens. Under adaptive decoding strategies (confidence, margin, entropy), these high probabilities trigger early <eos> sampling at later positions, which then biases earlier positions (even 10+ tokens back) toward <eos> as well.

This creates a feedback loop: inflated <eos> probabilities → adaptive sampling commits <eos> early at late positions → the <eos> bias propagates backward to earlier positions → premature termination. We call this phenomenon <eos> overflow.
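
For intuition, the sketch below shows a single confidence-based unmasking step in the style of adaptive dLLM decoders. It is a simplified illustration rather than the official LLaDA sampler; the model interface, masking bookkeeping, and variable names are assumptions. The point is that positions where <eos> is near-certain win the confidence race and get committed first.

```python
import torch

def unmask_most_confident(model, tokens, masked, k=1):
    """One step of confidence-based adaptive decoding (illustrative sketch).

    `masked` is the set of still-masked positions. Positions are revealed in
    order of their top-1 probability, so if <eos> is near-certain at late
    positions (Figure 2 (c)), those positions are committed to <eos> first and
    pull the rest of the sequence toward premature termination.
    """
    with torch.no_grad():
        logits = model(tokens).logits                 # (1, L, vocab); HF-style output assumed
    conf, pred = torch.softmax(logits, dim=-1).max(dim=-1)   # both (1, L)
    conf = conf[0].clone()
    still_masked = torch.tensor([i in masked for i in range(conf.numel())])
    conf[~still_masked] = -1.0                        # never re-reveal committed positions
    for pos in conf.topk(k).indices.tolist():
        tokens[0, pos] = pred[0, pos]                 # often <eos> at late positions
        masked.discard(pos)
    return tokens, masked
```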

A Simple Remedy: Rainbow Padding

[Figure: the Rainbow Padding scheme]

To prevent probability mass from concentrating on a single token, we propose Rainbow Padding. The true end of the response is marked by a single <eos> token, while the remaining positions are filled with a cyclic sequence of K distinct padding tokens:

$$ \mathcal{P} = \{\langle \text{pad}_0\rangle,\ \langle \text{pad}_1\rangle,\ \ldots,\ \langle \text{pad}_{K-1}\rangle\}. $$
This distributes probability mass evenly across multiple padding tokens, so no individual token becomes overconfident. Under adaptive decoding strategies, no single padding token reaches a high enough probability to be selected early, which effectively mitigates early termination.
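
The sketch below shows one possible way to construct such a padded training sequence; the token ids, K = 4, and the helper name rainbow_pad are illustrative assumptions, not our exact implementation.

```python
EOS_ID = 2                                   # illustrative <eos> id
PAD_IDS = [32000, 32001, 32002, 32003]       # illustrative <pad_0> ... <pad_3>, K = 4

def rainbow_pad(response_ids, max_length):
    """Mark the true end with one <eos>, then cycle through the pad palette."""
    padded = list(response_ids) + [EOS_ID]
    i = 0
    while len(padded) < max_length:
        padded.append(PAD_IDS[i % len(PAD_IDS)])
        i += 1
    return padded[:max_length]

# rainbow_pad([11, 12, 13], 10)
# -> [11, 12, 13, <eos>, <pad_0>, <pad_1>, <pad_2>, <pad_3>, <pad_0>, <pad_1>]
```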

Experimental Results

[Table: LLaDA results]
[Table: Dream results]

We instruction-tuned pretrained models (LLaDA-Base, Dream-Base) with both standard <eos> padding and Rainbow Padding. Models trained with Rainbow Padding maintain stable performance across max_length values, consistently producing responses of appropriate length with high accuracy. Notably, LLaDA achieves strong performance without requiring heuristic block-wise decoding strategies.

Rainbow Padding also adapts efficiently to existing instruction-tuned models. We fine-tuned LLaDA-Instruct with LoRA for just one epoch on 0.5M public examples. As shown at right, the model quickly adapts to Rainbow Padding and resolves the overflow issue with minimal retraining, despite having originally been trained with <eos> padding. For details, refer to our paper.
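
A rough sketch of such an adaptation with Hugging Face transformers and peft is shown below; the checkpoint id, number of padding tokens, LoRA hyperparameters, and target module names are illustrative assumptions rather than our exact recipe.

```python
from transformers import AutoModel, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "GSAI-ML/LLaDA-8B-Instruct"       # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)

# Register K = 4 rainbow padding tokens and grow the embedding table to match.
new_pads = [f"<pad_{i}>" for i in range(4)]
tokenizer.add_special_tokens({"additional_special_tokens": new_pads})
model.resize_token_embeddings(len(tokenizer))

# Attach LoRA adapters; the model is then fine-tuned for one epoch on
# instruction data whose tails are re-padded with the rainbow scheme
# (training loop omitted here).
lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora)
```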


BibTeX