[CVPR2025] OPA-DPO

Mitigating Hallucinations in Large Vision-Language Models via DPO: On-Policy Data Hold the Key


1. The Chinese University of Hong Kong, Hong Kong SAR, China.
2. Microsoft Research Asia, Shanghai, China.
3. The Chinese University of Hong Kong, Shenzhen Research Institute (SZRI), Guangdong, China.
*Corresponding authors


Abstract

Hallucination remains a major challenge for Large Vision-Language Models (LVLMs), also referred to as Multimodal Large Language Models (MLLMs). Direct Preference Optimization (DPO) has gained increasing attention as a simple solution to hallucination issues: it learns directly from constructed preference pairs that reflect the severity of hallucinations in responses to the same prompt and image. Nonetheless, we observe that the different data-construction methods used in existing works lead to notable performance variations, and we uncover the underlying reasons for these variations from a theoretical perspective. In short, our contributions are as follows:

  • Systematic Review: We present a comprehensive review of existing DPO-based algorithms for mitigating hallucination issues and highlight their limitations.
  • Crucial Factor Identification: From a theoretical perspective, we identify that outcomes are highly dependent on whether the constructed data is aligned on-policy with the initial (reference) policy of DPO.
  • SOTA Algorithm: Addressing the limitations of existing methods, we propose a simple yet highly effective algorithm, On-Policy Alignment (OPA)-DPO, which achieves SOTA performance with only 4.8k training samples, whereas the previous SOTA method requires 16k.




Introduction

Motivation and Performance Summary:

To mitigate hallucinations in LVLMs (or MLLMs) with DPO, the most effective approach would be to have the model generate a response for a given prompt and image, and then have experts correct the hallucinations in the generated content to construct preference pairs. In practice, however, even when these corrections are minor, the corrected responses are often off-policy relative to the original model (i.e., they have extremely low sampling probabilities). We reveal a key point that is often neglected in existing works: an off-policy preferred response can NEVER be effectively learned by the model due to the implicit KL constraint. Based on this observation, we propose OPA-DPO, which aligns the constructed data on-policy before DPO training. Experimental results demonstrate that the OPA operation significantly improves performance and that OPA-DPO achieves SOTA results with minimal data requirements.
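To see why expert-corrected responses become off-policy, consider the sequence probability under the reference policy. The sketch below uses illustrative per-token probabilities (not values from the paper): even a handful of edited tokens that the reference policy almost never emits makes the whole corrected sequence astronomically unlikely.

```python
import math

# Hypothetical per-token probabilities under the original (reference) policy.
# A model-generated response: every token is high-probability under pi_ref.
on_policy = [0.9] * 50
# An expert-corrected response: three edited tokens that pi_ref rarely emits.
off_policy = [0.9] * 47 + [1e-4] * 3

def log_prob(token_probs):
    """Sequence log-probability = sum of per-token log-probabilities."""
    return sum(math.log(p) for p in token_probs)

print(f"on-policy  log-prob: {log_prob(on_policy):8.2f}")
print(f"off-policy log-prob: {log_prob(off_policy):8.2f}")
# Three edited tokens shift the sequence log-probability by roughly -27 nats,
# i.e. the corrected response is about e^27 times less likely under pi_ref
# than the raw generation, despite the edit being "minor" at the text level.
```

Under DPO's implicit KL constraint to the reference policy, moving substantial probability mass onto such a low-probability sequence is effectively impossible; this is the gap the OPA step closes.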

Summary of OPA-DPO



Demo of Kullback-Leibler (KL) Divergence:

Note that the original training objective of DPO is to maximize the reward model induced by the Bradley-Terry model while constraining the KL divergence between the current policy and the reference policy. To provide an intuitive understanding of why on-policy data matter, we visualize the KL divergence between the current policy (πθ) and the reference policy (πref). You can drag the sliders to adjust the mean and variance of the current policy and observe how the KL divergence changes accordingly. In short, the KL divergence grows dramatically whenever the current policy generates tokens that the reference policy almost never produces.

[Interactive demo: probability densities of πθ and πref; drag the sliders for the mean and variance of πθ to see KL(πθ||πref) update (e.g., 0.792 in the default configuration).]




Review on Related Algorithms

We categorize existing DPO-based algorithms for addressing hallucination issues in LVLMs into three classes:

  • Hallucination Injection: (POVID, and HALVA). The ground-truth response is preferred, while the rejected response contains injected hallucinations. Since the errors do not originate from the model itself, the policy is unlikely to benefit from training.
  • Hallucination Recognition: (RLHF-V, HA-DPO and HSA-DPO). The model generates responses, after which experts (AI or human) identify errors and make revisions. The off-policy nature of the revised responses makes them challenging to learn effectively.
  • Self Evolution: (RLAIF-V). Both preferred and rejected responses are generated by the initial policy. A superior model assesses hallucinations, preferring the response with fewer errors. However, hallucinations may exist in both responses, which reduces learning efficiency.
Summary of Related Algorithms




On-Policy Alignment (OPA)-DPO

Our proposed OPA-DPO comprises four essential steps:

Step 1: Collect responses from the original policy based on the images and corresponding prompts.

Step 2: Utilize GPT-4V to correct any hallucinations in the generated responses with minimal modifications.

Step 3: Perform LoRA-based supervised fine-tuning (LoRA-SFT) on both the ground-truth (GT) and revised responses.

Step 4: Initiate OPA-DPO training from the policy obtained in step 3. Note that we construct extra image-focused and anchored preference pairs following mDPO.
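The four steps above can be summarized as the following pseudocode sketch; all helper names (`gpt4v_revise`, `lora_sft`, `dpo_train`, `gt_and_revised`, `build_preference_pairs`) are hypothetical placeholders, not the released code's API.

```python
def opa_dpo(policy, dataset):
    # Step 1: sample responses from the original policy.
    samples = [(img, prompt, policy.generate(img, prompt))
               for img, prompt, gt in dataset]

    # Step 2: GPT-4V corrects hallucinations with minimal modifications.
    revised = [(img, prompt, gpt4v_revise(img, prompt, resp))
               for img, prompt, resp in samples]

    # Step 3: LoRA-SFT on GT and revised responses, so the corrected
    # data becomes (approximately) on-policy for the next stage.
    opa_policy = lora_sft(policy, gt_and_revised(dataset, revised))

    # Step 4: DPO from the OPA policy, with extra image-focused and
    # anchored preference pairs following mDPO.
    return dpo_train(opa_policy, build_preference_pairs(samples, revised))
```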

Implementation of OPA-DPO




Experimental Results

Hallucination Bench Evaluations:

AMBER: A benchmark with detailed object annotations, featuring 1004 images in a generative task. Using the official codebase, we evaluate CHAIR score, object coverage, hallucination rate, and alignment with human cognition.

MMHalBench: A question-answering benchmark with 96 images across 12 object categories. Following the official protocol, we use GPT-4 to rate responses from zero to six, calculating hallucination rate by the proportion of responses rated below three.

Object HalBench: A widely used benchmark for assessing object hallucination. We evaluate across 300 instances using the RLHF-V codebase, reporting hallucination rates at both response (CHAIRs) and object levels (CHAIRi).
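The two CHAIR metrics reported above can be sketched as follows. This is our own minimal reimplementation of the standard definitions (response-level CHAIRs: fraction of responses with at least one hallucinated object; object-level CHAIRi: fraction of all mentioned objects that are hallucinated), with toy data rather than real benchmark annotations.

```python
def chair_metrics(responses):
    """responses: list of (mentioned_objects, ground_truth_object_set) pairs.

    Returns (CHAIRs, CHAIRi):
      CHAIRs = responses containing >=1 hallucinated object / all responses
      CHAIRi = hallucinated object mentions / all object mentions
    """
    halluc_resp = halluc_obj = total_obj = 0
    for mentioned, gt in responses:
        fake = [obj for obj in mentioned if obj not in gt]
        halluc_resp += bool(fake)   # response-level: any hallucination at all
        halluc_obj += len(fake)     # object-level: count every bad mention
        total_obj += len(mentioned)
    return halluc_resp / len(responses), halluc_obj / total_obj

# Toy example (not real benchmark data):
data = [
    (["dog", "frisbee"], {"dog", "frisbee", "grass"}),  # no hallucination
    (["cat", "sofa", "lamp"], {"cat", "sofa"}),         # "lamp" hallucinated
]
chair_s, chair_i = chair_metrics(data)
print(f"CHAIRs = {chair_s:.0%}, CHAIRi = {chair_i:.0%}")  # 50% and 20%
```

Lower is better for both metrics; a model that mentions fewer objects can trivially lower them, which is why object coverage is reported alongside CHAIR on AMBER.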

POPE: A yes/no question-answering benchmark for object hallucination evaluation. We report accuracy and precision on its Adversarial set, consisting of 3000 cases.

Experimental Results


Ablation Studies:

Ablation Study


Qualitative Examples

  • Image description tasks: OPA-DPO helps to significantly reduce hallucination issues. Nevertheless, we have observed that models trained with the OPA-DPO framework tend to adopt a slightly conservative strategy, often disregarding some insignificant details.


  • False premise queries: An interesting phenomenon we observed is that LVLMs consistently exhibit hallucinations when presented with queries based on false premises. These queries include objects or details that either do not exist in the image or are irrelevant to it. OPA-DPO can partially mitigate this issue.


BibTeX


@article{yang2025opadpo,
  title={Mitigating Hallucinations in Large Vision-Language Models via DPO: On-Policy Data Hold the Key},
  author={Yang, Zhihe and Luo, Xufang and Han, Dongqi and Xu, Yunjian and Li, Dongsheng},
  journal={arXiv preprint arXiv:2501.09695},
  year={2025}
}

Acknowledgement

We would like to express our gratitude for the code snippets provided in LLaVA, LLaVA-RLHF, FastChat, TRL, and datasets provided in RLAIF-V. These resources have significantly contributed to the development of our project.