Hallucination remains a major challenge for Large Vision-Language Models (LVLMs), also referred to as Multimodal Large Language Models (MLLMs). Direct Preference Optimization (DPO) has gained increasing attention as a simple solution to hallucination issues. It learns directly from constructed preference pairs that reflect the severity of hallucinations in responses to the same prompt and image. However, we observe that the different data construction methods used in existing works lead to notable performance variations, and we explain the underlying reasons for these variations from a theoretical perspective. In short, our contributions are as follows:
Motivation and Performance Summary:
To mitigate hallucinations in LVLMs (or MLLMs) with DPO, the most effective approach would be to have the model generate a response for a given prompt and image, and then have experts correct the hallucinations in the generated content to construct preference pairs. In practice, however, even when these corrections are minor, the corrected responses are often off-policy relative to the original model (i.e., they have extremely low sampling probabilities). We reveal a key point that is often neglected in existing works: an off-policy preferred response can never be learned by the model because of the implicit KL constraint. Based on this observation, we propose OPA-DPO, which aligns the constructed data on-policy before DPO training. Experimental results demonstrate that the OPA operation significantly improves performance and that OPA-DPO achieves state-of-the-art (SOTA) results with minimal data requirements.
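For reference, the KL-constrained objective from which DPO is derived, and the resulting DPO loss, are commonly written as below. The notation is an assumption of this sketch rather than something taken from the text above: \(x\) is the prompt, \(m\) the image, \(y_w\) and \(y_l\) the preferred and rejected responses, \(r\) the reward, \(\beta\) the KL coefficient, and \(\sigma\) the logistic function.

\[
\max_{\pi_\theta}\; \mathbb{E}_{(x,m)\sim\mathcal{D},\; y\sim\pi_\theta(\cdot\mid x,m)}\big[\,r(x,m,y)\,\big] \;-\; \beta\,\mathbb{D}_{\mathrm{KL}}\big[\,\pi_\theta(\cdot\mid x,m)\,\big\|\,\pi_{ref}(\cdot\mid x,m)\,\big]
\]

\[
\mathcal{L}_{\mathrm{DPO}} \;=\; -\,\mathbb{E}_{(x,m,y_w,y_l)\sim\mathcal{D}}\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x,m)}{\pi_{ref}(y_w\mid x,m)} \;-\; \beta\log\frac{\pi_\theta(y_l\mid x,m)}{\pi_{ref}(y_l\mid x,m)}\right)\right]
\]

Because the KL term is an expectation under \(\pi_\theta\), any probability mass that \(\pi_\theta\) places on a response with \(\pi_{ref}(y\mid x,m)\approx 0\) makes the penalty very large. This is the intuition behind the claim that an off-policy preferred response is effectively out of reach.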
 
        Demo of Kullback-Leibler (KL) Divergence:
Note that the objective from which DPO is derived maximizes the reward learned under the Bradley-Terry model while constraining the KL divergence between the current policy and the reference policy. To provide an intuitive understanding of the importance of on-policy data, we visualize the KL divergence between the current policy (\(\,\pi_\theta\,\)) and the reference policy (\(\,\pi_{ref}\,\)). You can drag the sliders to adjust the mean and variance of the current policy and observe how the KL divergence changes accordingly. In summary, the KL divergence becomes very large if the current policy generates tokens that the reference policy almost never produces.
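The same effect can be reproduced numerically. The sketch below assumes, purely for illustration, that both policies are one-dimensional Gaussians (the function name `gaussian_kl` and the chosen parameter values are ours, not part of the demo) and uses the closed-form KL divergence between two normal distributions.

```python
import math


def gaussian_kl(mu_p, sigma_p, mu_q, sigma_q):
    """KL( N(mu_p, sigma_p^2) || N(mu_q, sigma_q^2) ) in nats (closed form).

    Here p plays the role of the current policy and q the reference policy.
    Modelling the policies as 1-D Gaussians is an illustrative assumption.
    """
    if sigma_p <= 0 or sigma_q <= 0:
        raise ValueError("standard deviations must be positive")
    return (
        math.log(sigma_q / sigma_p)
        + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2 * sigma_q ** 2)
        - 0.5
    )


if __name__ == "__main__":
    # Reference policy fixed at N(0, 1); shift the mean of the current policy.
    for mu in (0.0, 1.0, 3.0, 6.0):
        kl = gaussian_kl(mu, 1.0, 0.0, 1.0)
        print(f"mean shift {mu:>4}: KL = {kl:.3f}")
```

With equal variances the divergence grows quadratically in the mean shift, so a current policy concentrated where the reference policy has almost no mass incurs a large penalty.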
We categorize existing DPO-based algorithms for addressing hallucination issues in LVLMs into 3 classes:
 
          Our proposed OPA-DPO comprises four essential steps:
\(\quad\)Step 1: Collect responses from the original policy based on the images and corresponding prompts.
\(\quad\)Step 2: Utilize GPT-4V to correct any hallucinations in the generated responses with minimal modifications.
\(\quad\)Step 3: Conduct LoRA-SFT on the ground-truth (GT) responses and the revised responses.
\(\quad\)Step 4: Initiate OPA-DPO training from the policy obtained in Step 3. Note that we construct extra image-focused and anchored preference pairs following mDPO.
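The data flow of the four steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the released implementation: `generate` and `revise` are hypothetical callables standing in for sampling from the original policy and for the GPT-4V correction, the choice of the revised response as "chosen" and the original as "rejected" is our reading of the text, and the LoRA-SFT and DPO training loops themselves (as well as the image-focused and anchored pairs) are left out.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Sample:
    image: str        # image path or identifier
    prompt: str
    gt_response: str  # ground-truth response from the dataset


@dataclass
class PreferencePair:
    image: str
    prompt: str
    chosen: str
    rejected: str


def build_opa_dpo_data(
    samples: List[Sample],
    generate: Callable[[str, str], str],     # hypothetical: (image, prompt) -> response
    revise: Callable[[str, str, str], str],  # hypothetical: (image, prompt, response) -> revision
) -> Tuple[List[Tuple[str, str, str]], List[PreferencePair]]:
    """Return (sft_data, preference_pairs) for Steps 3 and 4."""
    sft_data: List[Tuple[str, str, str]] = []
    pairs: List[PreferencePair] = []
    for s in samples:
        original = generate(s.image, s.prompt)         # Step 1: on-policy response
        revised = revise(s.image, s.prompt, original)  # Step 2: minimal correction
        # Step 3 data: both GT and revised responses serve as SFT targets.
        sft_data.append((s.image, s.prompt, s.gt_response))
        sft_data.append((s.image, s.prompt, revised))
        # Step 4 data (assumption): revised is preferred over the original.
        # Skipping unchanged responses is also an assumption of this sketch.
        if revised != original:
            pairs.append(PreferencePair(s.image, s.prompt, revised, original))
    return sft_data, pairs
```

Passing the model and the reviser in as callables keeps the sketch independent of any particular LVLM or API client.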
 
          Hallucination Bench Evaluations:
\(\quad\)AMBER: A benchmark with detailed object annotations, featuring 1004 images in a generative task. Using the official codebase, we evaluate CHAIR score, object coverage, hallucination rate, and alignment with human cognition.
\(\quad\)MMHalBench: A question-answering benchmark with 96 images across 12 object categories. Following the official protocol, we use GPT-4 to rate responses from zero to six and compute the hallucination rate as the proportion of responses rated below three.
\(\quad\)Object HalBench: A widely used benchmark for assessing object hallucination. We evaluate across 300 instances using the RLHF-V codebase, reporting hallucination rates at both the response level (CHAIRs) and the object level (CHAIRi).
\(\quad\)POPE: A yes/no question-answering benchmark for object hallucination evaluation. We report accuracy and precision on its Adversarial set, consisting of 3000 cases.
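Several of the metrics above reduce to simple counting. The sketch below is a simplified illustration, not the official evaluation code of any of these benchmarks: all function names are ours, the CHAIR computation treats each response as a set of already-extracted object names (the real pipelines also handle synonym mapping and object extraction), and POPE precision is computed with "yes" as the positive class.

```python
from typing import List, Sequence, Set, Tuple


def mmhal_hallucination_rate(ratings: Sequence[int]) -> float:
    """Fraction of responses whose 0-6 rating is below three."""
    return sum(r < 3 for r in ratings) / len(ratings)


def chair_scores(
    mentioned: List[Set[str]], present: List[Set[str]]
) -> Tuple[float, float]:
    """Return (CHAIRs, CHAIRi).

    CHAIRs: share of responses that mention at least one absent object.
    CHAIRi: share of mentioned objects that are absent from the image.
    """
    halluc_responses = 0
    halluc_objects = 0
    total_objects = 0
    for said, truth in zip(mentioned, present):
        wrong = said - truth  # objects mentioned but not in the image
        halluc_responses += bool(wrong)
        halluc_objects += len(wrong)
        total_objects += len(said)
    chair_s = halluc_responses / len(mentioned)
    chair_i = halluc_objects / total_objects if total_objects else 0.0
    return chair_s, chair_i


def pope_accuracy_precision(
    preds: Sequence[bool], labels: Sequence[bool]
) -> Tuple[float, float]:
    """Accuracy and precision, where True means the answer 'yes'."""
    tp = sum(p and l for p, l in zip(preds, labels))
    fp = sum(p and not l for p, l in zip(preds, labels))
    correct = sum(p == l for p, l in zip(preds, labels))
    accuracy = correct / len(labels)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return accuracy, precision
```

For example, `mmhal_hallucination_rate([0, 2, 3, 6])` counts two of the four ratings as hallucinated.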
 
              Ablation Studies:
 
           
           
           
          
        @article{yang2025opadpo,
            title={Mitigating Hallucinations in Large Vision-Language Models via DPO: On-Policy Data Hold the Key},
            author={Yang, Zhihe and Luo, Xufang and Han, Dongqi and Xu, Yunjian and Li, Dongsheng},
            journal={arXiv preprint arXiv:2501.09695},
            year={2025}
          }