Hallucination remains a major challenge for Large Vision-Language Models (LVLMs), also referred to as Multimodal Large Language Models (MLLMs). Direct Preference Optimization (DPO) has gained increasing attention as a simple solution to hallucination issues: it learns directly from constructed preference pairs that reflect the severity of hallucination in responses to the same prompt and image. However, we observed that the different data-construction methods used in existing works lead to notable performance variations, and we uncover the underlying reasons for these variations from a theoretical perspective. In short, our contributions are as follows:
Motivation and Performance Summary:
To mitigate hallucinations in LVLMs (or MLLMs) with DPO, the most effective approach would be to have the model generate a response for a given prompt and image, and then have experts correct the hallucinations in the generated content to construct preference pairs. In practice, however, even when these corrections are minor, the corrected responses are often off-policy relative to the original model (i.e., they have extremely low sampling probabilities). We reveal a key point that is often neglected in existing works: an off-policy preferred response can NEVER be effectively learned by the model due to the implicit KL constraint. Based on this observation, we propose OPA-DPO, which aligns the constructed data on-policy before DPO training. Experimental results demonstrate that OPA-DPO significantly improves performance by incorporating the OPA operation and achieves SOTA results with minimal data requirements.
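To make the role of the implicit KL constraint concrete, here is a minimal sketch of the standard DPO loss for a single preference pair, written with plain sequence log-probabilities (the function name and toy numbers are illustrative, not from our codebase). When the preferred response is far off-policy, its log-ratio against the reference stays near zero, so the loss gradient has little leverage to raise its probability:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one preference pair.

    logp_w / logp_l       : summed log-probs of the chosen (w) and
                            rejected (l) responses under the current policy.
    ref_logp_w / ref_logp_l : the same quantities under the frozen
                              reference policy.
    """
    # Implicit rewards are beta-scaled log-ratios against the reference.
    r_w = beta * (logp_w - ref_logp_w)
    r_l = beta * (logp_l - ref_logp_l)
    # -log sigmoid(r_w - r_l): pushes the reward margin upward.
    return -math.log(1.0 / (1.0 + math.exp(-(r_w - r_l))))

# An off-policy preferred response: extremely low log-prob under both the
# current policy and the reference, so both log-ratios are zero at
# initialization and the loss starts at -log(0.5) = log(2).
loss = dpo_loss(logp_w=-120.0, logp_l=-20.0,
                ref_logp_w=-120.0, ref_logp_l=-20.0)
```

Since the policy is initialized from the reference, both implicit rewards start at zero regardless of how improbable the preferred response is, which illustrates why aligning the data on-policy first (the OPA step) matters.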
Demo of Kullback-Leibler (KL) Divergence:
Note that the original training objective of DPO is to maximize the reward model induced by the Bradley-Terry model, while constraining the KL divergence between the current policy and the reference policy.
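For reference, the underlying RLHF objective and the DPO loss it induces (standard formulations, with $\beta$ the KL penalty coefficient and $\sigma$ the sigmoid) are:

```latex
% RLHF objective: maximize reward while staying close to the reference policy
\max_{\pi_\theta} \;
  \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}
  \big[ r(x, y) \big]
  - \beta \, \mathbb{D}_{\mathrm{KL}}\!\big[
      \pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x) \big]

% DPO loss on a preference pair (y_w preferred over y_l)
\mathcal{L}_{\mathrm{DPO}} =
  -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}
  \left[ \log \sigma\!\left(
    \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right) \right]
```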
To provide an intuitive understanding of the importance of on-policy data, we visualize the KL divergence between the current policy and the reference policy over the course of training.
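As a toy illustration of the quantity being visualized (not the actual plotting code from this repo), the KL divergence between two per-token distributions can be computed as follows; the three-token distributions are made-up values:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

policy = [0.7, 0.2, 0.1]  # current policy over a toy 3-token vocabulary
ref    = [0.6, 0.3, 0.1]  # frozen reference policy

kl = kl_divergence(policy, ref)  # small but positive: the policy has drifted
```

The implicit KL penalty in DPO keeps this quantity small, which is precisely why responses far outside the reference policy's support cannot be lifted to high probability.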
We categorize existing DPO-based algorithms for addressing hallucination issues in LVLMs into three classes:
Our proposed OPA-DPO comprises four essential steps:
Hallucination Bench Evaluations:
Ablation Studies:
@article{yang2025opadpo,
title={Mitigating Hallucinations in Large Vision-Language Models via DPO: On-Policy Data Hold the Key},
author={Yang, Zhihe and Luo, Xufang and Han, Dongqi and Xu, Yunjian and Li, Dongsheng},
journal={arXiv preprint arXiv:2501.09695},
year={2025}
}