A new study from MIT researchers has revealed a fundamental flaw in vision-language models (VLMs) that could have serious implications for medical diagnostics and other critical applications.
The research team, led by Kumail Alhamoud and senior author Marzyeh Ghassemi from MIT's Department of Electrical Engineering and Computer Science, found that these AI systems—which are increasingly used to analyze medical images—fail to understand negation words like 'no' and 'not' in queries.
This limitation is particularly problematic in medical contexts. Consider a radiologist examining a chest X-ray that shows tissue swelling but no enlarged heart. If the radiologist uses an AI system to find similar cases and the model cannot distinguish the presence of a condition from its absence, it may return cases that show both findings, which could lead to an incorrect diagnosis.
"Those negation words can have a very significant impact, and if we are just using these models blindly, we may run into catastrophic consequences," warns lead author Alhamoud. When tested on their ability to identify negation in image captions, the models performed no better than random guessing.
To address this problem, the researchers developed NegBench, a comprehensive benchmark with 79,000 examples across 18 task variations spanning image, video, and medical datasets. The benchmark evaluates two core capabilities: retrieving images based on negated queries and answering multiple-choice questions with negated captions.
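The two capabilities can be made concrete with a small scoring sketch. The Python below is illustrative only and is not NegBench's code: the function names, the toy similarity scores, the four-option question format, and the choice of recall at k as the retrieval metric are all assumptions made for the example. It assumes a model has already produced a similarity score for each query-image or image-caption pair.

```python
# Illustrative sketch only: helper names and toy scores are hypothetical,
# not taken from NegBench.

def recall_at_k(scores, relevant, k):
    """Fraction of queries with at least one relevant image in the top k.

    scores:   per-query lists of similarity scores, one score per image
    relevant: per-query sets of indices of the images that match the query
    """
    hits = 0
    for query_scores, query_relevant in zip(scores, relevant):
        # Rank image indices from most to least similar to the query.
        ranked = sorted(range(len(query_scores)),
                        key=lambda i: query_scores[i], reverse=True)
        if query_relevant & set(ranked[:k]):
            hits += 1
    return hits / len(scores)


def multiple_choice_accuracy(caption_scores, answers):
    """Fraction of images whose highest-scoring caption is the correct one."""
    correct = 0
    for option_scores, answer in zip(caption_scores, answers):
        predicted = max(range(len(option_scores)),
                        key=lambda i: option_scores[i])
        correct += predicted == answer
    return correct / len(answers)


# Toy retrieval data: two negated queries scored against three images.
retrieval_scores = [
    [0.9, 0.2, 0.4],  # query 0: the model ranks image 0 first
    [0.3, 0.8, 0.5],  # query 1: the model ranks image 1 first
]
retrieval_relevant = [{0}, {2}]  # the image each query should retrieve

# Toy multiple-choice data: two images, four candidate captions each.
mcq_scores = [
    [0.1, 0.7, 0.1, 0.1],
    [0.6, 0.2, 0.1, 0.1],
]
mcq_answers = [1, 2]  # index of the correct caption for each image

print(recall_at_k(retrieval_scores, retrieval_relevant, k=1))  # 0.5
print(recall_at_k(retrieval_scores, retrieval_relevant, k=2))  # 1.0
print(multiple_choice_accuracy(mcq_scores, mcq_answers))       # 0.5
```

In this toy setup, a model that ignores "no" or "not" would tend to rank images containing the negated object highest, which shows up as lower recall on the retrieval task and as picking the wrong caption on the multiple-choice task.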
The team also created datasets of negation-specific examples and used them to fine-tune these models, achieving a 10% improvement in recall on negated queries and a 28% boost in accuracy on multiple-choice questions with negated captions. However, they caution that more work is needed to address the root causes of this problem.
"If something as fundamental as negation is broken, we shouldn't be using large vision/language models in many of the ways we are using them now—without intensive evaluation," emphasizes Ghassemi.
The research will be presented at the upcoming Conference on Computer Vision and Pattern Recognition. Its findings highlight the urgent need for more robust AI systems in critical applications such as healthcare.