A critical flaw in artificial intelligence systems used to analyze medical images could put patients at risk, according to new research from MIT published this week.
The study, led by graduate student Kumail Alhamoud and Associate Professor Marzyeh Ghassemi, reveals that vision-language models (VLMs), AI systems widely deployed in healthcare settings, fail to understand negation words such as 'no' and 'not', a weakness that becomes especially consequential when the models are used to analyze medical images.
"Those negation words can have a very significant impact, and if we are just using these models blindly, we may run into catastrophic consequences," warns Alhamoud, the study's lead author.
The researchers demonstrated this problem through a clinical example: if a radiologist examines a chest X-ray showing tissue swelling but no enlarged heart, an AI system might incorrectly retrieve cases with both conditions, potentially leading to an entirely different diagnosis. When formally tested, these AI models performed no better than random guessing on negation tasks.
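The failure mode the researchers describe can be probed with an off-the-shelf VLM. The sketch below is illustrative only, not the study's code: it scores a negated query and its affirmative counterpart against two hypothetical chest X-ray images using an openly available CLIP model through the Hugging Face transformers library. If the model effectively ignores the word "no", both queries rank the images almost identically.

```python
# Minimal illustrative probe (not the researchers' code) of how a CLIP-style
# model handles a negated retrieval query. File names and captions are hypothetical.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Two hypothetical chest X-rays: one with swelling only, one with both findings.
images = [Image.open("swelling_only.png"), Image.open("swelling_and_enlarged_heart.png")]

queries = [
    "a chest x-ray with tissue swelling and no enlarged heart",  # negated query
    "a chest x-ray with tissue swelling and an enlarged heart",  # affirmative query
]

inputs = processor(text=queries, images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_text[i, j] is the similarity of query i to image j. A model that
# ignores negation assigns both queries nearly the same ranking over the images.
print(outputs.logits_per_text.softmax(dim=-1))
```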
To address this limitation, the team developed NegBench, a comprehensive evaluation framework spanning 18 task variations and 79,000 examples across image, video, and medical datasets. Their proposed solution is to retrain VLMs on specially constructed datasets containing millions of negated captions. Early results are promising: retraining improved recall on negated queries by 10% and boosted accuracy on multiple-choice questions with negated captions by 28%.
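NegBench itself is not reproduced here, but the multiple-choice task it covers can be approximated in a few lines. In the sketch below, with a hypothetical image and candidate captions, a CLIP model must pick which caption, including negated ones, matches the image; accuracy on questions of this form is the kind of metric the 28% improvement refers to.

```python
# Illustrative negation multiple-choice check, in the spirit of the benchmark
# described above (not NegBench's actual code). The image path and candidate
# captions are hypothetical.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("xray_swelling_no_enlarged_heart.png")  # hypothetical image
candidates = [
    "a chest x-ray showing tissue swelling but no enlarged heart",  # correct answer
    "a chest x-ray showing tissue swelling and an enlarged heart",
    "a chest x-ray with no tissue swelling and an enlarged heart",
]

inputs = processor(text=candidates, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # shape: (1, num_candidates)

choice = logits.argmax(dim=-1).item()
print(candidates[choice])
# A model that ignores negation tends to prefer the purely affirmative caption,
# which is the behavior that retraining on negated captions aims to correct.
```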
"If something as fundamental as negation is broken, we shouldn't be using large vision/language models in many of the ways we are using them now – without intensive evaluation," cautions Ghassemi, highlighting the need for careful assessment before deploying these systems in high-stakes medical environments.
The research, which includes collaborators from OpenAI and Oxford University, will be presented at the upcoming Conference on Computer Vision and Pattern Recognition. The team has made their benchmark and code publicly available to help address this critical AI safety issue.