Skip to main content

REVIEW article

Front. Artif. Intell.
Sec. Medicine and Public Health
Volume 7 - 2024 | doi: 10.3389/frai.2024.1430984

Vision-Language Models for Medical Report Generation and Visual Question Answering: A Review

Provisionally accepted
  • Department of Machine Learning, H. Lee Moffitt Cancer Center and Research Institute, Tampa, United States

The final, formatted version of the article will be published soon.

    Medical vision-language models (VLMs) combine computer vision (CV) and natural language processing (NLP) to analyze visual and textual medical data. Our paper reviews recent advancements in developing VLMs specialized for healthcare, focusing on publicly available models designed for medical report generation and visual question answering (VQA). We provide background on NLP and CV, explaining how techniques from both fields are integrated into VLMs, with visual and language data often fused using Transformer-based architectures to enable effective learning from multimodal data. Key areas we address include the exploration of 18 public medical vision-language datasets, in-depth analyses of the architectures and pre-training strategies of 16 recent noteworthy medical VLMs, and comprehensive discussion on evaluation metrics for assessing VLMs' performance in medical report generation and VQA. We also highlight current challenges facing medical VLM development, including limited data availability, concerns with data privacy, and lack of proper evaluation metrics, among others, while also proposing future directions to address these obstacles. Overall, our review summarizes the recent progress in developing VLMs to harness multimodal medical data for improved healthcare applications.

    Keywords: vision-language models, Report generation, Visual question answering, Datasets, Evaluation metrics, healthcare

    Received: 10 May 2024; Accepted: 31 Oct 2024.

    Copyright: © 2024 Hartsock and Rasool. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

    * Correspondence: Iryna Hartsock, Department of Machine Learning, H. Lee Moffitt Cancer Center and Research Institute, Tampa, United States

    Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.