Multimedia Research (ISSN: 2582-547X)

Evaluation of a Multimodal Custom Fine-tuned LLM for Virtual Healthcare Consultations

Abstract

We present a modular, privacy-focused prototype of a multimodal virtual medical assistant that uses retrieval-augmented generation (RAG) to improve healthcare consultations. The system is motivated by the gap between traditional telemedicine and intelligent diagnostic support: it enables AI-driven consultations that are context-aware, multimodal, and privacy-preserving. The system runs a locally deployed LLaMA 3.2 (11B) model with 4-bit quantization, keeping it lightweight yet efficient. It processes both text and images, and has been fine-tuned on 50,000 image–label pairs from the MedTrinity dataset, which covers a wide range of medical images and descriptions; this fine-tuning improves the model's ability to answer multimodal medical questions. To enhance interpretability, the model's outputs are accompanied by transparent reasoning traces that indicate whether a response is derived from visual understanding, textual retrieval, or both. The assistant accepts text, image, and speech inputs; speech is transcribed using the AssemblyAI transcription API. For RAG, we use ChromaDB to store and retrieve medical documents from the MedQuAD dataset, which contains about 41,000 medical question–answer pairs. This integration enables the system to fetch domain-relevant evidence dynamically, helping users verify the medical reliability of generated responses. We evaluate our fine-tuned model against the base LLaMA 3.2 model, with responses judged by OpenAI's GPT-4.1 as an automatic evaluator. Performance is measured on the MMMU benchmark across three medical domains: (1) basic medical science, (2) clinical medicine, and (3) diagnostic and laboratory medicine. Each model variant (with and without RAG) was tested on 30 questions per domain and scored under both strict and non-strict criteria.
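The RAG retrieval step described above can be sketched compactly. The snippet below mirrors the nearest-neighbour lookup that a vector store such as ChromaDB performs, substituting a toy bag-of-words embedding for a real embedding model; the mini-corpus, `embed` function, and query are illustrative stand-ins, not part of the actual system:

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; the real system would use a
    # sentence-embedding model supplied to ChromaDB.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical mini-corpus standing in for the ~41,000 MedQuAD Q&A pairs.
corpus = [
    "What are the symptoms of type 2 diabetes?",
    "How is hypertension diagnosed and treated?",
    "What causes iron deficiency anemia?",
]

def retrieve(query: str, k: int = 1) -> list[str]:
    # Rank documents by cosine similarity to the query embedding.
    q = embed(query)
    ranked = sorted(corpus, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

print(retrieve("early symptoms of diabetes"))
# -> ['What are the symptoms of type 2 diabetes?']
```

The retrieved passages are then prepended to the model's prompt as evidence, which is what lets the assistant cite domain-relevant sources for its answers.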
The evaluation reveals that fine-tuning significantly enhances answer relevance and domain fluency, while RAG's contribution varies with retrieval quality, underscoring the need for domain-specific corpus curation in medical AI systems.
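The strict versus non-strict scoring can be expressed as a small aggregation over the judge's verdicts. The sketch below assumes three verdict labels ("correct", "partially_correct", "incorrect") returned by the GPT-4.1 evaluator; the label names and the sample verdict counts are illustrative assumptions, not figures from the paper:

```python
def accuracy(verdicts: list[str], strict: bool = True) -> float:
    # Strict scoring counts only fully correct answers;
    # non-strict scoring also credits partially correct ones.
    ok = {"correct"} if strict else {"correct", "partially_correct"}
    return sum(v in ok for v in verdicts) / len(verdicts)

# Hypothetical verdicts for one domain's 30 questions.
verdicts = ["correct"] * 18 + ["partially_correct"] * 6 + ["incorrect"] * 6

print(accuracy(verdicts, strict=True))   # 0.6
print(accuracy(verdicts, strict=False))  # 0.8
```

Reporting both scores makes the gap between them visible: a large spread signals that the model is often nearly right, which matters when judging the practical usefulness of a medical assistant.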

References

  • Arvind Kasthuri (2018). Challenges to Healthcare in India – The Five A's. https://pmc.ncbi.nlm.nih.gov/articles/PMC6166510/
  • Asma Ben Abacha and Dina Demner-Fushman (2019). A Question-Entailment Approach to Question Answering. BMC Bioinformatics. https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3119-4
  • Daniel Han, Michael Han, and the Unsloth team (2023). Unsloth. https://github.com/unslothai/unsloth
  • Yunfei Xie et al. (2024). MedTrinity-25M: A Large-scale Multimodal Dataset with Multigranular Annotations for Medicine. arXiv. https://arxiv.org/abs/2408.02900
  • Abhimanyu Dubey et al. (2024). The Llama 3 Herd of Models. arXiv. https://arxiv.org/abs/2407.21783
  • Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer (2023). QLoRA: Efficient Finetuning of Quantized LLMs. arXiv. https://arxiv.org/abs/2305.14314
  • Yang Liu et al. (2023). G-Eval: NLG Evaluation Using GPT-4 with Better Human Alignment. arXiv. https://arxiv.org/abs/2303.16634
  • Xiang Yue et al. (2023). MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI. arXiv. https://arxiv.org/abs/2311.16502
  • R. AlSaad et al. (2024). Multimodal Large Language Models in Health Care: Applications, Challenges, and Future Outlook. Journal of Medical Internet Research, vol. 26, e59505.
  • Y. Hu et al. (2025). Review: Medical Multimodal Large Language Models. ScienceDirect.
  • Zabir Al Nazi et al. (2024). Large Language Models in Healthcare and Medical Domain. Informatics (MDPI), vol. 11, no. 3, p. 57.
  • Enhancing Medical AI with Retrieval-Augmented Generation. PMC, 2025.
  • Comprehensive and Practical Evaluation of Retrieval-Augmented Generation for Medical QA. arXiv, 2024.
  • X. Zhao et al. (2025). MedRAG: Enhancing Retrieval-Augmented Generation with Knowledge Graph-Elicited Reasoning for Healthcare Copilot. arXiv.
  • J. Wu et al. (2024). Medical Graph RAG: Towards Safe Medical Large Language Model via Graph Retrieval-Augmented Generation. arXiv.
  • L. Buess et al. (2025). A Scoping Review on the Potential of Generative AI in Medicine. SpringerLink.
  • Retrieval-Augmented Generation in Biomedicine: A Survey. arXiv, 2025.
  • Improving Medical Reasoning through Retrieval and Self-Reflection with Retrieval-Augmented Large Language Models. arXiv, 2024.
  • H. Xiao, F. Zhou, X. Liu, T. Liu, Z. Li, X. Liu, and X. Huang (2024). A Comprehensive Survey of Large Language Models and Multimodal Large Language Models in Medicine. arXiv.
  • Karan Singhal et al. (2023). Large Language Models Encode Clinical Knowledge. Nature, vol. 620, no. 7972, pp. 172–180.
  • Michael Moor et al. (2023). Med-Flamingo: A Multimodal Medical Few-shot Learner. Proceedings of Machine Learning Research.