Multimedia Research (ISSN: 2582-547X)

Evaluation of a Multimodal Custom Fine-tuned LLM for Virtual Healthcare Consultations

Abstract

We present a modular, privacy-focused prototype of a multimodal virtual medical assistant that uses retrieval-augmented generation (RAG) to improve healthcare consultations. The system is motivated by the gap between traditional telemedicine and intelligent diagnostic support: it enables AI-driven consultations that are context-aware, multimodal, and privacy-preserving. The system runs a locally deployed LLaMA 3.2 (11B) model with 4-bit quantization, keeping it lightweight yet efficient. It processes both text and images, and has been fine-tuned on 50,000 image–label pairs from the MedTrinity dataset, which covers a wide range of medical images and descriptions; this fine-tuning improves the model's ability to answer multimodal medical questions. To enhance interpretability, the model's outputs are accompanied by transparent reasoning traces that indicate whether a response is derived from visual understanding, textual retrieval, or both. The assistant accepts text, image, and speech inputs; speech is transcribed using the AssemblyAI transcription API. For RAG, we use ChromaDB to store and retrieve medical documents from the MedQuAD dataset, which contains about 41,000 medical question–answer pairs. This integration enables the system to fetch domain-relevant evidence dynamically, helping users verify the medical reliability of generated responses. We evaluate our fine-tuned model against the base LLaMA 3.2 model, with responses judged by OpenAI's GPT-4.1 as an automatic evaluator. Performance is measured on the MMMU benchmark across three medical domains: (1) basic medical science, (2) clinical medicine, and (3) diagnostic and laboratory medicine. Each model variant (with and without RAG) was tested on 30 questions per domain and scored under both strict and non-strict criteria.
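The RAG retrieval step described above can be sketched compactly. The snippet below mirrors the nearest-neighbour lookup that a vector store such as ChromaDB performs, substituting a toy bag-of-words embedding for a real embedding model; the mini-corpus, `embed` function, and query are illustrative stand-ins, not part of the actual system:

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; the real system would use a
    # sentence-embedding model supplied to ChromaDB.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical mini-corpus standing in for the ~41,000 MedQuAD Q&A pairs.
corpus = [
    "What are the symptoms of type 2 diabetes?",
    "How is hypertension diagnosed and treated?",
    "What causes iron deficiency anemia?",
]

def retrieve(query: str, k: int = 1) -> list[str]:
    # Rank documents by cosine similarity to the query embedding.
    q = embed(query)
    ranked = sorted(corpus, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

print(retrieve("early symptoms of diabetes"))
# -> ['What are the symptoms of type 2 diabetes?']
```

The retrieved passages are then prepended to the model's prompt as evidence, which is what lets the assistant cite domain-relevant sources for its answers.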
The evaluation reveals that fine-tuning significantly enhances answer relevance and domain fluency, while RAG's contribution varies with retrieval quality, underscoring the need for domain-specific corpus curation in medical AI systems.
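The strict versus non-strict scoring can be expressed as a small aggregation over the judge's verdicts. The sketch below assumes three verdict labels ("correct", "partially_correct", "incorrect") returned by the GPT-4.1 evaluator; the label names and the sample verdict counts are illustrative assumptions, not figures from the paper:

```python
def accuracy(verdicts: list[str], strict: bool = True) -> float:
    # Strict scoring counts only fully correct answers;
    # non-strict scoring also credits partially correct ones.
    ok = {"correct"} if strict else {"correct", "partially_correct"}
    return sum(v in ok for v in verdicts) / len(verdicts)

# Hypothetical verdicts for one domain's 30 questions.
verdicts = ["correct"] * 18 + ["partially_correct"] * 6 + ["incorrect"] * 6

print(accuracy(verdicts, strict=True))   # 0.6
print(accuracy(verdicts, strict=False))  # 0.8
```

Reporting both scores makes the gap between them visible: a large spread signals that the model is often nearly right, which matters when judging the practical usefulness of a medical assistant.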

References

  • Arvind Kasthuri (2018). Challenges to Healthcare in India – The Five A's. https://pmc.ncbi.nlm.nih.gov/articles/PMC6166510/
  • Asma Ben Abacha and Dina Demner-Fushman (2019). A Question-Entailment Approach to Question Answering. BMC Bioinformatics. https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3119-4
  • Daniel Han, Michael Han, and the Unsloth team (2023). Unsloth. https://github.com/unslothai/unsloth
  • Yunfei Xie et al. (2024). MedTrinity-25M: A Large-scale Multimodal Dataset with Multigranular Annotations for Medicine. arXiv. https://arxiv.org/abs/2408.02900
  • Abhimanyu Dubey et al. (2024). The Llama 3 Herd of Models. arXiv. https://arxiv.org/abs/2407.21783
  • Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer (2023). QLoRA: Efficient Finetuning of Quantized LLMs. arXiv. https://arxiv.org/abs/2305.14314
  • Yang Liu et al. (2023). G-Eval: NLG Evaluation Using GPT-4 with Better Human Alignment. arXiv. https://arxiv.org/abs/2303.16634
  • Xiang Yue et al. (2023). MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI. arXiv. https://arxiv.org/abs/2311.16502
  • R. AlSaad et al. (2024). Multimodal Large Language Models in Health Care: Applications, Challenges, and Future Outlook. Journal of Medical Internet Research, vol. 26, e59505.
  • Y. Hu et al. (2025). Review: Medical Multimodal Large Language Models. ScienceDirect.
  • Zabir Al Nazi et al. (2024). Large Language Models in Healthcare and Medical Domain. Informatics (MDPI), vol. 11, no. 3, p. 57.
  • Enhancing Medical AI with Retrieval-Augmented Generation. PMC, 2025.
  • Comprehensive and Practical Evaluation of Retrieval-Augmented Generation for Medical QA. arXiv, 2024.
  • X. Zhao et al. (2025). MedRAG: Enhancing Retrieval-Augmented Generation with Knowledge Graph-Elicited Reasoning for Healthcare Copilot. arXiv.
  • J. Wu et al. (2024). Medical Graph RAG: Towards Safe Medical Large Language Model via Graph Retrieval-Augmented Generation. arXiv.
  • L. Buess et al. (2025). A Scoping Review on the Potential of Generative AI in Medicine. SpringerLink.
  • Retrieval-Augmented Generation in Biomedicine: A Survey. arXiv, 2025.
  • Improving Medical Reasoning through Retrieval and Self-Reflection with Retrieval-Augmented Large Language Models. arXiv, 2024.
  • H. Xiao, F. Zhou, X. Liu, T. Liu, Z. Li, X. Liu, and X. Huang (2024). A Comprehensive Survey of Large Language Models and Multimodal Large Language Models in Medicine. arXiv.
  • Karan Singhal et al. (2023). Large Language Models Encode Clinical Knowledge. Nature, vol. 620, no. 7972, pp. 172–180.
  • Michael Moor et al. (2023). Med-Flamingo: A Multimodal Medical Few-shot Learner. Proceedings of Machine Learning Research.