Millions of people lack access to specialized dermatological care because of geographic and technological disparities. We present a multimodal framework that combines image-based diagnosis with a visual question answering (VQA) pipeline, powered by DINOv2 and a compressed LLaVA model. The system supports skin disease diagnosis and explanation and is optimized for low-resource settings.

This project introduces a clinical-grade Visual Language Model (VLM) capable of dermatological diagnosis from natural-language prompts and images. The assistant is trained in four stages: auxiliary classification, medical reasoning, interaction optimization, and resource-efficient deployment through structured pruning. The final model achieves 82.05% diagnostic accuracy and a 9/10 patient interaction score while running in under 4.5 GB of memory.
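
To make the structured pruning stage concrete, here is a minimal sketch of one common variant: removing whole output neurons (rows of a weight matrix) with the smallest L2 norm. The pruning criterion, the `prune_neurons` helper, and the `keep_ratio` parameter are assumptions for illustration; the project's actual pruning criterion is not described here.

```python
import math

def prune_neurons(weight, keep_ratio):
    """Keep the rows (output neurons) with the largest L2 norm.

    weight: list of rows, each a list of floats (hypothetical layer weights).
    keep_ratio: fraction of rows to keep, between 0 and 1.
    Returns (pruned_weight, kept_indices), preserving the original row order.
    """
    n_keep = max(1, int(len(weight) * keep_ratio))
    # Score every neuron by the L2 norm of its weight row.
    norms = [math.sqrt(sum(w * w for w in row)) for row in weight]
    # Rank neurons from most to least important, then keep the top n_keep.
    ranked = sorted(range(len(weight)), key=lambda i: norms[i], reverse=True)
    kept = sorted(ranked[:n_keep])
    return [weight[i] for i in kept], kept

# Toy layer with four neurons; two of them have near-zero weights.
W = [[0.1, 0.0], [2.0, 1.0], [0.0, 0.05], [1.0, -1.0]]
pruned, kept = prune_neurons(W, keep_ratio=0.5)
```

Because entire rows are removed, the layer shrinks to a smaller dense matrix, which is what reduces memory use. In a real network the matching input columns of the following layer would need to be removed as well.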

Key contributions:

  • Integration of DINOv2 and LLaVA for robust image-text understanding.
  • Domain-specific fine-tuning and question-answering for medical settings.
  • Progressive enhancement through reasoning, Direct Preference Optimization (DPO), and pruning.
  • Local and global impact potential, especially in under-resourced areas.
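
As an illustration of the DPO step listed above, the following sketch computes the standard DPO loss for a single preference pair from sequence-level log-probabilities. The function names, the `beta` value, and the example numbers are assumptions for illustration and are not taken from the project's training code.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response under
    the trainable policy or the frozen reference model.
    """
    # Implicit rewards: how much more the policy favors each response
    # than the reference model does, scaled by beta.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) is the same as softplus(-margin).
    return softplus(-margin)

# Hypothetical log-probabilities: the policy prefers the chosen answer
# more strongly than the reference does, so the loss falls below log(2).
loss = dpo_loss(-1.0, -5.0, -2.0, -2.0)
```

When the policy and the reference assign identical log-probabilities, the margin is zero and the loss equals log(2); it decreases as the policy shifts probability toward the preferred response.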

📄 View Poster Below: Methodology figure
