Efficient Medical MLLMs: Compressing LLaVA via Pruning and Quantization
Multimodal Large Language Models (MLLMs) hold significant promise for the medical domain, combining visual understanding with language processing. However, their high computational cost often requires specialized hardware and limits widespread adoption, creating a need for efficient compression techniques. This project evaluates the impact of structural pruning and activation-aware quantization on a LLaVA model fine-tuned for medical applications. We compare several layer-selection strategies for pruning and different quantization techniques, assessing the performance trade-offs within a prune → Supervised Fine-Tuning (SFT) → quantize pipeline.
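The sketch below illustrates the structural layer-pruning step of such a prune → SFT → quantize pipeline. It is a minimal example under stated assumptions, not the project's actual implementation: the helper names (`score_layers`, `prune_layers`), the importance heuristic (scoring a layer by how little it changes its input hidden states on calibration data), and the toy decoder stack are all illustrative choices standing in for the layer-selection methodologies evaluated in the project.

```python
# Hedged sketch of structural layer pruning before SFT and quantization.
# Assumption: layers whose outputs are nearly identical to their inputs
# contribute little and are candidates for removal. Names and the toy
# model below are illustrative, not the project's API.
import torch
import torch.nn as nn
import torch.nn.functional as F


def score_layers(layers: nn.ModuleList, calib_batches, device="cpu"):
    """Score each layer by 1 - cos(input, output) averaged over calibration data.

    Lower scores mean the layer barely transforms its input, so it is a
    better candidate for structural pruning.
    """
    scores = torch.zeros(len(layers))
    with torch.no_grad():
        for x in calib_batches:
            h = x.to(device)
            for i, layer in enumerate(layers):
                out = layer(h)
                # Some decoder layers return a tuple (hidden_states, ...).
                out = out[0] if isinstance(out, tuple) else out
                cos = F.cosine_similarity(h.flatten(1), out.flatten(1), dim=-1).mean()
                scores[i] += (1.0 - cos).item()
                h = out
    return scores / max(len(calib_batches), 1)


def prune_layers(layers: nn.ModuleList, scores, keep_ratio=0.75) -> nn.ModuleList:
    """Keep the top `keep_ratio` fraction of layers by importance score."""
    n_keep = max(1, int(round(keep_ratio * len(layers))))
    keep_idx = sorted(torch.topk(scores, n_keep).indices.tolist())
    return nn.ModuleList([layers[i] for i in keep_idx])


if __name__ == "__main__":
    # Toy stand-in for a decoder stack. A real run would operate on the
    # language-model layers of a LLaVA checkpoint (the exact attribute path,
    # e.g. model.language_model.model.layers, depends on the transformers version).
    torch.manual_seed(0)
    stack = nn.ModuleList(
        [nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True) for _ in range(8)]
    )
    calib = [torch.randn(2, 16, 64) for _ in range(4)]
    scores = score_layers(stack, calib)
    pruned = prune_layers(stack, scores, keep_ratio=0.75)
    print(f"kept {len(pruned)}/{len(stack)} layers")
```

In the full pipeline, the pruned model would then be recovered with supervised fine-tuning and finally quantized with an activation-aware scheme; both of those stages are omitted from this sketch.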
Our proposed method delivers significant efficiency gains: a 7-billion-parameter MLLM runs within just 4 GB of VRAM, a roughly 70% reduction in memory usage. Crucially, this efficiency does not come at the cost of performance: at the same compression ratio, our approach achieves 4% higher accuracy than conventional pruning and quantization baselines.
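As a rough sanity check (assuming an FP16 baseline, not a figure from the project): a 7B-parameter model needs about 7e9 × 2 bytes ≈ 14 GB for weights alone, so serving it in ~4 GB of VRAM is consistent with the reported ~70% memory reduction.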
For implementation details, code, and further exploration, please visit the project repository: