Published at MICCAI 2024.
Medical image segmentation of anatomical structures and pathology is crucial in modern clinical diagnosis, disease study, and treatment planning. While deep learning-based methods have achieved strong performance, they often lack data efficiency, generalizability, and interactivity. In this work, we propose MedCLIP-SAM, a novel framework that bridges vision–language models and segmentation foundation models to enable text-driven universal medical image segmentation. MedCLIP-SAM integrates BiomedCLIP fine-tuned with a Decoupled Hard Negative Noise Contrastive Estimation (DHN-NCE) loss, gScoreCAM-based saliency generation, conditional random field (CRF) post-processing, and Segment Anything Model (SAM) refinement. The framework supports both zero-shot and weakly supervised segmentation and is validated across breast ultrasound, brain MRI, and chest X-ray datasets, demonstrating strong accuracy and generalization.
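To illustrate the fine-tuning objective, below is a minimal PyTorch sketch of a decoupled, hard-negative-weighted contrastive loss in the spirit of DHN-NCE: the positive pair is removed from the denominator (decoupling), and remaining negatives are re-weighted by their similarity to the anchor (hardness). The exact weighting scheme, hyperparameter values (temperature, beta), and function name are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def dhn_nce_loss(img_emb, txt_emb, temperature=0.07, beta=0.25):
    """Sketch of a decoupled, hard-negative-weighted contrastive loss.

    img_emb, txt_emb: (N, D) L2-normalised embeddings of paired images and texts.
    The positive pair is excluded from the denominator, and negatives are
    up-weighted in proportion to their similarity to the anchor.
    """
    logits = img_emb @ txt_emb.t() / temperature           # (N, N) similarity matrix
    n = logits.size(0)
    pos_mask = torch.eye(n, dtype=torch.bool, device=logits.device)

    def one_direction(lgts):
        pos = lgts[pos_mask]                                # positive logits s_ii / tau
        neg = lgts.masked_fill(pos_mask, float('-inf'))     # drop positives from denominator
        # hardness weights: larger for negatives more similar to the anchor, mean weight ~1
        w = torch.softmax(beta * neg, dim=1) * (n - 1)
        denom = torch.logsumexp(neg + torch.log(w + 1e-12), dim=1)
        return (denom - pos).mean()

    # symmetric image-to-text and text-to-image terms
    return one_direction(logits) + one_direction(logits.t())

# example usage with random, L2-normalised embeddings
img = F.normalize(torch.randn(8, 512), dim=-1)
txt = F.normalize(torch.randn(8, 512), dim=-1)
loss = dhn_nce_loss(img, txt)
```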
We present representative qualitative results of MedCLIP-SAM across three medical imaging modalities: breast ultrasound, brain MRI, and chest X-ray. The results highlight the effectiveness of text-driven prompts in guiding segmentation without requiring pixel-level annotations, while SAM refinement improves boundary adherence and anatomical consistency.
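As a rough sketch of the refinement step, the snippet below converts a binarised, CRF-refined saliency map into a bounding-box prompt for SAM using the segment_anything package. The checkpoint path, box-only prompting, and function name are assumptions about the setup rather than the paper's exact configuration.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

def refine_with_sam(image_rgb: np.ndarray, pseudo_mask: np.ndarray,
                    checkpoint: str = "sam_vit_b.pth") -> np.ndarray:
    """Refine a coarse pseudo-mask with SAM via a bounding-box prompt.

    image_rgb:   (H, W, 3) uint8 RGB image.
    pseudo_mask: (H, W) non-empty binary pseudo-mask (e.g. thresholded, CRF-refined saliency).
    checkpoint:  path to a SAM ViT-B checkpoint (placeholder name).
    """
    ys, xs = np.nonzero(pseudo_mask)
    box = np.array([xs.min(), ys.min(), xs.max(), ys.max()])  # XYXY box around the pseudo-mask

    sam = sam_model_registry["vit_b"](checkpoint=checkpoint)
    predictor = SamPredictor(sam)
    predictor.set_image(image_rgb)
    masks, scores, _ = predictor.predict(box=box, multimask_output=False)
    return masks[0]  # (H, W) boolean refined mask
```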
@inproceedings{koleilat2024medclip,
  title={MedCLIP-SAM: Bridging text and image towards universal medical image segmentation},
  author={Koleilat, Taha and Asgariandehkordi, Hojat and Rivaz, Hassan and Xiao, Yiming},
  booktitle={International Conference on Medical Image Computing and Computer-Assisted Intervention},
  pages={643--653},
  year={2024},
  organization={Springer}
}