MedCLIP-SAM: Bridging Text and Image Towards Universal Medical Image Segmentation

Taha Koleilat · Hojat Asgariandehkordi · Hassan Rivaz · Yiming Xiao
Concordia University · IMPACT Lab · Health-X Lab

Published at MICCAI 2024.

MedCLIP-SAM enables interactive, text-driven medical image segmentation by combining BiomedCLIP saliency, CRF refinement, and SAM-based mask generation in zero-shot and weakly supervised settings.

Abstract

Medical image segmentation of anatomical structures and pathology is crucial in modern clinical diagnosis, disease study, and treatment planning. While deep learning-based methods have achieved strong performance, they often lack data efficiency, generalizability, and interactivity. In this work, we propose MedCLIP-SAM, a novel framework that bridges vision–language models and segmentation foundation models to enable text-driven universal medical image segmentation. MedCLIP-SAM integrates BiomedCLIP fine-tuned with a Decoupled Hard Negative Noise Contrastive Estimation (DHN-NCE) loss, gScoreCAM-based saliency generation, CRF post-processing, and Segment Anything Model (SAM) refinement. The framework supports both zero-shot and weakly supervised segmentation and is validated across breast ultrasound, brain MRI, and chest X-ray datasets, demonstrating strong accuracy and generalization.

Overview

  • DHN-NCE Fine-tuning: Efficiently adapts BiomedCLIP to medical image–text pairs using hard negatives and decoupled contrastive learning (loss sketch below).
  • Text-driven Saliency: gScoreCAM generates class-relevant saliency maps from text prompts (saliency sketch below).
  • CRF Post-processing: Refines saliency maps into coarse pseudo-masks (CRF sketch below).
  • SAM Refinement: Uses box prompts derived from the coarse pseudo-masks to produce high-quality segmentation masks (SAM sketch below).
  • Weak Supervision: Zero-shot masks can optionally supervise downstream segmentation networks (training-loop sketch below).
Figure: Overview of the MedCLIP-SAM framework.
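
To make the fine-tuning objective concrete, here is a minimal PyTorch sketch of a DHN-NCE-style loss: the positive pair is removed from the contrastive denominator (decoupling), and the remaining negatives are re-weighted by their similarity to the anchor (hard-negative emphasis). The function name, temperature, and the weighting coefficient beta are illustrative assumptions, not the paper's exact values.

import torch

def dhn_nce_loss(img_emb, txt_emb, temperature=0.07, beta=0.25):
    """DHN-NCE-style loss sketch for paired, L2-normalized (B, D) embeddings."""
    sim = img_emb @ txt_emb.t() / temperature  # (B, B) image-text similarities
    B = sim.size(0)
    eye = torch.eye(B, dtype=torch.bool, device=sim.device)

    def one_direction(s):
        pos = s.diag()                                  # positive-pair logits
        neg = s.masked_fill(eye, float('-inf'))         # drop positives (decoupling)
        w = (B - 1) * torch.softmax(beta * neg, dim=1)  # up-weight hard negatives
        return (-pos + torch.log((w * neg.exp()).sum(dim=1))).mean()

    # Symmetric objective over image-to-text and text-to-image directions.
    return 0.5 * (one_direction(sim) + one_direction(sim.t()))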
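
The saliency step can be sketched as follows: gradients of the image–text similarity rank the channels of a visual activation, and only the top-k channels are scored ScoreCAM-style via masked forward passes. This simplified sketch assumes a CNN-style (C, H, W) activation at a user-supplied target_layer; hook placement for BiomedCLIP's ViT backbone and the exact channel budget differ in practice.

import torch
import torch.nn.functional as F

def gscorecam(model, image, text_emb, target_layer, top_k=100):
    """Text-conditioned saliency sketch; text_emb is an L2-normalized (D,) vector."""
    acts, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))

    img_emb = F.normalize(model.encode_image(image), dim=-1)
    (img_emb * text_emb).sum().backward()               # similarity -> gradients
    h1.remove(); h2.remove()

    a, g = acts['a'][0], grads['g'][0]                  # (C, H, W)
    k = min(top_k, a.size(0))
    idx = g.abs().mean(dim=(1, 2)).topk(k).indices      # gradient-ranked channels
    maps = F.relu(a[idx])

    weights = []
    with torch.no_grad():
        for m in maps:                                   # ScoreCAM-style weighting
            m = (m - m.min()) / (m.max() - m.min() + 1e-8)
            m = F.interpolate(m[None, None], size=image.shape[-2:],
                              mode='bilinear', align_corners=False)
            e = F.normalize(model.encode_image(image * m), dim=-1)
            weights.append((e * text_emb).sum())
    return F.relu((torch.stack(weights)[:, None, None] * maps).sum(0))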
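
CRF post-processing can be done with the pydensecrf package: the saliency map supplies the unary potentials, and Gaussian plus bilateral pairwise kernels snap the mask to image edges. The kernel parameters below are common defaults, not necessarily the paper's settings.

import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def crf_refine(image, saliency, n_iters=5):
    """image: (H, W, 3) uint8 RGB; saliency: (H, W) float in [0, 1]."""
    h, w = saliency.shape
    probs = np.stack([1.0 - saliency, saliency])         # background, foreground
    d = dcrf.DenseCRF2D(w, h, 2)
    d.setUnaryEnergy(unary_from_softmax(probs))          # clips and takes -log internally
    d.addPairwiseGaussian(sxy=3, compat=3)               # spatial smoothness kernel
    d.addPairwiseBilateral(sxy=80, srgb=13,              # appearance (color) kernel
                           rgbim=np.ascontiguousarray(image), compat=10)
    q = np.array(d.inference(n_iters)).reshape(2, h, w)
    return (q.argmax(axis=0) == 1).astype(np.uint8)      # binary pseudo-mask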
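
SAM refinement then converts the pseudo-mask into a box prompt. A minimal sketch with the official segment-anything package follows; the checkpoint path is a placeholder, and a single global box is an assumption, whereas per-connected-component boxes may be used in practice.

import numpy as np
from segment_anything import sam_model_registry, SamPredictor

def sam_refine(image, pseudo_mask, checkpoint="sam_vit_b.pth"):
    """image: (H, W, 3) uint8 RGB; pseudo_mask: (H, W) binary array."""
    sam = sam_model_registry["vit_b"](checkpoint=checkpoint)  # placeholder path
    predictor = SamPredictor(sam)
    predictor.set_image(image)
    ys, xs = np.nonzero(pseudo_mask)
    box = np.array([xs.min(), ys.min(), xs.max(), ys.max()])  # XYXY box prompt
    masks, _, _ = predictor.predict(box=box, multimask_output=False)
    return masks[0]                                           # refined binary mask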
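
Finally, for the weakly supervised setting, the zero-shot masks can serve as training targets for any segmentation network. A minimal training-loop sketch is below; the model, data loader, BCE-plus-Dice loss mix, and schedule are assumptions rather than the paper's exact recipe.

import torch
import torch.nn.functional as F

def train_on_pseudo_masks(model, loader, epochs=20, lr=1e-4, device="cuda"):
    """Train a binary segmentation network on (image, pseudo_mask) pairs."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    model.to(device).train()
    for _ in range(epochs):
        for img, pmask in loader:
            img, pmask = img.to(device), pmask.float().to(device)
            logits = model(img).squeeze(1)                  # (B, H, W) mask logits
            prob = torch.sigmoid(logits)
            dice = 1 - (2 * (prob * pmask).sum() + 1) / (prob.sum() + pmask.sum() + 1)
            loss = F.binary_cross_entropy_with_logits(logits, pmask) + dice
            opt.zero_grad(); loss.backward(); opt.step()
    return model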

Qualitative Results

We present representative qualitative results of MedCLIP-SAM across three medical imaging modalities: breast ultrasound, brain MRI, and chest X-ray. The results highlight the effectiveness of text-driven prompts in guiding segmentation without requiring pixel-level annotations, while SAM refinement improves boundary adherence and anatomical consistency.

Figure: Qualitative comparison of segmentation results across modalities. From left to right: input image, CLIP-based saliency, zero-shot SAM segmentation, weakly supervised refinement, and ground truth.

Colab Demo

Open in Colab

BibTeX

@inproceedings{koleilat2024medclip,
  title={MedCLIP-SAM: Bridging text and image towards universal medical image segmentation},
  author={Koleilat, Taha and Asgariandehkordi, Hojat and Rivaz, Hassan and Xiao, Yiming},
  booktitle={International Conference on Medical Image Computing and Computer-Assisted Intervention},
  pages={643--653},
  year={2024},
  organization={Springer}
}