MedCLIP-SAMv2: Towards Universal Text-Driven Medical Image Segmentation

Taha Koleilat · Hojat Asgariandehkordi · Hassan Rivaz · Yiming Xiao
Concordia University · IMPACT Lab · Health-X Lab

Published in Medical Image Analysis (2025).

MedCLIP-SAMv2 framework overview

MedCLIP-SAMv2 integrates BiomedCLIP fine-tuning (DHN-NCE), text-driven saliency (M2IB), SAM-based refinement, and uncertainty-aware weak supervision.

Abstract

Segmentation of anatomical structures and pathologies in medical images is essential for modern disease diagnosis, clinical research, and treatment planning. Despite significant advances in deep learning-based segmentation, many of these methods remain limited in data efficiency, generalizability, and interactivity. Recently, foundation models such as CLIP and the Segment Anything Model (SAM) have paved the way for interactive and universal image segmentation. In this work, we introduce MedCLIP-SAMv2, a framework that integrates BiomedCLIP and SAM to perform text-driven medical image segmentation in zero-shot and weakly supervised settings. The approach fine-tunes BiomedCLIP with a new DHN-NCE loss, leverages the Multi-modal Information Bottleneck (M2IB) to create visual prompts for SAM, and explores uncertainty-aware refinement via checkpoint ensembling.

Overview

  • DHN-NCE Fine-tuning: Improves BiomedCLIP cross-modal alignment by emphasizing hard negatives and decoupling positives for efficient training.
  • Text-driven Saliency (M2IB): Generates attribution maps conditioned on prompts (including LLM-generated descriptions).
  • SAM Refinement: Converts saliency maps into box/point prompts and refines masks with SAM; a minimal prompt-conversion sketch follows this list.
  • Weak Supervision + Uncertainty: Trains nnUNet on pseudo-labels; checkpoint ensembling yields refined masks and uncertainty maps.
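As a concrete illustration of the saliency-to-SAM step, the sketch below converts a text-driven saliency map into a box prompt and a point prompt for SAM. It is a minimal sketch, not the exact pipeline: it assumes the `segment-anything` package, a placeholder checkpoint path (`sam_vit_b.pth`), and illustrative thresholding/prompt-selection heuristics.

```python
# Minimal sketch: turn a [0,1] saliency map into SAM prompts (box + point).
# Assumes the `segment-anything` package and a local SAM checkpoint;
# threshold and prompt choices are illustrative, not the exact pipeline settings.
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

def saliency_to_prompts(saliency: np.ndarray, thresh: float = 0.5):
    """Binarize a saliency map and derive a bounding box and a point prompt."""
    mask = saliency >= thresh * saliency.max()
    ys, xs = np.nonzero(mask)
    peak_y, peak_x = np.unravel_index(saliency.argmax(), saliency.shape)
    if len(xs) == 0:  # nothing salient: fall back to a small box around the peak
        return (np.array([max(peak_x - 5, 0), max(peak_y - 5, 0), peak_x + 5, peak_y + 5]),
                np.array([[peak_x, peak_y]]))
    box = np.array([xs.min(), ys.min(), xs.max(), ys.max()])   # x0, y0, x1, y1
    point = np.array([[peak_x, peak_y]])                       # most salient pixel
    return box, point

def segment_with_sam(image_rgb: np.ndarray, saliency: np.ndarray,
                     checkpoint: str = "sam_vit_b.pth") -> np.ndarray:
    """Refine a text-driven saliency map into a binary mask with SAM."""
    sam = sam_model_registry["vit_b"](checkpoint=checkpoint)
    predictor = SamPredictor(sam)
    predictor.set_image(image_rgb)                              # HxWx3 uint8, RGB
    box, point = saliency_to_prompts(saliency)
    masks, scores, _ = predictor.predict(
        point_coords=point, point_labels=np.ones(len(point)),
        box=box, multimask_output=True)
    return masks[scores.argmax()]                               # keep highest-scoring mask
```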
Essential components of MedCLIP-SAMv2
Essential components of the framework.
DHN-NCE loss illustration
DHN-NCE prioritizes hard negatives compared to standard CLIP-style contrastive loss.
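For readers who want the gist of DHN-NCE in code, here is a minimal sketch of a decoupled, hard-negative-weighted contrastive loss: the positive pair is removed from the denominator (decoupling) and negatives are re-weighted by their similarity (hardness). The weighting scheme and the `tau`/`beta` defaults are illustrative assumptions; the paper gives the exact formulation.

```python
# Hedged sketch of a decoupled, hard-negative-weighted contrastive (DHN-NCE-style) loss.
import torch
import torch.nn.functional as F

def dhn_nce_loss(img_emb, txt_emb, tau: float = 0.07, beta: float = 1.0):
    """Symmetric image-text contrastive loss over a batch of matched pairs."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / tau                  # B x B, positives on the diagonal
    off_diag = ~torch.eye(logits.size(0), dtype=torch.bool, device=logits.device)

    def directional(l):
        neg = l.masked_fill(~off_diag, float("-inf"))     # decoupling: drop the positive term
        w = torch.softmax(beta * neg, dim=1) * (l.size(0) - 1)   # up-weight harder negatives
        denom = torch.logsumexp(neg + torch.log(w + 1e-12), dim=1)
        return (denom - l.diag()).mean()

    return 0.5 * (directional(logits) + directional(logits.t()))
```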

Key Results

We evaluate on four modalities/tasks: breast tumor ultrasound, brain tumor MRI, lung X-ray, and lung CT. Below are the main comparison and ablation tables.
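For reference, DSC is the Dice Similarity Coefficient and NSD the Normalized Surface Dice. A minimal Dice sketch is shown below; NSD additionally requires surface-distance computations and a tolerance in millimetres (e.g., via a dedicated surface-distance package), which is omitted here.

```python
# Minimal sketch of the Dice Similarity Coefficient (DSC) reported in the tables.
import numpy as np

def dice(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> float:
    """Dice Similarity Coefficient between two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum() + eps)
```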

Comparison with SOTA Methods

| Setting | Method | Breast US (DSC↑ / NSD↑) | Brain MRI (DSC↑ / NSD↑) | Lung X-ray (DSC↑ / NSD↑) | Lung CT (DSC↑ / NSD↑) | All (DSC↑ / NSD↑) |
|---|---|---|---|---|---|---|
| Zero-shot | SaLIP | 44.33±10.12 / 48.62±10.25 | 47.96±9.14 / 50.24±9.26 | 63.14±11.34 / 66.44±11.58 | 76.32±11.22 / 78.46±11.35 | 57.94±10.49 / 60.94±10.65 |
| Zero-shot | SAMAug | 56.39±10.85 / 59.23±10.92 | 45.71±10.34 / 48.81±11.29 | 57.18±12.12 / 60.08±12.34 | 44.61±10.42 / 46.48±10.57 | 50.97±10.96 / 53.65±11.30 |
| Zero-shot | MedCLIP-SAM | 67.82±8.26 / 69.12±9.12 | 66.72±5.27 / 68.01±6.16 | 64.49±9.09 / 65.89±10.44 | 59.14±9.52 / 60.47±9.98 | 64.54±8.20 / 66.10±9.08 |
| Zero-shot | MedCLIP-SAMv2 (Ours) | 77.76±9.52 / 81.11±9.89 | 76.52±7.06 / 82.23±7.13 | 75.79±3.44 / 80.88±3.52 | 80.38±5.81 / 82.03±5.94 | 77.61±6.82 / 81.56±7.00 |
| Weakly supervised | nnUNet (pseudo-labels) | 73.77±14.48 / 79.71±14.79 | 77.16±12.17 / 85.21±12.60 | 70.15±6.40 / 74.10±6.59 | 82.24±5.12 / 85.65±4.70 | 75.83±10.31 / 81.17±10.52 |
| Weakly supervised | MedCLIP-SAM | 58.62±5.66 / 60.94±5.87 | 58.80±8.63 / 61.77±8.64 | 86.07±8.61 / 88.65±8.09 | 80.12±8.38 / 83.73±8.29 | 70.90±7.92 / 73.77±7.80 |
| Weakly supervised | MedCLIP-SAMv2 (Ours) | 78.87±12.29 / 84.58±12.19 | 80.03±9.91 / 88.25±10.04 | 80.77±4.44 / 84.53±4.51 | 88.78±4.43 / 91.95±4.06 | 82.11±8.49 / 87.33±8.46 |
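The weakly supervised rows rely on checkpoint ensembling to refine pseudo-label predictions and produce uncertainty maps. The sketch below shows one plausible way to do this, assuming each checkpoint's softmax output is already available as a `classes x H x W` array; the nnUNet inference itself is abstracted away.

```python
# Hedged sketch: checkpoint ensembling with an entropy-based uncertainty map.
import numpy as np

def ensemble_with_uncertainty(prob_maps):
    """Average softmax maps from several checkpoints; return mask + entropy map."""
    mean_prob = np.mean(np.stack(prob_maps, axis=0), axis=0)        # classes x H x W
    mask = mean_prob.argmax(axis=0)                                 # refined pseudo-label
    entropy = -(mean_prob * np.log(mean_prob + 1e-12)).sum(axis=0)  # pixel-wise uncertainty
    return mask, entropy

# Toy example: three "checkpoints" producing 2-class probability maps.
probs = [np.random.dirichlet([1, 1], size=(64, 64)).transpose(2, 0, 1) for _ in range(3)]
mask, uncertainty = ensemble_with_uncertainty(probs)
```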

DHN-NCE Improves Cross-modal Retrieval

| Model | Version / Loss | Image→Text Top-1 (%) | Image→Text Top-2 (%) | Text→Image Top-1 (%) | Text→Image Top-2 (%) |
|---|---|---|---|---|---|
| CLIP | Pre-trained | 26.68±0.30 | 41.80±0.19 | 26.17±0.20 | 41.13±0.20 |
| PMC-CLIP | Pre-trained | 75.47±0.37 | 87.46±0.11 | 76.78±0.11 | 88.35±0.19 |
| BiomedCLIP | Pre-trained | 81.83±0.20 | 92.79±0.13 | 81.36±0.48 | 92.27±0.14 |
| BiomedCLIP | InfoNCE | 84.21±0.35 | 94.47±0.19 | 85.73±0.19 | 94.99±0.16 |
| BiomedCLIP | DCL | 84.44±0.37 | 94.68±0.19 | 85.89±0.16 | 95.09±0.19 |
| BiomedCLIP | HN-NCE | 84.33±0.35 | 94.60±0.19 | 85.80±0.17 | 95.10±0.19 |
| BiomedCLIP | DHN-NCE (Ours) | 84.70±0.33 | 94.73±0.16 | 85.99±0.19 | 95.17±0.19 |
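The retrieval numbers above are top-k accuracies over matched image-caption pairs. A rough sketch of that metric, assuming precomputed image and text embeddings with matching pairs on the diagonal, is:

```python
# Sketch of top-k cross-modal retrieval accuracy (image→text and text→image).
import torch
import torch.nn.functional as F

def topk_retrieval(img_emb: torch.Tensor, txt_emb: torch.Tensor, k: int = 1):
    """Return (image→text, text→image) top-k accuracy for N matched pairs."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    sim = img_emb @ txt_emb.t()                     # N x N, ground truth on the diagonal
    gt = torch.arange(sim.size(0), device=sim.device)
    i2t = (sim.topk(k, dim=1).indices == gt[:, None]).any(dim=1).float().mean()
    t2i = (sim.t().topk(k, dim=1).indices == gt[:, None]).any(dim=1).float().mean()
    return i2t.item(), t2i.item()
```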

Ablation: Contribution of Each Component

| Step | Method | DSC↑ | NSD↑ |
|---|---|---|---|
| 1 | Saliency maps | 46.23±8.58 | 50.50±8.86 |
| 2 | + DHN-NCE fine-tuning | 49.10±8.46 | 53.54±8.62 |
| 3 | + Post-processing | 51.62±7.57 | 55.23±7.47 |
| 4 | + Connected component analysis | 57.89±7.87 | 61.54±8.02 |
| 5 | + SAM | 77.61±6.82 | 81.56±7.00 |
| 6 | + nnUNet ensemble | 82.11±8.49 | 87.33±8.46 |

Qualitative Results

We present qualitative segmentation results produced by MedCLIP-SAMv2 across diverse imaging modalities, including brain MRI, breast ultrasound, chest X-ray, and chest CT. The results demonstrate that text-driven prompts, combined with BiomedCLIP-based saliency and SAM-based refinement, yield accurate and coherent segmentations that align well with anatomical structures and pathological regions, even in the absence of pixel-level supervision.

Sample segmentation results
Sample text-driven segmentation outputs across modalities.

Cross-Modal Saliency Comparison

We compare text-driven saliency maps generated using BiomedCLIP and standard CLIP across four representative medical imaging modalities: brain MRI, breast ultrasound, chest X-ray, and chest CT. BiomedCLIP consistently produces more localized and anatomically meaningful responses aligned with clinically relevant regions, while standard CLIP often exhibits diffuse or semantically ambiguous activations.
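M2IB itself is described in the paper; as a much simpler stand-in for illustration only, the sketch below builds an occlusion-based saliency map from the BiomedCLIP image-text similarity, with the model loaded through `open_clip`'s Hugging Face hub identifier. Swapping in a standard CLIP model name reproduces the kind of side-by-side comparison shown in the figure. Patch size and the zero-fill occlusion are illustrative assumptions.

```python
# Occlusion-based saliency as a simple stand-in for M2IB (illustrative only).
import torch
import torch.nn.functional as F
import open_clip

MODEL_ID = "hf-hub:microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224"
model, _, preprocess = open_clip.create_model_and_transforms(MODEL_ID)
tokenizer = open_clip.get_tokenizer(MODEL_ID)
model.eval()

@torch.no_grad()
def occlusion_saliency(pil_image, prompt: str, patch: int = 32):
    """Drop in image-text similarity when each patch is zeroed out (post-normalization)."""
    image = preprocess(pil_image).unsqueeze(0)                       # 1 x 3 x 224 x 224
    text = tokenizer([prompt])
    txt_f = F.normalize(model.encode_text(text), dim=-1)
    base = (F.normalize(model.encode_image(image), dim=-1) @ txt_f.t()).item()
    size = image.shape[-1]
    sal = torch.zeros(size // patch, size // patch)
    for i in range(0, size, patch):
        for j in range(0, size, patch):
            occluded = image.clone()
            occluded[..., i:i + patch, j:j + patch] = 0              # occlude one patch
            f = F.normalize(model.encode_image(occluded), dim=-1)
            sal[i // patch, j // patch] = base - (f @ txt_f.t()).item()
    return sal  # larger values = stronger contribution to the text match
```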

Comparison of BiomedCLIP and CLIP saliency maps across modalities
Top: Input images. Middle: BiomedCLIP saliency maps. Bottom: CLIP saliency maps. Across all modalities, BiomedCLIP demonstrates improved localization and clinical relevance compared to standard CLIP.

Colab Demo

Interactive notebook demo:

Open in Colab

BibTeX

@article{koleilat2025medclipsamv2,
  title={{MedCLIP-SAMv2}: Towards universal text-driven medical image segmentation},
  author={Koleilat, Taha and Asgariandehkordi, Hojat and Rivaz, Hassan and Xiao, Yiming},
  journal={Medical Image Analysis},
  pages={103749},
  year={2025},
  publisher={Elsevier}
}

@inproceedings{koleilat2024medclip,
  title={MedCLIP-SAM: Bridging text and image towards universal medical image segmentation},
  author={Koleilat, Taha and Asgariandehkordi, Hojat and Rivaz, Hassan and Xiao, Yiming},
  booktitle={International Conference on Medical Image Computing and Computer-Assisted Intervention},
  pages={643--653},
  year={2024},
  organization={Springer}
}