Published in Medical Image Analysis (2025).
Segmentation of anatomical structures and pathologies in medical images is essential for modern disease diagnosis, clinical research, and treatment planning. While deep learning-based segmentation techniques have advanced significantly, many methods still fall short in data efficiency, generalizability, and interactivity. Recently, foundation models such as CLIP and the Segment Anything Model (SAM) have paved the way for interactive, universal image segmentation. In this work, we introduce MedCLIP-SAMv2, a framework that integrates BiomedCLIP and SAM to perform text-driven medical image segmentation in zero-shot and weakly supervised settings. The approach fine-tunes BiomedCLIP with a new Decoupled Hard Negative Noise Contrastive Estimation (DHN-NCE) loss and leverages the Multi-modal Information Bottleneck (M2IB) to create visual prompts for SAM; we also explore uncertainty-aware refinement via checkpoint ensembling.
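The DHN-NCE loss combines decoupled contrastive learning (excluding the positive pair from the denominator, as in DCL) with similarity-based hard-negative weighting (as in HN-NCE). The following PyTorch sketch illustrates that combination only; the function name, `temperature`, `beta1`, and `beta2` values are illustrative placeholders, not the paper's exact formulation or hyperparameters:

```python
import torch
import torch.nn.functional as F

def dhn_nce_loss(img_emb, txt_emb, temperature=0.07, beta1=0.15, beta2=0.15):
    """Sketch of a decoupled, hard-negative-weighted InfoNCE loss.

    Assumed behavior: positives are dropped from the denominator (DCL-style
    debiasing) and negatives are re-weighted by their similarity to the
    anchor (HN-NCE-style hard-negative mining).
    """
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature            # (B, B) similarity matrix
    B = logits.size(0)
    eye = torch.eye(B, dtype=torch.bool, device=logits.device)

    def one_way(sim, beta):
        pos = sim.diagonal()                        # matched image-text pairs
        neg = sim.masked_fill(eye, float('-inf'))   # mask out positives
        # hard-negative weights: sharper softmax over negatives, rescaled
        w = F.softmax(beta * neg, dim=1) * (B - 1)
        denom = (w.detach() * neg.exp()).sum(dim=1) # decoupled: no positive term
        return (-pos + torch.log(denom)).mean()

    # symmetric image-to-text and text-to-image objectives
    return 0.5 * (one_way(logits, beta1) + one_way(logits.t(), beta2))
```

With `beta` near zero, all negatives are weighted uniformly and the loss reduces to a decoupled InfoNCE; larger `beta` concentrates the penalty on the hardest negatives.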
We evaluate on four tasks spanning different modalities: breast tumor ultrasound, brain tumor MRI, lung X-ray, and lung CT. The main comparison and ablation tables are shown below.
| Setting | Method | Breast US | | Brain MRI | | Lung X-ray | | Lung CT | | All | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | | DSC↑ | NSD↑ | DSC↑ | NSD↑ | DSC↑ | NSD↑ | DSC↑ | NSD↑ | DSC↑ | NSD↑ |
| Zero-shot | SaLIP | 44.33±10.12 | 48.62±10.25 | 47.96±9.14 | 50.24±9.26 | 63.14±11.34 | 66.44±11.58 | 76.32±11.22 | 78.46±11.35 | 57.94±10.49 | 60.94±10.65 |
| Zero-shot | SAMAug | 56.39±10.85 | 59.23±10.92 | 45.71±10.34 | 48.81±11.29 | 57.18±12.12 | 60.08±12.34 | 44.61±10.42 | 46.48±10.57 | 50.97±10.96 | 53.65±11.30 |
| Zero-shot | MedCLIP-SAM | 67.82±8.26 | 69.12±9.12 | 66.72±5.27 | 68.01±6.16 | 64.49±9.09 | 65.89±10.44 | 59.14±9.52 | 60.47±9.98 | 64.54±8.20 | 66.10±9.08 |
| Zero-shot | MedCLIP-SAMv2 (Ours) | 77.76±9.52 | 81.11±9.89 | 76.52±7.06 | 82.23±7.13 | 75.79±3.44 | 80.88±3.52 | 80.38±5.81 | 82.03±5.94 | 77.61±6.82 | 81.56±7.00 |
| Weakly supervised | nnUNet (pseudo-labels) | 73.77±14.48 | 79.71±14.79 | 77.16±12.17 | 85.21±12.60 | 70.15±6.40 | 74.10±6.59 | 82.24±5.12 | 85.65±4.70 | 75.83±10.31 | 81.17±10.52 |
| Weakly supervised | MedCLIP-SAM | 58.62±5.66 | 60.94±5.87 | 58.80±8.63 | 61.77±8.64 | 86.07±8.61 | 88.65±8.09 | 80.12±8.38 | 83.73±8.29 | 70.90±7.92 | 73.77±7.80 |
| Weakly supervised | MedCLIP-SAMv2 (Ours) | 78.87±12.29 | 84.58±12.19 | 80.03±9.91 | 88.25±10.04 | 80.77±4.44 | 84.53±4.51 | 88.78±4.43 | 91.95±4.06 | 82.11±8.49 | 87.33±8.46 |
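In the weakly supervised setting, zero-shot masks serve as pseudo-labels for training nnUNet, and checkpoint ensembling provides uncertainty-aware refinement. A minimal NumPy sketch of one plausible ensembling scheme; the function name, entropy-based uncertainty measure, and threshold `tau` are illustrative assumptions, not the repo's API:

```python
import numpy as np

def ensemble_pseudolabel(prob_maps, tau=0.5):
    """Fuse per-checkpoint foreground probability maps into a pseudo-label.

    prob_maps: list of (H, W) arrays of foreground probabilities, one per
    saved checkpoint. Returns a binary pseudo-label (mean probability
    thresholded at the illustrative value `tau`) and a per-pixel binary
    entropy map usable as an uncertainty estimate.
    """
    p = np.mean(np.stack(prob_maps), axis=0)     # average across checkpoints
    eps = 1e-7                                   # avoid log(0)
    entropy = -(p * np.log(p + eps) + (1.0 - p) * np.log(1.0 - p + eps))
    return (p > tau).astype(np.uint8), entropy
```

Pixels where checkpoints disagree get mean probabilities near 0.5 and hence high entropy, so the uncertainty map can be used to down-weight or exclude unreliable pseudo-label regions.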
| Model | Version / Loss | Image → Text (%) | | Text → Image (%) | |
|---|---|---|---|---|---|
| | | Top-1 | Top-2 | Top-1 | Top-2 |
| CLIP | Pre-trained | 26.68±0.30 | 41.80±0.19 | 26.17±0.20 | 41.13±0.20 |
| PMC-CLIP | Pre-trained | 75.47±0.37 | 87.46±0.11 | 76.78±0.11 | 88.35±0.19 |
| BiomedCLIP | Pre-trained | 81.83±0.20 | 92.79±0.13 | 81.36±0.48 | 92.27±0.14 |
| BiomedCLIP | InfoNCE | 84.21±0.35 | 94.47±0.19 | 85.73±0.19 | 94.99±0.16 |
| BiomedCLIP | DCL | 84.44±0.37 | 94.68±0.19 | 85.89±0.16 | 95.09±0.19 |
| BiomedCLIP | HN-NCE | 84.33±0.35 | 94.60±0.19 | 85.80±0.17 | 95.10±0.19 |
| BiomedCLIP | DHN-NCE (Ours) | 84.70±0.33 | 94.73±0.16 | 85.99±0.19 | 95.17±0.19 |
| Step | Method | DSC↑ | NSD↑ |
|---|---|---|---|
| 1 | Saliency maps | 46.23±8.58 | 50.50±8.86 |
| 2 | + DHN-NCE fine-tuning | 49.10±8.46 | 53.54±8.62 |
| 3 | + Post-processing | 51.62±7.57 | 55.23±7.47 |
| 4 | + Connected component analysis | 57.89±7.87 | 61.54±8.02 |
| 5 | + SAM | 77.61±6.82 | 81.56±7.00 |
| 6 | + nnUNet ensemble | 82.11±8.49 | 87.33±8.46 |
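In steps 4 and 5 of the ablation above, connected-component analysis converts the post-processed saliency mask into bounding-box prompts for SAM. A self-contained SciPy sketch of that conversion; the function name and the `min_area` filter value are illustrative, not the repo's implementation:

```python
import numpy as np
from scipy import ndimage

def mask_to_box_prompts(binary_mask, min_area=20):
    """Turn a thresholded saliency mask into SAM box prompts.

    Each connected component becomes one (x0, y0, x1, y1) box; components
    whose bounding-box area falls below `min_area` pixels are discarded as
    likely noise (the cutoff is an illustrative choice).
    """
    labels, _ = ndimage.label(binary_mask)          # label connected components
    boxes = []
    for sl in ndimage.find_objects(labels):         # bounding slice per component
        ys, xs = sl
        if (ys.stop - ys.start) * (xs.stop - xs.start) < min_area:
            continue
        boxes.append((xs.start, ys.start, xs.stop, ys.stop))
    return boxes
```

The resulting boxes can be passed directly as box prompts to a SAM predictor, one mask per component, which is what makes the step-4-to-step-5 jump in the table possible without any manual interaction.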
We present qualitative segmentation results produced by MedCLIP-SAMv2 across diverse imaging modalities, including brain MRI, breast ultrasound, chest X-ray, and chest CT. The results demonstrate that text-driven prompts, combined with BiomedCLIP-based saliency and SAM-based refinement, yield accurate and coherent segmentations that align well with anatomical structures and pathological regions, even in the absence of pixel-level supervision.
We compare text-driven saliency maps generated using BiomedCLIP and standard CLIP across four representative medical imaging modalities: brain MRI, breast ultrasound, chest X-ray, and chest CT. BiomedCLIP consistently produces more localized and anatomically meaningful responses aligned with clinically relevant regions, while standard CLIP often exhibits diffuse or semantically ambiguous activations.
Interactive notebook demo:
```bibtex
@article{koleilat2025medclipsamv2,
  title={{MedCLIP-SAMv2}: Towards universal text-driven medical image segmentation},
  author={Koleilat, Taha and Asgariandehkordi, Hojat and Rivaz, Hassan and Xiao, Yiming},
  journal={Medical Image Analysis},
  pages={103749},
  year={2025},
  publisher={Elsevier}
}

@inproceedings{koleilat2024medclip,
  title={{MedCLIP-SAM}: Bridging text and image towards universal medical image segmentation},
  author={Koleilat, Taha and Asgariandehkordi, Hojat and Rivaz, Hassan and Xiao, Yiming},
  booktitle={International Conference on Medical Image Computing and Computer-Assisted Intervention},
  pages={643--653},
  year={2024},
  organization={Springer}
}
```