Published at CVPR 2026.
Medical image segmentation remains challenging due to limited annotations for training, ambiguous anatomical features, and domain shifts. While vision–language models such as CLIP offer strong cross-modal representations, their potential for dense, text-guided medical image segmentation remains underexplored. We present MedCLIPSeg, a novel framework that adapts CLIP for robust, data-efficient, and uncertainty-aware medical image segmentation. Our approach leverages patch-level CLIP embeddings through probabilistic cross-modal attention, enabling bidirectional interaction between image and text tokens and explicit modeling of predictive uncertainty. Together with a soft patch-level contrastive loss that encourages nuanced semantic learning across diverse textual prompts, MedCLIPSeg improves data efficiency and domain generalizability. Extensive experiments across 16 datasets, spanning five imaging modalities and six organs, demonstrate that MedCLIPSeg outperforms prior methods in accuracy, efficiency, and robustness, while providing interpretable uncertainty maps that highlight the local reliability of segmentation results.
Overall architecture of MedCLIPSeg integrating probabilistic vision–language fusion into a CLIP-based segmentation pipeline.
Probabilistic Vision–Language (PVL) adapters for confidence-weighted, bidirectional cross-modal interaction.
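As a rough illustration of confidence-weighted cross-modal attention (not the paper's exact formulation), one can predict a variance alongside each token embedding and suppress attention logits from high-variance, low-confidence tokens. Everything below is a sketch under that assumption; the function name, the exponential confidence mapping, and the placement of the β weight (taken from the ablation value β = 2.35) are illustrative choices, not MedCLIPSeg's actual code.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def confidence_weighted_cross_attention(img_tokens, txt_tokens, txt_logvar, beta=2.35):
    """Illustrative probabilistic cross-attention: image patches attend to
    text tokens, and logits for uncertain text tokens (high predicted
    log-variance) are down-weighted before the softmax."""
    d = img_tokens.shape[-1]
    logits = img_tokens @ txt_tokens.T / np.sqrt(d)        # (P, T) similarities
    confidence = np.exp(-beta * txt_logvar)                # (T,) in (0, 1]
    weights = softmax(logits + np.log(confidence + 1e-8))  # suppress uncertain tokens
    return weights @ txt_tokens                            # (P, d) fused features
```

The symmetric direction (text attending to confidence-weighted image patches) would follow the same pattern, giving the bidirectional interaction described above.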
We evaluate (i) data efficiency by training with 10% / 25% / 50% / 100% of the available data, and (ii) domain generalization by training on an in-distribution source dataset and testing on unseen target datasets without adaptation. Metrics: Dice Similarity Coefficient (DSC) and Normalized Surface Distance (NSD).
| Method | 10% DSC ↑ | 10% NSD ↑ | 25% DSC ↑ | 25% NSD ↑ | 50% DSC ↑ | 50% NSD ↑ | 100% DSC ↑ | 100% NSD ↑ |
|---|---|---|---|---|---|---|---|---|
| **Unimodal Approaches** | | | | | | | | |
| UNet | 60.95 | 64.43 | 62.74 | 66.16 | 71.61 | 75.14 | 78.49 | 82.07 |
| UNet++ | 63.72 | 67.08 | 65.86 | 69.21 | 73.15 | 76.31 | 78.44 | 81.79 |
| DeepLabv3 | 61.32 | 64.84 | 65.39 | 69.10 | 68.58 | 72.57 | 73.28 | 77.42 |
| Attention U-Net | 62.78 | 66.25 | 64.97 | 68.53 | 71.34 | 74.96 | 76.30 | 79.77 |
| nnU-Net | 73.45 | 77.37 | 76.73 | 80.66 | 78.86 | 82.68 | 81.40 | 85.08 |
| Swin-UNet | 53.04 | 57.91 | 54.69 | 59.24 | 55.89 | 61.25 | 65.03 | 69.32 |
| TransUNet | 52.69 | 56.38 | 55.25 | 58.95 | 55.22 | 59.30 | 67.22 | 71.15 |
| **Generic Text-driven Approaches** | | | | | | | | |
| LViT | 66.51 | 68.80 | 75.66 | 78.12 | 78.88 | 81.34 | 83.35 | 85.89 |
| Ariadne’s Thread | 61.34 | 62.75 | 63.09 | 64.51 | 65.65 | 66.92 | 70.07 | 71.49 |
| **CLIP-based Approaches** | | | | | | | | |
| EoMT-CLIP | 74.07 | 77.41 | 76.29 | 79.84 | 79.19 | 82.78 | 82.93 | 86.35 |
| CLIPSeg | 74.66 | 77.75 | 78.31 | 81.34 | 79.63 | 82.58 | 84.87 | 87.74 |
| DenseCLIP | 67.84 | 70.33 | 70.23 | 72.70 | 72.09 | 74.45 | 74.19 | 76.89 |
| ZegCLIP | 61.25 | 63.72 | 72.46 | 75.01 | 76.21 | 78.80 | 78.98 | 81.69 |
| SAN | 74.13 | 76.97 | 76.13 | 78.91 | 78.80 | 81.52 | 81.62 | 84.35 |
| MaPLe | 66.27 | 68.75 | 71.53 | 73.95 | 74.60 | 77.12 | 74.60 | 77.10 |
| MaPLe + Decoder | 74.81 | 77.90 | 79.64 | 82.60 | 82.81 | 85.80 | 84.94 | 87.91 |
| VLSM-Adapter | 74.47 | 77.50 | 77.63 | 80.53 | 80.83 | 83.77 | 83.85 | 86.72 |
| CausalCLIPSeg | 71.19 | 73.74 | 75.42 | 78.00 | 78.60 | 81.22 | 81.34 | 84.20 |
| CAT-Seg | 78.76 | 81.50 | 81.12 | 83.92 | 83.32 | 85.61 | 85.90 | 88.31 |
| MedCLIPSeg (Ours) | 81.10 | 83.94 | 85.08 | 87.85 | 87.18 | 89.95 | 88.66 | 91.35 |
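For reference, the DSC reported above reduces to a few lines on binary masks; NSD additionally requires surface-distance machinery, so only DSC is sketched here:

```python
import numpy as np

def dice_score(pred, target, eps=1e-8):
    """Dice Similarity Coefficient for binary masks: 2|A∩B| / (|A| + |B|)."""
    pred, target = pred.astype(bool), target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)
```

A perfectly overlapping prediction scores 1.0; a mask sharing one of its two foreground pixels with a one-pixel target scores 2/3.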
Domain generalization on unseen target datasets, grouped by modality: breast ultrasound (BUSI, BUSBRA, BUSUC, BUID, UDIAT), polyp endoscopy (Kvasir-SEG, ColonDB, ClinicDB), brain MRI (BTMRI, BRISC), and skin dermatoscopy (ISIC, UWaterloo).

| Method | BUSI | BUSBRA | BUSUC | BUID | UDIAT | Kvasir-SEG | ColonDB | ClinicDB | BTMRI | BRISC | ISIC | UWaterloo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LViT | 75.32 | 59.41 | 67.95 | 53.51 | 65.60 | 85.29 | 60.01 | 75.27 | 81.41 | 71.86 | 91.21 | 58.87 |
| CLIPSeg | 80.95 | 63.66 | 75.03 | 68.43 | 56.67 | 81.98 | 59.93 | 71.49 | 86.33 | 77.61 | 90.55 | 80.19 |
| DenseCLIP | 71.85 | 53.34 | 70.97 | 63.53 | 54.93 | 79.32 | 56.38 | 68.08 | 70.30 | 34.12 | 89.29 | 53.39 |
| ZegCLIP | 72.08 | 61.08 | 73.57 | 71.75 | 52.41 | 78.46 | 53.46 | 69.75 | 76.65 | 66.31 | 81.45 | 38.60 |
| SAN | 77.99 | 64.37 | 74.15 | 58.13 | 61.98 | 83.16 | 61.82 | 74.46 | 85.27 | 71.60 | 91.39 | 82.51 |
| MaPLe | 66.37 | 50.08 | 71.52 | 70.77 | 57.81 | 76.12 | 48.09 | 59.64 | 75.40 | 45.19 | 88.31 | 69.12 |
| MaPLe + Decoder | 80.49 | 55.89 | 64.96 | 60.66 | 59.44 | 83.46 | 61.53 | 71.20 | 85.08 | 71.46 | 90.10 | 81.83 |
| VLSM-Adapter | 80.90 | 68.48 | 82.37 | 75.26 | 69.16 | 85.89 | 63.51 | 76.09 | 85.03 | 68.92 | 91.30 | 82.17 |
| CausalCLIPSeg | 76.11 | 55.87 | 69.12 | 64.49 | 48.90 | 78.77 | 41.65 | 57.54 | 81.71 | 53.96 | 89.47 | 48.73 |
| CAT-Seg | 81.83 | 70.94 | 81.48 | 73.37 | 70.30 | 86.43 | 68.49 | 70.35 | 84.86 | 76.28 | 91.27 | 82.02 |
| MedCLIPSeg (Ours) | 85.72 | 75.06 | 84.37 | 78.99 | 74.64 | 90.15 | 71.90 | 80.80 | 88.03 | 80.92 | 92.54 | 83.53 |
MedCLIPSeg produces both a segmentation mask and a dense uncertainty map. Uncertainty tends to peak along ambiguous boundaries and challenging regions, and remains consistent across in-distribution and out-of-distribution samples—supporting interpretability and reliability review.
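One common way to turn per-pixel foreground probabilities into a dense uncertainty map is binary entropy, which peaks where the model is least decisive (p ≈ 0.5, typically at ambiguous boundaries); the paper's probabilistic attention may use a different estimator, so treat this as a generic sketch:

```python
import numpy as np

def binary_entropy_map(prob, eps=1e-8):
    """Per-pixel binary entropy (in nats): maximal at prob = 0.5,
    near zero where the model is confidently foreground or background."""
    p = np.clip(prob, eps, 1.0 - eps)
    return -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))
```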
We analyze the contribution of MedCLIPSeg’s components and design choices, including PVL adapters, gating, probabilistic attention, bidirectional interaction, contrastive loss, prompt style, and CLIP backbone selection.
| Method | ID DSC (%) | OOD DSC (%) | HM DSC (%) |
|---|---|---|---|
| MedCLIPSeg (Ours) | 89.11 | 79.02 | 83.76 |
| **Probabilistic Vision–Language Adapters** | | | |
| w/o PVL Adapters | 81.23 (−7.88)↓ | 55.23 (−23.79)↓ | 65.75 (−18.01)↓ |
| w/o Gating | 87.55 (−1.56)↓ | 76.79 (−2.23)↓ | 81.82 (−1.94)↓ |
| w/o AttnPVL | 86.21 (−2.90)↓ | 74.13 (−4.89)↓ | 79.71 (−4.05)↓ |
| Deterministic MedCLIPSeg | 87.68 (−1.43)↓ | 63.12 (−15.90)↓ | 73.40 (−10.36)↓ |
| **Bidirectional Multimodal Interaction** | | | |
| w/o Visual Adaptation | 81.50 (−7.61)↓ | 64.40 (−14.62)↓ | 71.95 (−11.81)↓ |
| w/o Textual Adaptation | 88.83 (−0.28)↓ | 76.40 (−2.62)↓ | 82.15 (−1.61)↓ |
| w/o Bidirectional Interaction | 88.71 (−0.40)↓ | 77.71 (−1.31)↓ | 82.85 (−0.91)↓ |
| Unimodal MedCLIPSeg | 86.53 (−2.58)↓ | 73.49 (−5.53)↓ | 79.48 (−4.28)↓ |
| **Contrastive Loss** | | | |
| w/o SoftCon Loss | 87.24 (−1.87)↓ | 77.08 (−1.94)↓ | 81.84 (−1.92)↓ |
| Hard Targets | 88.34 (−0.77)↓ | 77.64 (−1.38)↓ | 82.65 (−1.11)↓ |
| Attention-pooled SoftCon Loss | 88.73 (−0.38)↓ | 75.60 (−3.42)↓ | 81.64 (−2.12)↓ |
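The hard- vs soft-target distinction in this ablation can be illustrated with a cross-entropy over a patch–text similarity matrix: hard targets are one-hot, while soft targets spread probability mass across semantically related prompts. The target construction below is purely illustrative, not the paper's SoftCon loss:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def contrastive_loss(sim, targets):
    """Cross-entropy between row-softmaxed similarities and target
    distributions: one-hot rows give the standard (hard) InfoNCE-style
    loss, smoothed rows give a soft-target variant."""
    log_probs = np.log(softmax(sim) + 1e-12)
    return -(targets * log_probs).sum(axis=-1).mean()

sim = np.array([[5.0, 0.0], [0.0, 5.0]])   # toy similarity matrix
hard = np.eye(2)                            # one-hot targets
soft = np.array([[0.9, 0.1], [0.1, 0.9]])  # mass spread to related prompts
```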
Layer-wise PVL adapter interventions and confidence-weight (β) ablations, averaged over ID and OOD data. Deeper interventions improve performance up to ~Layer 10, and β = 2.35 yields the best harmonic mean.
| Text Prompt Style | ID DSC (%) | OOD DSC (%) | HM DSC (%) |
|---|---|---|---|
| Contradictory | 68.60 | 63.21 | 65.79 |
| Missing Location | 86.98 | 77.75 | 82.11 |
| Overdescriptive | 82.93 | 74.49 | 78.48 |
| Underdescriptive | 66.91 | 49.38 | 56.82 |
| Original | 89.11 | 79.02 | 83.76 |
| Pre-trained Model | ID DSC (%) | OOD DSC (%) | HM DSC (%) |
|---|---|---|---|
| CLIP | 88.48 | 74.81 | 81.07 |
| PubMedCLIP | 86.67 | 73.05 | 79.28 |
| BiomedCLIP | 88.70 | 77.08 | 82.48 |
| UniMedCLIP | 89.11 | 79.02 | 83.76 |
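The HM DSC column in the ablation tables is the harmonic mean of ID and OOD DSC, which rewards methods that do well on both rather than excelling on one:

```python
def harmonic_mean(id_dsc, ood_dsc):
    """Harmonic mean of in-distribution and out-of-distribution DSC."""
    return 2.0 * id_dsc * ood_dsc / (id_dsc + ood_dsc)

# Reproduces the UniMedCLIP row above:
# harmonic_mean(89.11, 79.02) -> 83.76 (to two decimals)
```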
@inproceedings{koleilat2026medclipseg,
title = {MedCLIPSeg: Probabilistic Vision--Language Adaptation for Data-Efficient and Generalizable Medical Image Segmentation},
author = {Koleilat, Taha and Asgariandehkordi, Hojat and Nejati Manzari, Omid and Barile, Berardino and Xiao, Yiming and Rivaz, Hassan},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2026}
}