MedCLIPSeg: Probabilistic Vision–Language Adaptation for Data-Efficient and Generalizable Medical Image Segmentation

Taha Koleilat · Hojat Asgariandehkordi · Omid Nejati Manzari · Berardino Barile · Yiming Xiao · Hassan Rivaz

Published at CVPR 2026.

MedCLIPSeg overview figure

Probabilistic cross-modal fusion for CLIP-based medical image segmentation, modeling visual–text representations as distributions to improve robustness and uncertainty calibration across in-distribution and out-of-distribution data.

Abstract

Medical image segmentation remains challenging due to limited annotations for training, ambiguous anatomical features, and domain shifts. While vision–language models such as CLIP offer strong cross-modal representations, their potential for dense, text-guided medical image segmentation remains underexplored. We present MedCLIPSeg, a novel framework that adapts CLIP for robust, data-efficient, and uncertainty-aware medical image segmentation. Our approach leverages patch-level CLIP embeddings through probabilistic cross-modal attention, enabling bidirectional interaction between image and text tokens and explicit modeling of predictive uncertainty. Together with a soft patch-level contrastive loss that encourages nuanced semantic learning across diverse textual prompts, MedCLIPSeg improves data efficiency and domain generalizability. Extensive experiments across 16 datasets, spanning five imaging modalities and six organs, demonstrate that MedCLIPSeg outperforms prior methods in accuracy, efficiency, and robustness, while providing interpretable uncertainty maps that highlight the local reliability of segmentation results.

Method

  • Bidirectional Vision–Language Fusion: Representation-level adapters enable two-way interaction between image patches and text tokens while keeping CLIP encoders frozen.
  • Probabilistic Cross-Modal Attention (PVL): Variational Key and Value distributions model uncertainty; confidence-weighted attention downweights unreliable tokens.
  • Pixel-Level Uncertainty Estimation: Monte Carlo sampling of Value distributions yields mean masks and entropy-based uncertainty maps.
  • Soft Patch-Level Contrastive Loss: Encourages nuanced alignment across diverse prompts and improves generalization under limited supervision.
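The confidence-weighted probabilistic attention described above can be sketched roughly as follows. This is an illustrative NumPy sketch, not the paper's exact variational parameterization: the Gaussian key/value tokens, the mean-variance penalty, and the way `beta` enters the logits are all assumptions made for exposition (the default `beta = 2.35` echoes the value reported in the ablations below).

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def probabilistic_cross_attention(q, k_mu, k_logvar, v_mu, v_logvar, beta=2.35):
    """Confidence-weighted cross-modal attention (illustrative sketch).

    Keys/values are Gaussians N(mu, diag(exp(logvar))); tokens with high
    key variance have their attention logits penalized by `beta`.
    """
    d = q.shape[-1]
    scores = q @ k_mu.T / np.sqrt(d)               # (Nq, Nk) mean-based scores
    uncertainty = np.exp(k_logvar).mean(axis=-1)   # per-token key variance
    scores = scores - beta * uncertainty[None, :]  # downweight unreliable tokens
    attn = softmax(scores, axis=-1)
    # one reparameterized sample of the values: v = mu + sigma * eps
    v = v_mu + np.exp(0.5 * v_logvar) * rng.standard_normal(v_mu.shape)
    return attn @ v, attn

# Two text tokens with identical key means but different key variance:
q = np.ones((1, 8))                                       # one image patch query
k_mu = np.ones((2, 8))
k_logvar = np.stack([np.full(8, -4.0), np.full(8, 2.0)])  # certain vs. uncertain
v_mu, v_logvar = rng.standard_normal((2, 8)), np.full((2, 8), -4.0)
out, attn = probabilistic_cross_attention(q, k_mu, k_logvar, v_mu, v_logvar)
```

With equal key means, the high-variance token receives less attention, which is the behavior the confidence weighting is meant to induce.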
Overall architecture of MedCLIPSeg

Overall architecture of MedCLIPSeg integrating probabilistic vision–language fusion into a CLIP-based segmentation pipeline.

PVL adapter schematic

Probabilistic Vision–Language (PVL) adapters for confidence-weighted, bidirectional cross-modal interaction.
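The soft patch-level contrastive loss from the Method section can be sketched as a soft cross-entropy over patch-token similarities. This is a generic illustration under assumed conventions (cosine similarity, temperature `tau`, and `soft_targets` rows summing to one), not the paper's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def soft_patch_contrastive_loss(img_patches, txt_tokens, soft_targets, tau=0.07):
    """Soft contrastive loss over patch-token similarities (illustrative).

    Instead of one-hot positives, each patch matches text tokens with the
    soft weights in `soft_targets`, e.g. reflecting partial overlap between
    a patch and the prompted structure.
    """
    # cosine similarities between L2-normalized patch and token embeddings
    img = img_patches / np.linalg.norm(img_patches, axis=-1, keepdims=True)
    txt = txt_tokens / np.linalg.norm(txt_tokens, axis=-1, keepdims=True)
    logp = np.log(softmax(img @ txt.T / tau, axis=-1) + 1e-12)
    return -(soft_targets * logp).sum(axis=-1).mean()  # soft cross-entropy

patches = np.eye(3)               # toy patch embeddings
tokens = np.eye(3)                # toy text-token embeddings, aligned with patches
aligned = np.eye(3)               # targets: patch i matches token i
uniform = np.full((3, 3), 1 / 3)  # uninformative targets
loss_aligned = soft_patch_contrastive_loss(patches, tokens, aligned)
loss_uniform = soft_patch_contrastive_loss(patches, tokens, uniform)
```

Targets that agree with the actual similarity structure give a much lower loss than uninformative ones, which is the alignment pressure the loss exerts during training.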

Results

We evaluate (i) data efficiency by training with 10% / 25% / 50% / 100% of the available data, and (ii) domain generalization by training on an in-distribution source dataset and testing on unseen target datasets without adaptation. Metrics: Dice similarity coefficient (DSC) and normalized surface distance (NSD); higher is better for both.
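For reference, DSC measures the overlap between predicted and ground-truth masks: DSC = 2|A ∩ B| / (|A| + |B|). A minimal sketch on toy binary masks (the `eps` smoothing term is a common convention, not something specified here):

```python
import numpy as np

def dice_score(pred, target, eps=1e-7):
    """Dice similarity coefficient (DSC) between two binary masks."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

pred = np.zeros((4, 4), dtype=int); pred[1:3, 1:3] = 1  # 4 foreground pixels
gt   = np.zeros((4, 4), dtype=int); gt[1:3, 1:4] = 1    # 6 foreground pixels
print(round(dice_score(pred, gt), 3))  # overlap = 4 -> 2*4/(4+6) = 0.8
```

NSD instead scores boundary agreement within a tolerance band around the ground-truth surface, so it is more sensitive to contour quality than region overlap.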

Data-Efficiency Evaluation

| Method | DSC ↑ (10%) | NSD ↑ (10%) | DSC ↑ (25%) | NSD ↑ (25%) | DSC ↑ (50%) | NSD ↑ (50%) | DSC ↑ (100%) | NSD ↑ (100%) |
|---|---|---|---|---|---|---|---|---|
| **Unimodal Approaches** | | | | | | | | |
| UNet | 60.95 | 64.43 | 62.74 | 66.16 | 71.61 | 75.14 | 78.49 | 82.07 |
| UNet++ | 63.72 | 67.08 | 65.86 | 69.21 | 73.15 | 76.31 | 78.44 | 81.79 |
| DeepLabv3 | 61.32 | 64.84 | 65.39 | 69.10 | 68.58 | 72.57 | 73.28 | 77.42 |
| Attention U-Net | 62.78 | 66.25 | 64.97 | 68.53 | 71.34 | 74.96 | 76.30 | 79.77 |
| nnU-Net | 73.45 | 77.37 | 76.73 | 80.66 | 78.86 | 82.68 | 81.40 | 85.08 |
| Swin-UNet | 53.04 | 57.91 | 54.69 | 59.24 | 55.89 | 61.25 | 65.03 | 69.32 |
| TransUNet | 52.69 | 56.38 | 55.25 | 58.95 | 55.22 | 59.30 | 67.22 | 71.15 |
| **Generic Text-driven Approaches** | | | | | | | | |
| LViT | 66.51 | 68.80 | 75.66 | 78.12 | 78.88 | 81.34 | 83.35 | 85.89 |
| Ariadne’s Thread | 61.34 | 62.75 | 63.09 | 64.51 | 65.65 | 66.92 | 70.07 | 71.49 |
| **CLIP-based Approaches** | | | | | | | | |
| EoMT-CLIP | 74.07 | 77.41 | 76.29 | 79.84 | 79.19 | 82.78 | 82.93 | 86.35 |
| CLIPSeg | 74.66 | 77.75 | 78.31 | 81.34 | 79.63 | 82.58 | 84.87 | 87.74 |
| DenseCLIP | 67.84 | 70.33 | 70.23 | 72.70 | 72.09 | 74.45 | 74.19 | 76.89 |
| ZegCLIP | 61.25 | 63.72 | 72.46 | 75.01 | 76.21 | 78.80 | 78.98 | 81.69 |
| SAN | 74.13 | 76.97 | 76.13 | 78.91 | 78.80 | 81.52 | 81.62 | 84.35 |
| MaPLe | 66.27 | 68.75 | 71.53 | 73.95 | 74.60 | 77.12 | 74.60 | 77.10 |
| MaPLe + Decoder | 74.81 | 77.90 | 79.64 | 82.60 | 82.81 | 85.80 | 84.94 | 87.91 |
| VLSM-Adapter | 74.47 | 77.50 | 77.63 | 80.53 | 80.83 | 83.77 | 83.85 | 86.72 |
| CausalCLIPSeg | 71.19 | 73.74 | 75.42 | 78.00 | 78.60 | 81.22 | 81.34 | 84.20 |
| CAT-Seg | 78.76 | 81.50 | 81.12 | 83.92 | 83.32 | 85.61 | 85.90 | 88.31 |
| **MedCLIPSeg (Ours)** | **81.10** | **83.94** | **85.08** | **87.85** | **87.18** | **89.95** | **88.66** | **91.35** |

Domain Generalization (DSC %)

Datasets are grouped by modality: breast ultrasound (BUSI, BUSBRA, BUSUC, BUID, UDIAT), polyp endoscopy (Kvasir-SEG, ColonDB, ClinicDB), brain MRI (BTMRI, BRISC), and skin dermatoscopy (ISIC, UWaterloo).

| Method | BUSI | BUSBRA | BUSUC | BUID | UDIAT | Kvasir-SEG | ColonDB | ClinicDB | BTMRI | BRISC | ISIC | UWaterloo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LViT | 75.32 | 59.41 | 67.95 | 53.51 | 65.60 | 85.29 | 60.01 | 75.27 | 81.41 | 71.86 | 91.21 | 58.87 |
| CLIPSeg | 80.95 | 63.66 | 75.03 | 68.43 | 56.67 | 81.98 | 59.93 | 71.49 | 86.33 | 77.61 | 90.55 | 80.19 |
| DenseCLIP | 71.85 | 53.34 | 70.97 | 63.53 | 54.93 | 79.32 | 56.38 | 68.08 | 70.30 | 34.12 | 89.29 | 53.39 |
| ZegCLIP | 72.08 | 61.08 | 73.57 | 71.75 | 52.41 | 78.46 | 53.46 | 69.75 | 76.65 | 66.31 | 81.45 | 38.60 |
| SAN | 77.99 | 64.37 | 74.15 | 58.13 | 61.98 | 83.16 | 61.82 | 74.46 | 85.27 | 71.60 | 91.39 | 82.51 |
| MaPLe | 66.37 | 50.08 | 71.52 | 70.77 | 57.81 | 76.12 | 48.09 | 59.64 | 75.40 | 45.19 | 88.31 | 69.12 |
| MaPLe + Decoder | 80.49 | 55.89 | 64.96 | 60.66 | 59.44 | 83.46 | 61.53 | 71.20 | 85.08 | 71.46 | 90.10 | 81.83 |
| VLSM-Adapter | 80.90 | 68.48 | 82.37 | 75.26 | 69.16 | 85.89 | 63.51 | 76.09 | 85.03 | 68.92 | 91.30 | 82.17 |
| CausalCLIPSeg | 76.11 | 55.87 | 69.12 | 64.49 | 48.90 | 78.77 | 41.65 | 57.54 | 81.71 | 53.96 | 89.47 | 48.73 |
| CAT-Seg | 81.83 | 70.94 | 81.48 | 73.37 | 70.30 | 86.43 | 68.49 | 70.35 | 84.86 | 76.28 | 91.27 | 82.02 |
| **MedCLIPSeg (Ours)** | **85.72** | **75.06** | **84.37** | **78.99** | **74.64** | **90.15** | **71.90** | **80.80** | **88.03** | **80.92** | **92.54** | **83.53** |

Segmentation & Uncertainty Visualization

MedCLIPSeg produces both a segmentation mask and a dense uncertainty map. Uncertainty tends to peak along ambiguous boundaries and challenging regions, and remains consistent across in-distribution and out-of-distribution samples, supporting interpretability and review of prediction reliability.

Segmentation and uncertainty visualization
Example predictions with uncertainty maps. ID datasets are in blue while OOD datasets are in red.
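The entropy-based uncertainty maps described in the Method section can be sketched as follows: average the Monte Carlo foreground-probability samples into a mean mask, then compute the per-pixel binary entropy. A minimal sketch, assuming `(T, H, W)` probability samples (the paper's exact post-processing may differ):

```python
import numpy as np

def mc_uncertainty(prob_samples):
    """Mean mask and entropy map from Monte Carlo probability samples.

    prob_samples: (T, H, W) array of per-sample foreground probabilities.
    Returns the predictive mean map and the binary-entropy uncertainty map.
    """
    p = prob_samples.mean(axis=0)             # predictive mean over T samples
    p = np.clip(p, 1e-7, 1 - 1e-7)            # avoid log(0)
    entropy = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))  # bits, max 1 at p=0.5
    return p, entropy

samples = np.stack([np.array([[0.0, 1.0], [0.5, 1.0]]),
                    np.array([[0.0, 0.0], [0.5, 1.0]])])  # T=2 toy 2x2 maps
mean_mask, unc = mc_uncertainty(samples)
```

Pixels where the samples disagree (mean near 0.5) get entropy near 1 bit, while confidently foreground or background pixels get entropy near 0, which is why uncertainty concentrates along ambiguous boundaries.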

Ablation Studies

We analyze the contribution of MedCLIPSeg’s components and design choices, including PVL adapters, gating, probabilistic attention, bidirectional interaction, contrastive loss, prompt style, and CLIP backbone selection.

Effectiveness of Key Design Components

| Method | ID DSC (%) | OOD DSC (%) | HM DSC (%) |
|---|---|---|---|
| **MedCLIPSeg (Ours)** | **89.11** | **79.02** | **83.76** |
| *Probabilistic Vision–Language Adapters* | | | |
| w/o PVL Adapters | 81.23 (−7.88) | 55.23 (−23.79) | 65.75 (−18.01) |
| w/o Gating | 87.55 (−1.56) | 76.79 (−2.23) | 81.82 (−1.94) |
| w/o AttnPVL | 86.21 (−2.90) | 74.13 (−4.89) | 79.71 (−4.05) |
| Deterministic MedCLIPSeg | 87.68 (−1.43) | 63.12 (−15.90) | 73.40 (−10.36) |
| *Bidirectional Multimodal Interaction* | | | |
| w/o Visual Adaptation | 81.50 (−7.61) | 64.40 (−14.62) | 71.95 (−11.81) |
| w/o Textual Adaptation | 88.83 (−0.28) | 76.40 (−2.62) | 82.15 (−1.61) |
| w/o Bidirectional Interaction | 88.71 (−0.40) | 77.71 (−1.31) | 82.85 (−0.91) |
| Unimodal MedCLIPSeg | 86.53 (−2.58) | 73.49 (−5.53) | 79.48 (−4.28) |
| *Contrastive Loss* | | | |
| w/o SoftCon Loss | 87.24 (−1.87) | 77.08 (−1.94) | 81.84 (−1.92) |
| Hard Targets | 88.34 (−0.77) | 77.64 (−1.38) | 82.65 (−1.11) |
| Attention-pooled SoftCon Loss | 88.73 (−0.38) | 75.60 (−3.42) | 81.64 (−2.12) |
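The HM column in the ablation tables is the harmonic mean of the ID and OOD DSC, which balances in-distribution accuracy against generalization. The top row can be checked directly:

```python
# Harmonic mean (HM) of ID and OOD DSC, as reported in the ablation tables.
def harmonic_mean(a, b):
    return 2 * a * b / (a + b)

print(round(harmonic_mean(89.11, 79.02), 2))  # → 83.76
```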

Layer-wise PVL Intervention and Confidence Weight (β)

Layer-wise interventions and confidence weight (beta) ablations

Layer-wise PVL adapter interventions and confidence-weight (β) ablations, averaged over ID and OOD data. Deeper interventions improve performance up to roughly Layer 10, and β = 2.35 yields the best harmonic mean.

Effect of Text Prompt Design

| Text Prompt Style | ID DSC (%) | OOD DSC (%) | HM DSC (%) |
|---|---|---|---|
| Contradictory | 68.60 | 63.21 | 65.79 |
| Missing Location | 86.98 | 77.75 | 82.11 |
| Overdescriptive | 82.93 | 74.49 | 78.48 |
| Underdescriptive | 66.91 | 49.38 | 56.82 |
| **Original** | **89.11** | **79.02** | **83.76** |

Effect of Pre-trained Vision–Language Models

| Pre-trained Model | ID DSC (%) | OOD DSC (%) | HM DSC (%) |
|---|---|---|---|
| CLIP | 88.48 | 74.81 | 81.07 |
| PubMedCLIP | 86.67 | 73.05 | 79.28 |
| BiomedCLIP | 88.70 | 77.08 | 82.48 |
| **UniMedCLIP** | **89.11** | **79.02** | **83.76** |

BibTeX

@inproceedings{koleilat2026medclipseg,
  title     = {MedCLIPSeg: Probabilistic Vision--Language Adaptation for Data-Efficient and Generalizable Medical Image Segmentation},
  author    = {Koleilat, Taha and Asgariandehkordi, Hojat and Nejati Manzari, Omid and Barile, Berardino and Xiao, Yiming and Rivaz, Hassan},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}