MedCLIPSeg: Probabilistic Vision–Language Adaptation for Data-Efficient and Generalizable Medical Image Segmentation

Taha Koleilat · Hojat Asgariandehkordi · Omid Nejati Manzari · Berardino Barile · Yiming Xiao · Hassan Rivaz

Published at CVPR 2026.

MedCLIPSeg overview figure

Probabilistic cross-modal fusion for CLIP-based medical image segmentation, modeling visual–text representations as distributions to improve robustness and uncertainty calibration across in-distribution and out-of-distribution data.

Abstract

Medical image segmentation remains challenging due to limited annotations for training, ambiguous anatomical features, and domain shifts. While vision–language models such as CLIP offer strong cross-modal representations, their potential for dense, text-guided medical image segmentation remains underexplored. We present MedCLIPSeg, a novel framework that adapts CLIP for robust, data-efficient, and uncertainty-aware medical image segmentation. Our approach leverages patch-level CLIP embeddings through probabilistic cross-modal attention, enabling bidirectional interaction between image and text tokens and explicit modeling of predictive uncertainty. Together with a soft patch-level contrastive loss that encourages nuanced semantic learning across diverse textual prompts, MedCLIPSeg improves data efficiency and domain generalizability. Extensive experiments across 16 datasets, spanning five imaging modalities and six organs, demonstrate that MedCLIPSeg outperforms prior methods in accuracy, efficiency, and robustness, while providing interpretable uncertainty maps that highlight the local reliability of segmentation results.

Method

  • Bidirectional Vision–Language Fusion: Representation-level adapters enable two-way interaction between image patches and text tokens while keeping CLIP encoders frozen.
  • Probabilistic Cross-Modal Attention (PVL): Variational Key and Value distributions model uncertainty; confidence-weighted attention downweights unreliable tokens.
  • Pixel-Level Uncertainty Estimation: Monte Carlo sampling of Value distributions yields mean masks and entropy-based uncertainty maps.
  • Soft Patch-Level Contrastive Loss: Encourages nuanced alignment across diverse prompts and improves generalization under limited supervision.
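The confidence-weighted probabilistic attention described above can be sketched roughly as follows. This is an illustrative NumPy sketch, not the paper's exact variational parameterization: the Gaussian key/value tokens, the mean-variance penalty, and the way `beta` enters the logits are all assumptions made for exposition (the default `beta = 2.35` echoes the value reported in the ablations below).

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def probabilistic_cross_attention(q, k_mu, k_logvar, v_mu, v_logvar, beta=2.35):
    """Confidence-weighted cross-modal attention (illustrative sketch).

    Keys/values are Gaussians N(mu, diag(exp(logvar))); tokens with high
    key variance have their attention logits penalized by `beta`.
    """
    d = q.shape[-1]
    scores = q @ k_mu.T / np.sqrt(d)               # (Nq, Nk) mean-based scores
    uncertainty = np.exp(k_logvar).mean(axis=-1)   # per-token key variance
    scores = scores - beta * uncertainty[None, :]  # downweight unreliable tokens
    attn = softmax(scores, axis=-1)
    # one reparameterized sample of the values: v = mu + sigma * eps
    v = v_mu + np.exp(0.5 * v_logvar) * rng.standard_normal(v_mu.shape)
    return attn @ v, attn

# Two text tokens with identical key means but different key variance:
q = np.ones((1, 8))                                       # one image patch query
k_mu = np.ones((2, 8))
k_logvar = np.stack([np.full(8, -4.0), np.full(8, 2.0)])  # certain vs. uncertain
v_mu, v_logvar = rng.standard_normal((2, 8)), np.full((2, 8), -4.0)
out, attn = probabilistic_cross_attention(q, k_mu, k_logvar, v_mu, v_logvar)
```

With equal key means, the high-variance token receives less attention, which is the behavior the confidence weighting is meant to induce.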
Overall architecture of MedCLIPSeg

Overall architecture of MedCLIPSeg integrating probabilistic vision–language fusion into a CLIP-based segmentation pipeline.

PVL adapter schematic

Probabilistic Vision–Language (PVL) adapters for confidence-weighted, bidirectional cross-modal interaction.
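The soft patch-level contrastive loss from the Method section can be sketched as a soft cross-entropy over patch-token similarities. This is a generic illustration under assumed conventions (cosine similarity, temperature `tau`, and `soft_targets` rows summing to one), not the paper's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def soft_patch_contrastive_loss(img_patches, txt_tokens, soft_targets, tau=0.07):
    """Soft contrastive loss over patch-token similarities (illustrative).

    Instead of one-hot positives, each patch matches text tokens with the
    soft weights in `soft_targets`, e.g. reflecting partial overlap between
    a patch and the prompted structure.
    """
    # cosine similarities between L2-normalized patch and token embeddings
    img = img_patches / np.linalg.norm(img_patches, axis=-1, keepdims=True)
    txt = txt_tokens / np.linalg.norm(txt_tokens, axis=-1, keepdims=True)
    logp = np.log(softmax(img @ txt.T / tau, axis=-1) + 1e-12)
    return -(soft_targets * logp).sum(axis=-1).mean()  # soft cross-entropy

patches = np.eye(3)               # toy patch embeddings
tokens = np.eye(3)                # toy text-token embeddings, aligned with patches
aligned = np.eye(3)               # targets: patch i matches token i
uniform = np.full((3, 3), 1 / 3)  # uninformative targets
loss_aligned = soft_patch_contrastive_loss(patches, tokens, aligned)
loss_uniform = soft_patch_contrastive_loss(patches, tokens, uniform)
```

Targets that agree with the actual similarity structure give a much lower loss than uninformative ones, which is the alignment pressure the loss exerts during training.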

Results

We evaluate (i) data efficiency by training with 10% / 25% / 50% / 100% of the available data, and (ii) domain generalization by training on an in-distribution source dataset and testing on unseen target datasets without adaptation. Metrics: Dice similarity coefficient (DSC) and normalized surface distance (NSD); higher is better for both.
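For reference, DSC measures the overlap between predicted and ground-truth masks: DSC = 2|A ∩ B| / (|A| + |B|). A minimal sketch on toy binary masks (the `eps` smoothing term is a common convention, not something specified here):

```python
import numpy as np

def dice_score(pred, target, eps=1e-7):
    """Dice similarity coefficient (DSC) between two binary masks."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

pred = np.zeros((4, 4), dtype=int); pred[1:3, 1:3] = 1  # 4 foreground pixels
gt   = np.zeros((4, 4), dtype=int); gt[1:3, 1:4] = 1    # 6 foreground pixels
print(round(dice_score(pred, gt), 3))  # overlap = 4 -> 2*4/(4+6) = 0.8
```

NSD instead scores boundary agreement within a tolerance band around the ground-truth surface, so it is more sensitive to contour quality than region overlap.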

Data-Efficiency Evaluation

| Method | DSC ↑ (10%) | NSD ↑ (10%) | DSC ↑ (25%) | NSD ↑ (25%) | DSC ↑ (50%) | NSD ↑ (50%) | DSC ↑ (100%) | NSD ↑ (100%) |
|---|---|---|---|---|---|---|---|---|
| **Unimodal Approaches** | | | | | | | | |
| UNet | 60.95 | 64.43 | 62.74 | 66.16 | 71.61 | 75.14 | 78.49 | 82.07 |
| UNet++ | 63.72 | 67.08 | 65.86 | 69.21 | 73.15 | 76.31 | 78.44 | 81.79 |
| DeepLabv3 | 61.32 | 64.84 | 65.39 | 69.10 | 68.58 | 72.57 | 73.28 | 77.42 |
| Attention U-Net | 62.78 | 66.25 | 64.97 | 68.53 | 71.34 | 74.96 | 76.30 | 79.77 |
| nnU-Net | 73.45 | 77.37 | 76.73 | 80.66 | 78.86 | 82.68 | 81.40 | 85.08 |
| Swin-UNet | 53.04 | 57.91 | 54.69 | 59.24 | 55.89 | 61.25 | 65.03 | 69.32 |
| TransUNet | 52.69 | 56.38 | 55.25 | 58.95 | 55.22 | 59.30 | 67.22 | 71.15 |
| **Generic Text-driven Approaches** | | | | | | | | |
| LViT | 66.51 | 68.80 | 75.66 | 78.12 | 78.88 | 81.34 | 83.35 | 85.89 |
| Ariadne’s Thread | 61.34 | 62.75 | 63.09 | 64.51 | 65.65 | 66.92 | 70.07 | 71.49 |
| **CLIP-based Approaches** | | | | | | | | |
| EoMT-CLIP | 74.07 | 77.41 | 76.29 | 79.84 | 79.19 | 82.78 | 82.93 | 86.35 |
| CLIPSeg | 74.66 | 77.75 | 78.31 | 81.34 | 79.63 | 82.58 | 84.87 | 87.74 |
| DenseCLIP | 67.84 | 70.33 | 70.23 | 72.70 | 72.09 | 74.45 | 74.19 | 76.89 |
| ZegCLIP | 61.25 | 63.72 | 72.46 | 75.01 | 76.21 | 78.80 | 78.98 | 81.69 |
| SAN | 74.13 | 76.97 | 76.13 | 78.91 | 78.80 | 81.52 | 81.62 | 84.35 |
| MaPLe | 66.27 | 68.75 | 71.53 | 73.95 | 74.60 | 77.12 | 74.60 | 77.10 |
| MaPLe + Decoder | 74.81 | 77.90 | 79.64 | 82.60 | 82.81 | 85.80 | 84.94 | 87.91 |
| VLSM-Adapter | 74.47 | 77.50 | 77.63 | 80.53 | 80.83 | 83.77 | 83.85 | 86.72 |
| CausalCLIPSeg | 71.19 | 73.74 | 75.42 | 78.00 | 78.60 | 81.22 | 81.34 | 84.20 |
| CAT-Seg | 78.76 | 81.50 | 81.12 | 83.92 | 83.32 | 85.61 | 85.90 | 88.31 |
| **MedCLIPSeg (Ours)** | **81.10** | **83.94** | **85.08** | **87.85** | **87.18** | **89.95** | **88.66** | **91.35** |

Domain Generalization (DSC %)

Datasets are grouped by modality: breast ultrasound (BUSI, BUSBRA, BUSUC, BUID, UDIAT), polyp endoscopy (Kvasir-SEG, ColonDB, ClinicDB), brain MRI (BTMRI, BRISC), and skin dermatoscopy (ISIC, UWaterloo).

| Method | BUSI | BUSBRA | BUSUC | BUID | UDIAT | Kvasir-SEG | ColonDB | ClinicDB | BTMRI | BRISC | ISIC | UWaterloo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LViT | 75.32 | 59.41 | 67.95 | 53.51 | 65.60 | 85.29 | 60.01 | 75.27 | 81.41 | 71.86 | 91.21 | 58.87 |
| CLIPSeg | 80.95 | 63.66 | 75.03 | 68.43 | 56.67 | 81.98 | 59.93 | 71.49 | 86.33 | 77.61 | 90.55 | 80.19 |
| DenseCLIP | 71.85 | 53.34 | 70.97 | 63.53 | 54.93 | 79.32 | 56.38 | 68.08 | 70.30 | 34.12 | 89.29 | 53.39 |
| ZegCLIP | 72.08 | 61.08 | 73.57 | 71.75 | 52.41 | 78.46 | 53.46 | 69.75 | 76.65 | 66.31 | 81.45 | 38.60 |
| SAN | 77.99 | 64.37 | 74.15 | 58.13 | 61.98 | 83.16 | 61.82 | 74.46 | 85.27 | 71.60 | 91.39 | 82.51 |
| MaPLe | 66.37 | 50.08 | 71.52 | 70.77 | 57.81 | 76.12 | 48.09 | 59.64 | 75.40 | 45.19 | 88.31 | 69.12 |
| MaPLe + Decoder | 80.49 | 55.89 | 64.96 | 60.66 | 59.44 | 83.46 | 61.53 | 71.20 | 85.08 | 71.46 | 90.10 | 81.83 |
| VLSM-Adapter | 80.90 | 68.48 | 82.37 | 75.26 | 69.16 | 85.89 | 63.51 | 76.09 | 85.03 | 68.92 | 91.30 | 82.17 |
| CausalCLIPSeg | 76.11 | 55.87 | 69.12 | 64.49 | 48.90 | 78.77 | 41.65 | 57.54 | 81.71 | 53.96 | 89.47 | 48.73 |
| CAT-Seg | 81.83 | 70.94 | 81.48 | 73.37 | 70.30 | 86.43 | 68.49 | 70.35 | 84.86 | 76.28 | 91.27 | 82.02 |
| **MedCLIPSeg (Ours)** | **85.72** | **75.06** | **84.37** | **78.99** | **74.64** | **90.15** | **71.90** | **80.80** | **88.03** | **80.92** | **92.54** | **83.53** |

Segmentation & Uncertainty Visualization

MedCLIPSeg produces both a segmentation mask and a dense uncertainty map. Uncertainty tends to peak along ambiguous boundaries and challenging regions, and remains consistent across in-distribution and out-of-distribution samples, supporting interpretability and review of prediction reliability.

Segmentation and uncertainty visualization
Example predictions with uncertainty maps. ID datasets are in blue while OOD datasets are in red.
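The entropy-based uncertainty maps described in the Method section can be sketched as follows: average the Monte Carlo foreground-probability samples into a mean mask, then compute the per-pixel binary entropy. A minimal sketch, assuming `(T, H, W)` probability samples (the paper's exact post-processing may differ):

```python
import numpy as np

def mc_uncertainty(prob_samples):
    """Mean mask and entropy map from Monte Carlo probability samples.

    prob_samples: (T, H, W) array of per-sample foreground probabilities.
    Returns the predictive mean map and the binary-entropy uncertainty map.
    """
    p = prob_samples.mean(axis=0)             # predictive mean over T samples
    p = np.clip(p, 1e-7, 1 - 1e-7)            # avoid log(0)
    entropy = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))  # bits, max 1 at p=0.5
    return p, entropy

samples = np.stack([np.array([[0.0, 1.0], [0.5, 1.0]]),
                    np.array([[0.0, 0.0], [0.5, 1.0]])])  # T=2 toy 2x2 maps
mean_mask, unc = mc_uncertainty(samples)
```

Pixels where the samples disagree (mean near 0.5) get entropy near 1 bit, while confidently foreground or background pixels get entropy near 0, which is why uncertainty concentrates along ambiguous boundaries.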

Ablation Studies

We analyze the contribution of MedCLIPSeg’s components and design choices, including PVL adapters, gating, probabilistic attention, bidirectional interaction, contrastive loss, prompt style, and CLIP backbone selection.

Effectiveness of Key Design Components

| Method | ID DSC (%) | OOD DSC (%) | HM DSC (%) |
|---|---|---|---|
| **MedCLIPSeg (Ours)** | **89.11** | **79.02** | **83.76** |
| *Probabilistic Vision–Language Adapters* | | | |
| w/o PVL Adapters | 81.23 (−7.88) | 55.23 (−23.79) | 65.75 (−18.01) |
| w/o Gating | 87.55 (−1.56) | 76.79 (−2.23) | 81.82 (−1.94) |
| w/o AttnPVL | 86.21 (−2.90) | 74.13 (−4.89) | 79.71 (−4.05) |
| Deterministic MedCLIPSeg | 87.68 (−1.43) | 63.12 (−15.90) | 73.40 (−10.36) |
| *Bidirectional Multimodal Interaction* | | | |
| w/o Visual Adaptation | 81.50 (−7.61) | 64.40 (−14.62) | 71.95 (−11.81) |
| w/o Textual Adaptation | 88.83 (−0.28) | 76.40 (−2.62) | 82.15 (−1.61) |
| w/o Bidirectional Interaction | 88.71 (−0.40) | 77.71 (−1.31) | 82.85 (−0.91) |
| Unimodal MedCLIPSeg | 86.53 (−2.58) | 73.49 (−5.53) | 79.48 (−4.28) |
| *Contrastive Loss* | | | |
| w/o SoftCon Loss | 87.24 (−1.87) | 77.08 (−1.94) | 81.84 (−1.92) |
| Hard Targets | 88.34 (−0.77) | 77.64 (−1.38) | 82.65 (−1.11) |
| Attention-pooled SoftCon Loss | 88.73 (−0.38) | 75.60 (−3.42) | 81.64 (−2.12) |
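The HM column in the ablation tables is the harmonic mean of the ID and OOD DSC, which balances in-distribution accuracy against generalization. The top row can be checked directly:

```python
# Harmonic mean (HM) of ID and OOD DSC, as reported in the ablation tables.
def harmonic_mean(a, b):
    return 2 * a * b / (a + b)

print(round(harmonic_mean(89.11, 79.02), 2))  # → 83.76
```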

Layer-wise PVL Intervention and Confidence Weight (β)

Layer-wise interventions and confidence weight (beta) ablations

Layer-wise PVL adapter interventions and confidence-weight (β) ablations, averaged over ID and OOD data. Deeper interventions improve performance up to roughly Layer 10, and β = 2.35 yields the best harmonic mean.

Effect of Text Prompt Design

| Text Prompt Style | ID DSC (%) | OOD DSC (%) | HM DSC (%) |
|---|---|---|---|
| Contradictory | 68.60 | 63.21 | 65.79 |
| Missing Location | 86.98 | 77.75 | 82.11 |
| Overdescriptive | 82.93 | 74.49 | 78.48 |
| Underdescriptive | 66.91 | 49.38 | 56.82 |
| **Original** | **89.11** | **79.02** | **83.76** |

Effect of Pre-trained Vision–Language Models

| Pre-trained Model | ID DSC (%) | OOD DSC (%) | HM DSC (%) |
|---|---|---|---|
| CLIP | 88.48 | 74.81 | 81.07 |
| PubMedCLIP | 86.67 | 73.05 | 79.28 |
| BiomedCLIP | 88.70 | 77.08 | 82.48 |
| **UniMedCLIP** | **89.11** | **79.02** | **83.76** |

BibTeX

@inproceedings{koleilat2026medclipseg,
  title     = {MedCLIPSeg: Probabilistic Vision--Language Adaptation for Data-Efficient and Generalizable Medical Image Segmentation},
  author    = {Koleilat, Taha and Asgariandehkordi, Hojat and Nejati Manzari, Omid and Barile, Berardino and Xiao, Yiming and Rivaz, Hassan},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}