BiomedCoOp: Learning to Prompt for Biomedical Vision-Language Models

Taha Koleilat · Hojat Asgariandehkordi · Hassan Rivaz · Yiming Xiao
Concordia University · Health-X Lab · IMPACT Lab

Overview of BiomedCoOp: LLM-guided prompt ensembles and selective distillation enable robust few-shot biomedical image classification.

Abstract

Recent advancements in vision-language models (VLMs), such as CLIP, have demonstrated substantial success in self-supervised representation learning for vision tasks. However, effectively adapting VLMs to downstream applications remains challenging, as their accuracy often depends on time-intensive and expertise-demanding prompt engineering, while full model fine-tuning is costly. This is particularly true for biomedical images, which, unlike natural images, typically suffer from limited annotated datasets, unintuitive image contrasts, and nuanced visual features. Recent prompt learning techniques, such as Context Optimization (CoOp), aim to tackle these issues but still fall short in generalizability, and explorations of prompt learning for biomedical image analysis remain highly limited. In this work, we propose BiomedCoOp, a novel prompt learning framework that enables efficient adaptation of BiomedCLIP for accurate and highly generalizable few-shot biomedical image classification. Our approach achieves effective prompt context learning by leveraging semantic consistency with average prompt ensembles from Large Language Models (LLMs) and knowledge distillation with a statistics-based prompt selection strategy. We conducted comprehensive validation of our proposed framework on 11 medical datasets across 9 modalities and 10 organs against existing state-of-the-art methods, demonstrating significant improvements in both accuracy and generalizability.

Method

  • LLM-guided Prompt Ensembles: LLM-generated biomedical descriptions guide context learning.
  • Semantic Consistency: Learned prompt contexts are aligned with averaged LLM prompt embeddings (see the first sketch below).
  • Selective Knowledge Distillation: Statistics-based pruning removes noisy prompts before distillation (see the second sketch below).
  • BiomedCLIP Backbone: Enables robust multi-modal biomedical representations.
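
For illustration, here is a minimal PyTorch sketch of the first two components: it encodes a handful of LLM-generated class descriptions with BiomedCLIP (loaded via open_clip from the Hugging Face Hub, following the BiomedCLIP model card), averages them into per-class ensemble targets, and pulls a set of learnable prompt features toward those targets with an L1 consistency term. The class names, prompt texts, and the exact loss are placeholders and assumptions, not the reference implementation.

import torch
import torch.nn.functional as F
from open_clip import create_model_from_pretrained, get_tokenizer

HUB = "hf-hub:microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224"
model, _ = create_model_from_pretrained(HUB)   # backbone used here only to encode the LLM prompts
tokenizer = get_tokenizer(HUB)
model.eval()

# Hypothetical LLM-generated descriptions for two classes (placeholders).
llm_prompts = {
    "glioma":     ["an MRI showing a glioma, an irregular enhancing mass",
                   "brain MRI of a tumor with surrounding edema"],
    "meningioma": ["an MRI showing a meningioma, a dural-based mass",
                   "brain MRI of a well-circumscribed extra-axial tumor"],
}

# Average each class's prompt embeddings into a single ensemble target.
with torch.no_grad():
    targets = []
    for prompts in llm_prompts.values():
        tokens = tokenizer(prompts, context_length=256)
        feats = F.normalize(model.encode_text(tokens), dim=-1)      # (P, D)
        targets.append(F.normalize(feats.mean(dim=0), dim=-1))      # averaged ensemble
    targets = torch.stack(targets)                                  # (C, D)

# Stand-in for CoOp-style learnable-context prompts: to keep the sketch short we
# optimize per-class text features directly instead of context token embeddings.
learned = torch.nn.Parameter(targets.clone() + 0.05 * torch.randn_like(targets))
optimizer = torch.optim.SGD([learned], lr=0.01)

for step in range(100):
    loss_scl = F.l1_loss(F.normalize(learned, dim=-1), targets)     # semantic-consistency term
    optimizer.zero_grad()
    loss_scl.backward()
    optimizer.step()

In the full framework this consistency term would be combined with the usual few-shot classification objective; the sketch isolates only the consistency part.
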
Overview diagram of the BiomedCoOp method.
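
The statistics-based selection can be pictured as follows; the concrete rule below (dropping prompts whose similarity to the class centroid falls more than one standard deviation below the mean) is an illustrative assumption, not necessarily the exact criterion used in the paper.

import torch
import torch.nn.functional as F

def select_prompts(prompt_feats: torch.Tensor, num_std: float = 1.0) -> torch.Tensor:
    """prompt_feats: (P, D) normalized text features of one class's LLM prompts.
    Returns a boolean mask marking the prompts to keep."""
    centroid = F.normalize(prompt_feats.mean(dim=0), dim=-1)   # class centroid
    scores = prompt_feats @ centroid                           # cosine similarity, (P,)
    threshold = scores.mean() - num_std * scores.std()
    return scores >= threshold                                 # drop low-scoring outliers

# Example with random stand-ins for encoded prompt features.
feats = F.normalize(torch.randn(8, 512), dim=-1)
kept = select_prompts(feats)
pruned_target = F.normalize(feats[kept].mean(dim=0), dim=-1)   # ensemble over kept prompts only

In the framework, the pruned ensemble would then replace the plain average as the distillation target.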

Few-shot Evaluation

Few-shot classification accuracy (%) averaged over the 11 biomedical datasets; K is the number of labeled training examples (shots) per class.

Method                          | K = 1        | K = 2        | K = 4        | K = 8        | K = 16

Zero-shot methods (independent of K)
BiomedCLIP                      | 42.05
BiomedCLIP + Ensemble           | 52.27
BiomedCLIP + Selective Ensemble | 53.72

CLIP-based adapter methods
CLIP-Adapter                    | 44.66 ± 2.97 | 43.91 ± 2.48 | 44.36 ± 1.94 | 45.42 ± 2.38 | 46.69 ± 1.71
Tip-Adapter                     | 49.19 ± 4.84 | 52.36 ± 6.57 | 57.33 ± 5.07 | 61.98 ± 5.76 | 67.15 ± 4.25
Tip-Adapter-F                   | 51.17 ± 8.33 | 52.74 ± 5.88 | 61.23 ± 6.22 | 65.91 ± 3.64 | 70.91 ± 2.65

Prompt learning methods
CoOp                            | 50.16 ± 6.93 | 54.18 ± 4.31 | 59.75 ± 3.72 | 65.84 ± 3.66 | 69.62 ± 2.83
CoCoOp                          | 48.49 ± 4.39 | 51.28 ± 5.06 | 54.69 ± 4.79 | 61.08 ± 3.49 | 65.09 ± 2.87
KgCoOp                          | 50.85 ± 5.59 | 53.18 ± 4.33 | 57.82 ± 4.50 | 62.08 ± 2.59 | 62.84 ± 1.72
ProGrad                         | 51.88 ± 6.39 | 54.71 ± 4.46 | 60.42 ± 4.78 | 65.61 ± 3.02 | 67.13 ± 3.00
BiomedCoOp (Ours)               | 57.03 ± 2.80 | 59.13 ± 3.64 | 63.95 ± 2.42 | 68.32 ± 2.65 | 72.42 ± 1.69
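
For context on the zero-shot baselines above, the snippet below shows one plausible reading of "BiomedCLIP + Ensemble": average the text embeddings of several prompt templates per class and classify by cosine similarity. The class names, templates, and image path are placeholders, not the actual prompt set used in the paper.

import torch
import torch.nn.functional as F
from PIL import Image
from open_clip import create_model_from_pretrained, get_tokenizer

HUB = "hf-hub:microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224"
model, preprocess = create_model_from_pretrained(HUB)
tokenizer = get_tokenizer(HUB)
model.eval()

classes = ["glioma", "meningioma", "pituitary tumor"]                 # placeholder labels
templates = ["a brain MRI showing a {}.", "an MRI scan with evidence of a {}."]

with torch.no_grad():
    # Build one averaged ("ensembled") text embedding per class.
    class_feats = []
    for c in classes:
        tokens = tokenizer([t.format(c) for t in templates], context_length=256)
        feats = F.normalize(model.encode_text(tokens), dim=-1).mean(dim=0)
        class_feats.append(F.normalize(feats, dim=-1))
    class_feats = torch.stack(class_feats)                            # (C, D)

    image = preprocess(Image.open("scan.png").convert("RGB")).unsqueeze(0)  # placeholder path
    img_feat = F.normalize(model.encode_image(image), dim=-1)         # (1, D)
    probs = (100.0 * img_feat @ class_feats.T).softmax(dim=-1)        # class probabilities
    print(dict(zip(classes, probs.squeeze(0).tolist())))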

Base-to-Novel Generalization

Accuracy (%) averaged over datasets on the base and novel class splits; HM denotes the harmonic mean of base and novel accuracy.

Split | BiomedCLIP | CoOp  | CoCoOp | KgCoOp | ProGrad | BiomedCoOp (Ours)
Base  | 47.84      | 73.85 | 72.26  | 68.36  | 71.67   | 76.26
Novel | 65.42      | 64.75 | 67.03  | 64.08  | 66.93   | 73.92
HM    | 53.81      | 67.23 | 67.22  | 64.61  | 67.43   | 75.07

Visual Interpretability (Saliency Maps)

We visualize model attention using gScoreCAM. Compared to baseline prompt learning methods, BiomedCoOp consistently focuses on clinically relevant regions while suppressing background noise across modalities.

Comparison of gScoreCAM saliency maps for BiomedCLIP, CoOp, CoCoOp, and BiomedCoOp (ours) across representative biomedical images.
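
As a rough, reproducible stand-in (this is not gScoreCAM itself, but a much simpler gradient-based saliency), the sketch below back-propagates the BiomedCLIP image-text similarity to the input pixels. Model loading follows the open_clip usage from the BiomedCLIP model card; the image path and prompt are placeholders.

import torch
import torch.nn.functional as F
from PIL import Image
from open_clip import create_model_from_pretrained, get_tokenizer

HUB = "hf-hub:microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224"
model, preprocess = create_model_from_pretrained(HUB)
tokenizer = get_tokenizer(HUB)
model.eval()

image = preprocess(Image.open("brain_mri.png").convert("RGB")).unsqueeze(0)  # placeholder path
image.requires_grad_(True)
text = tokenizer(["an MRI scan of a brain tumor"], context_length=256)       # placeholder prompt

img_feat = F.normalize(model.encode_image(image), dim=-1)
txt_feat = F.normalize(model.encode_text(text), dim=-1)
similarity = (img_feat * txt_feat).sum()
similarity.backward()

# Per-pixel saliency: max absolute gradient over the RGB channels, normalized to [0, 1].
saliency = image.grad.abs().max(dim=1).values.squeeze(0)
saliency = (saliency - saliency.min()) / (saliency.max() - saliency.min() + 1e-8)

The resulting (H, W) map can then be upsampled and overlaid on the original image for inspection.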

BibTeX

@inproceedings{koleilat2025biomedcoop,
  title={{BiomedCoOp}: Learning to Prompt for Biomedical Vision-Language Models},
  author={Koleilat, Taha and Asgariandehkordi, Hojat and Rivaz, Hassan and Xiao, Yiming},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={14766--14776},
  year={2025}
}