Parameter-efficient adaptation of biomedical vision-language foundation models is essential for robust multimodal understanding in low-data and shifted clinical settings. Existing adaptation methods are often deterministic and can apply residual updates even when image-text evidence is ambiguous or unreliable. We present Evi-Steer, an evidential cross-modal low-dimensional steering framework for BiomedCLIP that enables uncertainty-aware fine-tuning while updating only 0.11% of the total model parameters. Evi-Steer performs lightweight token updates in both the vision and text encoders, estimates latent-dimension epistemic uncertainty, and uses these estimates to conservatively gate residual steering. It further introduces Dempster-Shafer cross-modal confidence fusion, conditioning visual adaptation on textual confidence to suppress conflicting or uncertain updates. Across 15 biomedical imaging datasets spanning 8 organs and 8 imaging modalities, Evi-Steer improves few-shot learning and domain generalization, providing a practical pathway for reliable biomedical VLM adaptation.
Overall Evi-Steer pipeline: evidential cross-modal low-dimensional adapters generate confidence-weighted representation updates for the text and vision encoders.
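This page does not give the equations behind the uncertainty gate or the Dempster-Shafer fusion, so the following is only a rough illustration of the general idea. It uses the standard subjective-logic mapping from non-negative evidence to belief and uncertainty masses, plus the reduced Dempster combination rule often used in evidential multi-view learning. The function names, the toy two-class evidence, and the `1 - uncertainty` gate are assumptions made for this sketch, not Evi-Steer's actual formulation.

```python
from typing import List, Tuple

Opinion = Tuple[List[float], float]  # (belief masses, uncertainty mass)


def opinion_from_evidence(evidence: List[float]) -> Opinion:
    """Map non-negative evidence to belief masses and an uncertainty mass.

    Uses Dirichlet parameters alpha_i = e_i + 1, so beliefs and uncertainty
    sum to one and uncertainty shrinks as total evidence grows.
    """
    k = len(evidence)
    strength = sum(evidence) + k
    beliefs = [e / strength for e in evidence]
    return beliefs, k / strength


def dempster_combine(a: Opinion, b: Opinion) -> Opinion:
    """Fuse two opinions with the reduced Dempster rule."""
    (b_a, u_a), (b_b, u_b) = a, b
    n = len(b_a)
    # Conflict: mass the two sources assign to different classes.
    conflict = sum(b_a[i] * b_b[j] for i in range(n) for j in range(n) if i != j)
    scale = 1.0 - conflict
    beliefs = [(b_a[i] * b_b[i] + b_a[i] * u_b + b_b[i] * u_a) / scale for i in range(n)]
    return beliefs, (u_a * u_b) / scale


def gated_residual(hidden: List[float], update: List[float], uncertainty: float) -> List[float]:
    """Apply a residual update scaled by confidence (assumed gate: 1 - uncertainty)."""
    gate = 1.0 - uncertainty
    return [h + gate * d for h, d in zip(hidden, update)]


# Toy example: a confident text opinion fused with an uncertain vision opinion.
text_opinion = opinion_from_evidence([8.0, 0.0])
vision_opinion = opinion_from_evidence([1.0, 1.0])
fused_beliefs, fused_uncertainty = dempster_combine(text_opinion, vision_opinion)
steered = gated_residual([1.0, 2.0], [0.5, -0.5], fused_uncertainty)
```

In this toy setup, higher fused uncertainty shrinks the residual toward zero, which is the conservative behaviour the abstract describes: an update gated with uncertainty 1.0 leaves the hidden state unchanged.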
Evi-Steer is evaluated in two clinically relevant settings: few-shot adaptation with K = 4, 8, and 16 labeled samples per class, and domain generalization, where models are trained with 16 shots on source datasets and tested on unseen target datasets without additional adaptation. All results below are accuracies (%) averaged over three seeds.
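The K-shot protocol amounts to drawing K labeled examples per class with a fixed seed and repeating the draw for each seed. The helper below is a hypothetical sketch of that sampling step; `sample_k_shot`, the toy label list, and the seed values are assumptions for illustration and are not taken from the authors' data pipeline.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def sample_k_shot(labels: Sequence[Hashable], k: int, seed: int) -> List[int]:
    """Return sorted dataset indices with at most k examples per class."""
    by_class: Dict[Hashable, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # local generator, so each seed is reproducible
    chosen: List[int] = []
    for label in sorted(by_class):  # fixed class order keeps the draw deterministic
        pool = by_class[label]
        chosen.extend(rng.sample(pool, min(k, len(pool))))
    return sorted(chosen)


# Toy usage: one 4-shot subset per seed, mirroring a three-seed average.
toy_labels = ["benign"] * 10 + ["malignant"] * 10
subsets = {seed: sample_k_shot(toy_labels, k=4, seed=seed) for seed in (1, 2, 3)}
```

Accuracy would then be computed once per seed and averaged, which is how the three-seed numbers in the tables are described.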
| Method | K=4 | K=8 | K=16 |
|---|---|---|---|
| Zero-shot BiomedCLIP | – | 43.81 | – |
| CoOp | 65.52 | 72.36 | 76.26 |
| CoCoOp | 60.63 | 67.75 | 72.25 |
| KgCoOp | 65.19 | 70.74 | 72.48 |
| ProGrad | 66.33 | 71.76 | 73.98 |
| BiomedCoOp | 67.50 | 72.43 | 77.15 |
| LP++ | 65.51 | 70.85 | 75.42 |
| CLIP-Adapter | 47.11 | 48.51 | 50.60 |
| Tip-Adapter-F | 66.22 | 72.73 | 77.60 |
| GDA | 67.34 | 74.92 | 77.23 |
| CLIP-LoRA | 65.93 | 72.47 | 74.75 |
| Evi-Steer (Ours) | 71.43 | 77.33 | 81.18 |

| Method | ID Acc. (%) | OOD Acc. (%) | HM Acc. (%) |
|---|---|---|---|
| Zero-shot BiomedCLIP | 58.27 | 60.65 | 59.44 |
| CoOp | 75.93 | 73.46 | 74.67 |
| CoCoOp | 74.33 | 71.32 | 72.79 |
| ProGrad | 77.05 | 73.36 | 75.16 |
| KgCoOp | 75.85 | 75.03 | 75.44 |
| GDA | 74.36 | 70.72 | 72.49 |
| CLIP-LoRA | 75.81 | 71.69 | 73.69 |
| BiomedCoOp | 76.82 | 72.30 | 74.49 |
| Evi-Steer (Ours) | 79.78 | 77.95 | 78.85 |
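
The HM column appears to be the harmonic mean of the ID and OOD accuracies, which matches the "harmonic-mean" wording used in the ablation figure caption. The short check below recomputes it for two rows of the table; the function name is an assumption, and the check works from the rounded table values rather than per-seed results.

```python
def harmonic_mean(id_acc: float, ood_acc: float) -> float:
    """Harmonic mean of in-distribution and out-of-distribution accuracy."""
    return 2.0 * id_acc * ood_acc / (id_acc + ood_acc)


# Rows copied from the domain-generalization table above.
evi_steer_hm = round(harmonic_mean(79.78, 77.95), 2)   # 78.85
zero_shot_hm = round(harmonic_mean(58.27, 60.65), 2)   # 59.44
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a method cannot score well on HM by trading OOD accuracy for ID accuracy.
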

| Method | Breast Ultrasound | | | | Brain MRI | | | |
|---|---|---|---|---|---|---|---|---|
| | BUSI | BUID | BUSBRA | UDIAT | BTMRI | BTMRI-P | BTMRI-S | BRISC |
| BiomedCLIP | 59.75 | 75.00 | 66.78 | 61.54 | 56.79 | 52.80 | 55.20 | 52.60 |
| CoOp | 69.49 | 72.22 | 68.28 | 70.49 | 82.37 | 76.77 | 78.13 | 74.87 |
| CoCoOp | 70.20 | 67.59 | 66.55 | 71.54 | 78.45 | 75.43 | 76.62 | 70.20 |
| ProGrad | 71.47 | 69.44 | 67.25 | 69.23 | 82.63 | 78.00 | 79.96 | 76.27 |
| KgCoOp | 70.62 | 77.78 | 68.79 | 75.64 | 81.07 | 75.60 | 79.02 | 73.37 |
| GDA | 66.81 | 63.89 | 63.02 | 67.95 | 81.91 | 77.47 | 76.75 | 75.23 |
| CLIP-LoRA | 71.42 | 65.85 | 65.37 | 69.82 | 80.19 | 77.90 | 77.68 | 73.50 |
| BiomedCoOp | 70.34 | 67.59 | 62.90 | 71.67 | 83.30 | 77.60 | 79.52 | 74.50 |
| Evi-Steer (Ours) | 73.16 | 78.70 | 71.61 | 78.21 | 86.39 | 80.23 | 80.53 | 78.43 |
We conduct ablation studies to isolate the contribution of each proposed component. Visual adaptation, textual adaptation, evidential uncertainty modeling, and cross-modal belief fusion all contribute to the final domain-generalization performance: removing any one of them lowers ID, OOD, and HM accuracy. Visual adaptation has the largest impact on out-of-distribution robustness, with its removal reducing OOD accuracy by 5.72 points and HM accuracy by 4.68 points.
| Method | ID Acc. (%) | OOD Acc. (%) | HM Acc. (%) |
|---|---|---|---|
| Evi-Steer (Ours) | 79.79 | 77.97 | 78.87 |
| w/o Visual Adaptation | 76.23 (−3.56) | 72.25 (−5.72) | 74.19 (−4.68) |
| w/o Textual Adaptation | 78.40 (−1.39) | 76.97 (−1.00) | 77.68 (−1.19) |
| w/o Evidential Update | 79.25 (−0.54) | 76.34 (−1.63) | 77.77 (−1.10) |
| w/o Cross-modal Belief | 79.52 (−0.27) | 76.90 (−1.07) | 78.19 (−0.68) |
We further analyze the effect of the intervention depth d (the number of adapted encoder layers) and the latent adapter dimension r. Increasing d consistently improves domain-generalization performance, indicating that distributing lightweight steering modules throughout the network is more effective than shallow adaptation. Performance also improves as r grows from small values and peaks at r = 4. Larger dimensions provide no additional benefit and slightly reduce performance, suggesting that compact low-dimensional updates best balance adaptation capacity against overfitting.
Layer and rank ablation studies. Left: effect of the number of adapted encoder layers (d) on harmonic-mean domain-generalization accuracy. Right: effect of the latent adapter dimension (r). Increasing intervention depth consistently improves performance, while a compact adapter dimension of r = 4 provides the best trade-off between adaptation capacity and generalization.
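To make the capacity side of this trade-off concrete, the sketch below counts trainable weights for a generic bottleneck adapter with one down-projection and one up-projection per adapted layer. The adapter layout, the hidden size of 768, and the 12-layer depth are assumptions chosen for illustration; they are not taken from the Evi-Steer configuration and are not meant to reproduce the 0.11% figure.

```python
def adapter_param_count(hidden_dim: int, rank: int, num_layers: int, use_bias: bool = True) -> int:
    """Trainable parameters for a down/up bottleneck adapter in each adapted layer."""
    per_layer = 2 * hidden_dim * rank  # down-projection plus up-projection weights
    if use_bias:
        per_layer += rank + hidden_dim  # one bias vector per projection
    return per_layer * num_layers


# Hypothetical sweep over the latent dimension r at a fixed depth d = 12.
counts = {r: adapter_param_count(hidden_dim=768, rank=r, num_layers=12, use_bias=False)
          for r in (1, 2, 4, 8, 16)}
```

Under these assumptions the count grows linearly in both r and d, so doubling r doubles the trainable weights. That is one way to read the observation that dimensions beyond r = 4 add capacity without improving generalization.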
@article{koleilat2026evi,
title={Evi-Steer: Learning to Steer Biomedical Vision-Language Models through Efficient and Generalizable Evidential Tuning},
author={Koleilat, Taha and Rivaz, Hassan and Xiao, Yiming},
journal={arXiv preprint arXiv:2605.26292},
year={2026}
}