Evi-Steer: Learning to Steer Biomedical Vision-Language Models through Efficient and Generalizable Evidential Tuning

MICCAI 2026 Early Accept

Evi-Steer overview figure

Evidential cross-modal low-dimensional steering for BiomedCLIP, enabling uncertainty-aware parameter-efficient adaptation with conservative, confidence-weighted updates under few-shot learning and domain shift.

Abstract

Parameter-efficient adaptation of biomedical vision-language foundation models is essential for robust multimodal understanding in low-data and shifted clinical settings. Existing adaptation methods are often deterministic and can apply residual updates even when image-text evidence is ambiguous or unreliable. We present Evi-Steer, an evidential cross-modal low-dimensional steering framework for BiomedCLIP that enables uncertainty-aware fine-tuning while updating only 0.11% of the total model parameters. Evi-Steer performs lightweight token updates in both the vision and text encoders, estimates latent-dimension epistemic uncertainty, and uses these estimates to conservatively gate residual steering. It further introduces Dempster-Shafer cross-modal confidence fusion, conditioning visual adaptation on textual confidence to suppress conflicting or uncertain updates. Across 15 biomedical imaging datasets spanning 8 organs and 8 imaging modalities, Evi-Steer improves few-shot learning and domain generalization, providing a practical pathway for reliable biomedical VLM adaptation.

Method

  • Evidential Low-Dimensional Representation Steering: Evi-Steer projects text and vision activations into a compact latent space, computes lightweight token updates, and estimates latent-dimension epistemic uncertainty in a single forward pass.
  • Confidence-Weighted Residual Updates: Evidence-derived uncertainty is converted into confidence weights, allowing the model to steer strongly when evidence is reliable and adapt conservatively when evidence is weak.
  • Cross-Modal Reliability Fusion: A Dempster-Shafer belief fusion mechanism conditions vision-side updates on textual confidence, suppressing ambiguous or conflicting image-text updates under domain shift.
  • Frozen BiomedCLIP Backbone: The core BiomedCLIP vision and text encoders remain frozen, preserving pretrained representations while training a small set of steering modules.
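
The first two components above can be illustrated with a minimal sketch. It is not the released implementation: the function name `gated_steer`, the matrices `down`, `up`, and `evid`, and in particular the mapping from evidence to uncertainty, u = 1 / (1 + e), are assumptions chosen for illustration. The sketch only shows the general pattern of projecting an activation to r latent dimensions, estimating a per-dimension confidence, and scaling the residual update by that confidence.

```python
import math
import random

def softplus(x):
    # Numerically stable log(1 + exp(x)); always non-negative.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def matvec(m, v):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def gated_steer(h, down, up, evid):
    """Return (steered activation, per-latent-dimension confidence).

    h:    frozen-encoder activation, length `hidden`
    down: r x hidden projection into the latent space
    up:   hidden x r projection back to the hidden size
    evid: r x hidden evidence head (hypothetical)
    """
    z = matvec(down, h)                         # low-dimensional update
    e = [softplus(a) for a in matvec(evid, h)]  # non-negative evidence
    u = [1.0 / (1.0 + ej) for ej in e]          # ASSUMED: more evidence, less uncertainty
    c = [1.0 - uj for uj in u]                  # confidence weights in [0, 1)
    gated = [cj * zj for cj, zj in zip(c, z)]   # weak evidence shrinks the update
    delta = matvec(up, gated)
    return [hi + di for hi, di in zip(h, delta)], c

def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
hidden, r = 8, 4                                # toy sizes, not the paper's
h = [rng.gauss(0.0, 1.0) for _ in range(hidden)]
steered, conf = gated_steer(
    h, rand_matrix(r, hidden, rng), rand_matrix(hidden, r, rng), rand_matrix(r, hidden, rng)
)
```

Because the confidence multiplies the latent update before the up-projection, a dimension with little evidence contributes almost nothing to the residual, which is the conservative behavior the bullets describe.
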
Overall architecture of Evi-Steer

Overall Evi-Steer pipeline: evidential cross-modal low-dimensional adapters generate confidence-weighted representation updates for the text and vision encoders.
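
The cross-modal fusion step is described only at a high level on this page. As a hedged illustration, the sketch below applies Dempster's rule of combination in the reduced form commonly used for evidential opinions, where each opinion holds belief masses over the same hypotheses plus one uncertainty mass. The two-hypothesis frame ("apply update" vs. "suppress update"), the variable names, and the example masses are assumptions, not values from the paper.

```python
def dempster_combine(b1, u1, b2, u2):
    """Combine two opinions with Dempster's rule.

    b1, b2: belief masses over the same hypotheses
    u1, u2: uncertainty masses; sum(b) + u must equal 1 for each opinion
    """
    # Conflict: mass that the two opinions place on different hypotheses.
    conflict = sum(
        x * y for i, x in enumerate(b1) for j, y in enumerate(b2) if i != j
    )
    if conflict >= 1.0:
        raise ValueError("total conflict: opinions cannot be combined")
    scale = 1.0 / (1.0 - conflict)
    fused_b = [scale * (x * y + x * u2 + y * u1) for x, y in zip(b1, b2)]
    fused_u = scale * u1 * u2
    return fused_b, fused_u

# Hypothetical opinions over ("apply update", "suppress update").
text_b, text_u = [0.8, 0.1], 0.1   # text side is confident
vis_b, vis_u = [0.3, 0.2], 0.5     # vision side is uncertain
fused_b, fused_u = dempster_combine(text_b, text_u, vis_b, vis_u)
```

In this toy case the fused uncertainty is lower than either input, and the confident textual opinion dominates the result, which matches the stated idea of conditioning vision-side updates on textual confidence.
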

Results

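Evi-Steer is evaluated in two clinically relevant settings: few-shot adaptation with K = 4, 8, and 16 labeled samples per class, and domain generalization, where models are trained with 16 shots on source datasets and tested on unseen target datasets without additional adaptation. All results below are accuracies (%) averaged over three seeds. In the domain-generalization tables, ID and OOD denote accuracy on the source (in-distribution) and unseen target (out-of-distribution) datasets, and HM is their harmonic mean.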
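
To make the few-shot protocol concrete, the following sketch draws K labeled samples per class for each seed. The helper `sample_few_shot`, the toy label list, and the seed values 1 to 3 are hypothetical; the actual splits used in the paper are not specified on this page.

```python
import random
from collections import defaultdict

def sample_few_shot(labels, k, seed):
    """Return sorted indices of k randomly chosen samples per class."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    rng = random.Random(seed)  # separate generator so each seed is reproducible
    chosen = []
    for y in sorted(by_class):
        if len(by_class[y]) < k:
            raise ValueError(f"class {y!r} has fewer than {k} samples")
        chosen.extend(rng.sample(by_class[y], k))
    return sorted(chosen)

# Toy training labels (hypothetical class names and sizes).
labels = ["benign"] * 40 + ["malignant"] * 30 + ["normal"] * 20
splits = {
    (k, seed): sample_few_shot(labels, k, seed)
    for k in (4, 8, 16)
    for seed in (1, 2, 3)
}
```

Each entry of `splits` is one training subset; averaging test accuracy over the three seeds for a fixed K gives one cell of a few-shot results table.
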

Few-Shot Evaluation

Method                K=4     K=8     K=16
Zero-shot BiomedCLIP  43.81 (no labeled samples; same for every K)
CoOp                  65.52   72.36   76.26
CoCoOp                60.63   67.75   72.25
KgCoOp                65.19   70.74   72.48
ProGrad               66.33   71.76   73.98
BiomedCoOp            67.50   72.43   77.15
LP++                  65.51   70.85   75.42
CLIP-Adapter          47.11   48.51   50.60
Tip-Adapter-F         66.22   72.73   77.60
GDA                   67.34   74.92   77.23
CLIP-LoRA             65.93   72.47   74.75
Evi-Steer (Ours)      71.43   77.33   81.18

Domain Generalization

Method                ID      OOD     HM
Zero-shot BiomedCLIP  58.27   60.65   59.44
CoOp                  75.93   73.46   74.67
CoCoOp                74.33   71.32   72.79
ProGrad               77.05   73.36   75.16
KgCoOp                75.85   75.03   75.44
GDA                   74.36   70.72   72.49
CLIP-LoRA             75.81   71.69   73.69
BiomedCoOp            76.82   72.30   74.49
Evi-Steer (Ours)      79.78   77.95   78.85
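
The HM column can be checked against the ID and OOD columns. The short sketch below computes the harmonic mean for three rows of the table above; the function name `harmonic_mean` is ours. It assumes HM is computed from the reported, already averaged ID and OOD values, which is consistent with these rows but is not stated explicitly.

```python
def harmonic_mean(id_acc, ood_acc):
    """Harmonic mean of two accuracies given in percent."""
    if id_acc + ood_acc == 0:
        return 0.0
    return 2.0 * id_acc * ood_acc / (id_acc + ood_acc)

# (ID, OOD) pairs copied from the Domain Generalization table.
rows = {
    "Zero-shot BiomedCLIP": (58.27, 60.65),
    "CoOp": (75.93, 73.46),
    "Evi-Steer (Ours)": (79.78, 77.95),
}
hm = {name: round(harmonic_mean(i, o), 2) for name, (i, o) in rows.items()}
# hm matches the reported HM values: 59.44, 74.67, 78.85
```

The harmonic mean penalizes a gap between ID and OOD accuracy more than the arithmetic mean does, so a method cannot score well on HM by excelling in only one of the two.
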

Detailed OOD Transfer

                      Breast Ultrasound                   Brain MRI
Method                BUSI     BUID     BUSBRA   UDIAT    BTMRI    BTMRI-P  BTMRI-S  BRISC
BiomedCLIP            59.75    75.00    66.78    61.54    56.79    52.80    55.20    52.60
CoOp                  69.49    72.22    68.28    70.49    82.37    76.77    78.13    74.87
CoCoOp                70.20    67.59    66.55    71.54    78.45    75.43    76.62    70.20
ProGrad               71.47    69.44    67.25    69.23    82.63    78.00    79.96    76.27
KgCoOp                70.62    77.78    68.79    75.64    81.07    75.60    79.02    73.37
GDA                   66.81    63.89    63.02    67.95    81.91    77.47    76.75    75.23
CLIP-LoRA             71.42    65.85    65.37    69.82    80.19    77.90    77.68    73.50
BiomedCoOp            70.34    67.59    62.90    71.67    83.30    77.60    79.52    74.50
Evi-Steer (Ours)      73.16    78.70    71.61    78.21    86.39    80.23    80.53    78.43
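
The per-dataset numbers appear to aggregate into the ID and OOD columns of the Domain Generalization table. The sketch below assumes that BUSI and BTMRI are the source (ID) datasets and that the remaining six are the targets (OOD); this split is inferred from the arithmetic, not stated on this page, so treat it as an assumption. Under that assumption the CoOp row reproduces its reported ID and OOD values of 75.93 and 73.46.

```python
DATASETS = ["BUSI", "BUID", "BUSBRA", "UDIAT", "BTMRI", "BTMRI-P", "BTMRI-S", "BRISC"]
SOURCES = {"BUSI", "BTMRI"}  # ASSUMPTION inferred from the numbers, not from the text

def aggregate(row):
    """Return (ID mean, OOD mean) for one row of per-dataset accuracies."""
    acc = dict(zip(DATASETS, row))
    id_vals = [v for name, v in acc.items() if name in SOURCES]
    ood_vals = [v for name, v in acc.items() if name not in SOURCES]
    return sum(id_vals) / len(id_vals), sum(ood_vals) / len(ood_vals)

# CoOp row copied from the detailed table.
coop = [69.49, 72.22, 68.28, 70.49, 82.37, 76.77, 78.13, 74.87]
id_acc, ood_acc = aggregate(coop)
```
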

Ablation Studies

We ablate each proposed component to measure its contribution. Visual adaptation, textual adaptation, evidential uncertainty modeling, and cross-modal belief fusion all contribute to the final domain-generalization performance. Visual adaptation matters most for out-of-distribution robustness: removing it lowers OOD accuracy by 5.72 points, compared with 1.00 to 1.63 points for each of the other components.

Effectiveness of Key Components

Method                   ID Acc. (%)     OOD Acc. (%)    HM Acc. (%)
Evi-Steer (Ours)         79.79           77.97           78.87
w/o Visual Adaptation    76.23 (−3.56)   72.25 (−5.72)   74.19 (−4.68)
w/o Textual Adaptation   78.40 (−1.39)   76.97 (−1.00)   77.68 (−1.19)
w/o Evidential Update    79.25 (−0.54)   76.34 (−1.63)   77.77 (−1.10)
w/o Cross-modal Belief   79.52 (−0.27)   76.90 (−1.07)   78.19 (−0.68)

Layer and Rank Ablation Studies

We further analyze the effect of the intervention depth d and the latent adapter dimension r. Increasing the number of adapted layers consistently improves domain generalization performance, indicating that distributing lightweight steering modules throughout the network is more effective than shallow adaptation. We also observe that performance improves as the adapter dimension increases from small values and reaches its optimum at r = 4. Larger dimensions provide no additional benefit and slightly reduce performance, suggesting that compact low-dimensional updates achieve the best balance between adaptation capacity and overfitting.
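
To give a feel for how d and r trade off against trainable capacity, the sketch below counts weights under a simplified adapter layout. Everything here is an assumption for illustration: one down-projection and one up-projection per adapted layer, no biases, no evidence or fusion parameters, and placeholder values `hidden=768` and `d=12`. It is not intended to reproduce the paper's reported parameter fraction.

```python
def steering_params(hidden, r, d, n_encoders=2):
    """Trainable weights for d adapted layers in each encoder, where every
    layer has a hidden->r down-projection and an r->hidden up-projection."""
    return n_encoders * d * 2 * hidden * r

# Placeholder sizes; only the scaling with r is of interest here.
counts = {r: steering_params(hidden=768, r=r, d=12) for r in (1, 2, 4, 8, 16)}
```

Under this layout the count grows linearly in both r and d, so doubling the latent dimension doubles the number of trainable weights. That is one way to read the finding above: capacity beyond r = 4 adds parameters without improving generalization.
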

Layer and rank ablation study

Layer and rank ablation studies. Left: effect of the number of adapted encoder layers (d) on harmonic-mean domain-generalization accuracy. Right: effect of the latent adapter dimension (r). Increasing intervention depth consistently improves performance, while a compact adapter dimension of r = 4 provides the best trade-off between adaptation capacity and generalization.

BibTeX

@article{koleilat2026evi,
  title={Evi-Steer: Learning to Steer Biomedical Vision-Language Models through Efficient and Generalizable Evidential Tuning},
  author={Koleilat, Taha and Rivaz, Hassan and Xiao, Yiming},
  journal={arXiv preprint arXiv:2605.26292},
  year={2026}
}