Evi-Steer: Learning to Steer Biomedical Vision-Language Models through Efficient and Generalizable Evidential Tuning

MICCAI 2026 Early Accept

Figure (Evi-Steer overview): Evidential cross-modal low-dimensional steering for BiomedCLIP, enabling uncertainty-aware parameter-efficient adaptation with conservative, confidence-weighted updates under few-shot learning and domain shift.

Abstract

Parameter-efficient adaptation of biomedical vision-language foundation models is essential for robust multimodal understanding in low-data and shifted clinical settings. Existing adaptation methods are often deterministic and can apply residual updates even when image-text evidence is ambiguous or unreliable. We present Evi-Steer, an evidential cross-modal low-dimensional steering framework for BiomedCLIP that enables uncertainty-aware fine-tuning while updating only 0.11% of the total model parameters. Evi-Steer performs lightweight token updates in both the vision and text encoders, estimates latent-dimension epistemic uncertainty, and uses these estimates to conservatively gate residual steering. It further introduces Dempster-Shafer cross-modal confidence fusion, conditioning visual adaptation on textual confidence to suppress conflicting or uncertain updates. Across 15 biomedical imaging datasets spanning 8 organs and 8 imaging modalities, Evi-Steer improves few-shot learning and domain generalization, providing a practical pathway for reliable biomedical VLM adaptation.

Method

  • Evidential Low-Dimensional Representation Steering: Evi-Steer projects text and vision activations into a compact latent space, computes lightweight token updates, and estimates latent-dimension epistemic uncertainty in a single forward pass.
  • Confidence-Weighted Residual Updates: Evidence-derived uncertainty is converted into confidence weights, allowing the model to steer strongly when evidence is reliable and adapt conservatively when evidence is weak.
  • Cross-Modal Reliability Fusion: A Dempster-Shafer belief fusion mechanism conditions vision-side updates on textual confidence, suppressing ambiguous or conflicting image-text updates under domain shift.
  • Frozen BiomedCLIP Backbone: The core BiomedCLIP vision and text encoders remain frozen, preserving pretrained representations while training a small set of steering modules.

Figure (overall architecture of Evi-Steer): Overall Evi-Steer pipeline, in which evidential cross-modal low-dimensional adapters generate confidence-weighted representation updates for the text and vision encoders.
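The steering, gating, and fusion steps listed under Method can be illustrated with a small standard-library sketch. This page does not give the equations, so everything below is an assumption rather than the authors' implementation: the helper names (`opinion`, `ds_fuse`, `steer`), the two-hypothesis evidence per latent dimension, the gate `1 - u`, and the toy matrices are all hypothetical. The fusion uses the reduced Dempster rule common in evidential deep learning, which may differ from the rule used in Evi-Steer.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def opinion(evidence):
    """Subjective-logic opinion from non-negative evidence over K hypotheses.

    With Dirichlet parameters alpha_k = e_k + 1 and S = sum(alpha), belief is
    b_k = e_k / S and uncertainty is u = K / S, so sum(b) + u == 1.
    """
    k = len(evidence)
    s = sum(evidence) + k
    return [e / s for e in evidence], k / s

def ds_fuse(op_a, op_b):
    """Reduced Dempster combination of two opinions over the same hypotheses."""
    (b1, u1), (b2, u2) = op_a, op_b
    n = len(b1)
    # Conflict: belief mass the two sources assign to different hypotheses.
    conflict = sum(b1[i] * b2[j] for i in range(n) for j in range(n) if i != j)
    norm = 1.0 - conflict
    fused_b = [(b1[i] * b2[i] + b1[i] * u2 + b2[i] * u1) / norm for i in range(n)]
    return fused_b, (u1 * u2) / norm

def steer(h, down, up, confidence):
    """Confidence-gated residual update: h + up @ (confidence * (down @ h))."""
    z = matvec(down, h)                       # project to r latent dimensions
    gated = [c * zj for c, zj in zip(confidence, z)]
    delta = matvec(up, gated)                 # project back to d dimensions
    return [hi + di for hi, di in zip(h, delta)]

# Toy frozen activation (d = 4) and hypothetical adapter weights (r = 2).
h = [1.0, -2.0, 0.5, 0.0]
down = [[0.5, 0.0, 0.0, 0.0],
        [0.0, 0.5, 0.0, 0.0]]
up = [[1.0, 0.0],
      [0.0, 1.0],
      [0.0, 0.0],
      [0.0, 0.0]]

# Hypothetical evidence per latent dimension from the vision and text sides.
vision_evidence = [[9.0, 0.0], [0.2, 0.1]]
text_evidence = [[6.0, 0.0], [0.0, 0.3]]

# Gate each latent dimension by the confidence of the fused opinion.
confidence = []
for ev_v, ev_t in zip(vision_evidence, text_evidence):
    _, u = ds_fuse(opinion(ev_v), opinion(ev_t))
    confidence.append(1.0 - u)

steered = steer(h, down, up, confidence)
```

In this toy setup the first latent dimension has strong, agreeing evidence and receives a gate above 0.9, while the second has weak, partly conflicting evidence and receives a gate below 0.3, so its residual update is mostly suppressed. The input `h` is never modified, mirroring the frozen-backbone design.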

Results

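Evi-Steer is evaluated in two clinically relevant settings: few-shot adaptation with K = 4, 8, and 16 labeled samples per class, and domain generalization, where models are trained with 16 shots on source datasets and tested on unseen target datasets without additional adaptation. All tables report accuracy (%) averaged over three seeds. In the domain-generalization tables, ID and OOD denote accuracy on the source (in-distribution) and unseen target (out-of-distribution) datasets, and HM is the harmonic mean of the two.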
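As an illustration of the K-shot protocol, the sketch below draws exactly K labeled samples per class for each shot count and seed. The helper `sample_k_shot`, the class names, and the seed values 0, 1, and 2 are hypothetical; the page does not describe the actual sampling code.

```python
import random
from collections import defaultdict

def sample_k_shot(labels, k, seed):
    """Return sorted indices containing exactly k examples of every class."""
    rng = random.Random(seed)          # local RNG so each seed is reproducible
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    chosen = []
    for label in sorted(by_class):     # fixed class order for determinism
        pool = by_class[label]
        if len(pool) < k:
            raise ValueError(f"class {label!r} has fewer than {k} samples")
        chosen.extend(rng.sample(pool, k))
    return sorted(chosen)

# Toy label list standing in for a training split (class names are made up).
labels = ["benign"] * 40 + ["malignant"] * 30 + ["normal"] * 20

# One training subset per (K, seed) pair, matching K = 4, 8, 16 and three seeds.
subsets = {
    (k, seed): sample_k_shot(labels, k, seed)
    for k in (4, 8, 16)
    for seed in (0, 1, 2)
}
```

Each subset holds K times the number of classes indices, and reported accuracy would then be averaged over the three seeds for each K.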

Few-Shot Evaluation

Method                  K=4     K=8     K=16
Zero-shot BiomedCLIP    43.81 (zero-shot, independent of K)
CoOp                    65.52   72.36   76.26
CoCoOp                  60.63   67.75   72.25
KgCoOp                  65.19   70.74   72.48
ProGrad                 66.33   71.76   73.98
BiomedCoOp              67.50   72.43   77.15
LP++                    65.51   70.85   75.42
CLIP-Adapter            47.11   48.51   50.60
Tip-Adapter-F           66.22   72.73   77.60
GDA                     67.34   74.92   77.23
CLIP-LoRA               65.93   72.47   74.75
Evi-Steer (Ours)        71.43   77.33   81.18

Domain Generalization

Method                  ID      OOD     HM
Zero-shot BiomedCLIP    58.27   60.65   59.44
CoOp                    75.93   73.46   74.67
CoCoOp                  74.33   71.32   72.79
ProGrad                 77.05   73.36   75.16
KgCoOp                  75.85   75.03   75.44
GDA                     74.36   70.72   72.49
CLIP-LoRA               75.81   71.69   73.69
BiomedCoOp              76.82   72.30   74.49
Evi-Steer (Ours)        79.78   77.95   78.85

Detailed OOD Transfer

                        Breast Ultrasound                   Brain MRI
Method                  BUSI    BUID    BUSBRA  UDIAT       BTMRI   BTMRI-P  BTMRI-S  BRISC
BiomedCLIP              59.75   75.00   66.78   61.54       56.79   52.80    55.20    52.60
CoOp                    69.49   72.22   68.28   70.49       82.37   76.77    78.13    74.87
CoCoOp                  70.20   67.59   66.55   71.54       78.45   75.43    76.62    70.20
ProGrad                 71.47   69.44   67.25   69.23       82.63   78.00    79.96    76.27
KgCoOp                  70.62   77.78   68.79   75.64       81.07   75.60    79.02    73.37
GDA                     66.81   63.89   63.02   67.95       81.91   77.47    76.75    75.23
CLIP-LoRA               71.42   65.85   65.37   69.82       80.19   77.90    77.68    73.50
BiomedCoOp              70.34   67.59   62.90   71.67       83.30   77.60    79.52    74.50
Evi-Steer (Ours)        73.16   78.70   71.61   78.21       86.39   80.23    80.53    78.43
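The per-dataset numbers can be related to the ID, OOD, and HM summary with a short calculation. The sketch below reads the header as eight datasets (BUSI, BUID, BUSBRA, UDIAT, BTMRI, BTMRI-P, BTMRI-S, BRISC) and assumes that BUSI and BTMRI are the source datasets. The page does not state this split explicitly, but the Evi-Steer row reproduces the summary values under that assumption.

```python
def harmonic_mean(a, b):
    """Harmonic mean of two positive numbers."""
    return 2.0 * a * b / (a + b)

# Evi-Steer (Ours) row of the detailed transfer table, accuracy in %.
evi_steer = {
    "BUSI": 73.16, "BUID": 78.70, "BUSBRA": 71.61, "UDIAT": 78.21,
    "BTMRI": 86.39, "BTMRI-P": 80.23, "BTMRI-S": 80.53, "BRISC": 78.43,
}

# Assumption: these two are the in-distribution (source) datasets.
source = ("BUSI", "BTMRI")

id_scores = [acc for name, acc in evi_steer.items() if name in source]
ood_scores = [acc for name, acc in evi_steer.items() if name not in source]

id_acc = sum(id_scores) / len(id_scores)      # mean over source datasets
ood_acc = sum(ood_scores) / len(ood_scores)   # mean over unseen targets
hm = harmonic_mean(id_acc, ood_acc)

print(f"ID={id_acc:.3f}  OOD={ood_acc:.3f}  HM={hm:.3f}")
```

Under this reading, the three values agree with the reported 79.78, 77.95, and 78.85 to within 0.01. The harmonic mean penalizes a gap between ID and OOD accuracy more than an arithmetic mean would, which is why it is a common single-number summary for domain generalization.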

Ablation Studies

Ablations show that visual adaptation, textual adaptation, evidential updates, and cross-modal belief fusion all contribute to the final domain-generalization performance. Visual adaptation matters most: removing it lowers OOD accuracy by 5.72 points, compared with 1.00 to 1.63 points for the other components.

Effectiveness of Key Components

Method                    ID Acc. (%)      OOD Acc. (%)     HM Acc. (%)
Evi-Steer (Ours)          79.79            77.97            78.87
w/o Visual Adaptation     76.23 (−3.56)    72.25 (−5.72)    74.19 (−4.68)
w/o Textual Adaptation    78.40 (−1.39)    76.97 (−1.00)    77.68 (−1.19)
w/o Evidential Update     79.25 (−0.54)    76.34 (−1.63)    77.77 (−1.10)
w/o Cross-modal Belief    79.52 (−0.27)    76.90 (−1.07)    78.19 (−0.68)

The layer-depth and adapter-dimension analyses further suggest that distributing lightweight adapters across more layers improves generalization, while a compact latent dimension of r = 4 provides the best harmonic-mean accuracy by balancing adaptation capacity and overfitting.

BibTeX

@inproceedings{koleilat2026evisteer,
  title={Evi-Steer: Learning to Steer Biomedical Vision-Language Models through Efficient and Generalizable Evidential Tuning},
  author={Koleilat, Taha and Rivaz, Hassan and Xiao, Yiming},
  booktitle={International Conference on Medical Image Computing and Computer-Assisted Intervention},
  year={2026}
}