Singular Value Few-shot Adaptation of Vision-Language Models

Taha Koleilat · Hassan Rivaz · Yiming Xiao
Concordia University · IMPACT Lab · Health-X Lab

arXiv preprint (2025).

CLIP-SVD overview

CLIP-SVD adapts vision–language models by fine-tuning only singular values, achieving state-of-the-art few-shot performance using just 0.04% of parameters.

Abstract

Vision-language models (VLMs) like CLIP have shown impressive zero-shot and few-shot learning capabilities across diverse applications. However, adapting these models to new fine-grained domains remains difficult due to reliance on prompt engineering and the high cost of full model fine-tuning. Existing adaptation approaches rely on augmented components, such as prompt tokens and adapter modules, which could limit adaptation quality, destabilize the model, and compromise the rich knowledge learned during pretraining. In this work, we present CLIP-SVD, a novel multi-modal and parameter-efficient adaptation technique that leverages Singular Value Decomposition (SVD) to modify the internal parameter space of CLIP without injecting additional modules. Specifically, we fine-tune only the singular values of the CLIP parameter matrices to rescale the basis vectors for domain adaptation while retaining the pretrained model. This design enables enhanced adaptation performance using only 0.04% of the model's total parameters and better preservation of its generalization ability. CLIP-SVD achieves state-of-the-art classification results on 11 natural and 10 biomedical datasets, outperforming previous methods in both accuracy and generalization under few-shot settings. Additionally, we leverage a natural language-based approach to analyze the effectiveness and dynamics of the CLIP adaptation to enable interpretability of CLIP-SVD.
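In symbols (our shorthand, not notation taken from the paper): each pretrained weight matrix is decomposed once by SVD, the singular-vector bases are frozen, and only the spectrum is re-learned during adaptation.

```latex
% Shorthand for the adaptation described in the abstract; the exact
% parameterization of the singular-value update is our assumption.
\[
W = U \Sigma V^{\top}
\;\;\longrightarrow\;\;
W' = U \Sigma' V^{\top},
\qquad
\Sigma' = \operatorname{diag}(\sigma'_1, \dots, \sigma'_r),
\]
\[
\text{trainable: } \sigma'_1, \dots, \sigma'_r,
\qquad
\text{frozen: } U,\; V \text{ (and all other pretrained weights).}
\]
```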

Method

  • SVD-based Adaptation: Fine-tunes only the singular values of the pretrained weight matrices, keeping the singular-vector bases frozen (a minimal sketch follows the figure below).
  • Extreme Parameter Efficiency: Only about 0.04% of the model's parameters are trainable.
  • Multi-modal: Adapts both the vision and text encoders; demonstrated on CLIP and BiomedCLIP.
  • Preserved Generalization: Keeps the pretrained bases intact.
  • Interpretability: Natural-language analysis of singular-value dynamics.
CLIP-SVD method
CLIP-SVD adapts models by rescaling singular values while preserving pretrained bases.
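Below is a minimal PyTorch sketch of this idea. It is not the authors' released code: the class name `SVDLinear`, the choice to wrap individual `nn.Linear` layers, and the decision to train the singular values directly are our own assumptions about how such a reparameterization could be wired up.

```python
# Minimal sketch of singular-value-only fine-tuning for one linear layer.
# Illustrative only: class/variable names are ours, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SVDLinear(nn.Module):
    """Wraps a pretrained nn.Linear so that only its singular values train."""

    def __init__(self, linear: nn.Linear):
        super().__init__()
        # One-time SVD of the pretrained weight: W = U @ diag(S) @ Vh
        U, S, Vh = torch.linalg.svd(linear.weight.data, full_matrices=False)
        self.register_buffer("U", U)            # frozen left singular vectors
        self.register_buffer("Vh", Vh)          # frozen right singular vectors
        self.sigma = nn.Parameter(S.clone())    # trainable singular values
        if linear.bias is not None:
            self.register_buffer("bias", linear.bias.data.clone())  # frozen bias
        else:
            self.bias = None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Reassemble the weight from the frozen bases and the learned spectrum.
        W = self.U @ torch.diag(self.sigma) @ self.Vh
        return F.linear(x, W, self.bias)


# Example: wrap one layer and check the trainable fraction.
layer = nn.Linear(768, 768)
svd_layer = SVDLinear(layer)
trainable = sum(p.numel() for p in svd_layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable fraction for this layer: {trainable / total:.4%}")
```

The per-layer fraction printed here (roughly 0.1% for a square 768x768 weight) is larger than the model-wide 0.04% quoted above; that figure is taken over all of CLIP's parameters, and the exact value depends on which weight matrices are reparameterized.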

Results

Natural Few-shot Evaluation

Accuracy (%) averaged over the 11 natural datasets, with K labeled examples (shots) per class.

Method             K = 1    K = 2    K = 4    K = 8    K = 16
Zero-shot CLIP     65.36 (same for all K)
CoOp               68.09    70.13    73.59    76.45    79.01
MaPLe              69.27    72.58    75.37    78.89    81.79
CLIP-LoRA          72.20    75.41    77.32    80.10    82.89
CLIP-SVD (Ours)    73.20    76.06    78.18    80.55    82.97

Biomedical Few-shot Evaluation

Accuracy (%) averaged over the 10 biomedical datasets, with K labeled examples (shots) per class.

Method                  K = 1    K = 2    K = 4    K = 8    K = 16
Zero-shot BiomedCLIP    42.38 (same for all K)
CoOp                    52.59    55.71    61.35    67.74    71.48
BiomedCoOp              56.87    59.32    64.34    68.96    73.41
CLIP-SVD (Ours)         56.35    62.63    68.02    73.26    76.46

Base-to-Novel Generalization (Natural)

Accuracy (%) on base (seen) and novel (unseen) classes; HM is the harmonic mean of the two.

Method             Base     Novel    HM
CLIP               69.34    74.22    71.70
CoOp               82.69    63.22    71.66
CoCoOp             80.47    71.69    75.83
KgCoOp             80.73    73.60    77.00
ProGrad            82.48    70.75    76.16
MaPLe              82.28    75.14    78.55
IVLP               84.21    71.79    77.51
GDA                83.96    74.53    78.72
TCP                84.13    75.36    79.51
CLIP-LoRA          84.10    74.80    79.18
CLIP-SVD (Ours)    84.38    76.29    80.13

Base-to-Novel Generalization (Biomedical)

Method             Base     Novel    HM
BiomedCLIP         49.27    67.17    55.23
CoOp               76.71    65.34    68.80
CoCoOp             75.52    67.74    69.11
KgCoOp             71.90    65.94    67.22
ProGrad            75.69    67.33    69.86
MaPLe              65.40    49.51    53.10
XCoOp              74.62    63.19    68.43
BiomedCoOp         78.60    73.90    74.04
GDA                57.70    64.66    60.98
DCPL               73.70    69.35    71.46
CLIP-LoRA          70.56    59.84    64.76
CLIP-SVD (Ours)    82.64    74.31    78.25

Interpretability Evaluation

CLIP-SVD enables a principled analysis of adaptation dynamics by inspecting ranked singular value updates and mapping them to natural language concepts. This facilitates interpretation of which semantic dimensions are amplified or suppressed during few-shot learning, while preserving the pretrained representation structure.

Interpretability analysis of CLIP-SVD using singular value rankings
Interpretability analysis of CLIP-SVD. Ranked singular value updates are associated with natural language descriptions, revealing task-relevant semantic shifts while maintaining stable lower-rank representations.
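One way the analysis described above could be organized in code is sketched below, assuming access to the singular values before and after adaptation and to text embeddings of candidate concept phrases. All function names, the concept-matching step, and the simplification that singular-vector directions can be compared with text embeddings by cosine similarity are ours, not the paper's procedure.

```python
# Hedged sketch: rank singular values by how much adaptation changed them,
# then label the corresponding right singular vectors with the closest
# natural-language concept. Names and the matching step are illustrative.
import torch
import torch.nn.functional as F


def rank_singular_value_shifts(sigma_pre: torch.Tensor,
                               sigma_post: torch.Tensor,
                               top_k: int = 5) -> torch.Tensor:
    """Indices of the singular values with the largest absolute change."""
    shift = (sigma_post - sigma_pre).abs()
    return shift.argsort(descending=True)[:top_k]


def describe_directions(Vh: torch.Tensor,
                        concept_embeddings: torch.Tensor,
                        indices: torch.Tensor) -> list:
    """Nearest concept (by cosine similarity) for each selected singular vector.

    Assumes the singular vectors and the concept embeddings live in a
    comparable space (e.g., directions of a text-side projection matrix).
    """
    directions = F.normalize(Vh[indices], dim=-1)        # (top_k, d)
    concepts = F.normalize(concept_embeddings, dim=-1)   # (num_concepts, d)
    return (directions @ concepts.T).argmax(dim=-1).tolist()


# Toy usage with random tensors, just to show the shapes involved.
sigma_pre, sigma_post = torch.rand(512), torch.rand(512)
Vh = torch.randn(512, 512)                  # right singular vectors, one per row
concept_embeddings = torch.randn(100, 512)  # e.g., text embeddings of 100 concept phrases
top = rank_singular_value_shifts(sigma_pre, sigma_post)
print(describe_directions(Vh, concept_embeddings, top))
```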

BibTeX

@article{koleilat2025singular,
  title={Singular Value Few-shot Adaptation of Vision-Language Models},
  author={Koleilat, Taha and Rivaz, Hassan and Xiao, Yiming},
  journal={arXiv preprint arXiv:2509.03740},
  year={2025}
}