

Poster in Workshop: Mathematical and Empirical Understanding of Foundation Models (ME-FoMo)

Variational prompt tuning improves generalization of vision-language foundation models

Mohammad Mahdi Derakhshani · Enrique Sanchez · Adrian Bulat · Victor Guilherme Turrisi da Costa · Cees G Snoek · Georgios Tzimiropoulos · Brais Martinez

Keywords: [ vision and language models ] [ prompt tuning ] [ variational inference ] [ foundation models ]


Abstract:

Using prompt tuning, large vision-language foundation models can be adapted to downstream tasks by treating part of the input language prompts as learnable parameters while freezing the rest. However, existing work on prompt tuning may damage the generalization capabilities of foundation models. To avoid this limitation, we propose probabilistic modeling of the underlying distribution of prompts, allowing prompts within the support of an associated concept to be derived through stochastic sampling. This results in a more complete and richer transfer of the information captured by the language model, providing better generalization capabilities for downstream tasks. The resulting algorithm relies on a simple yet powerful variational framework that can be directly integrated with other developments. We show that our approach integrates seamlessly into both standard and conditional prompt learning frameworks, improving performance considerably in both cases, especially with regard to preserving the generalization capability of the original model. Our method sets the current state of the art for prompt learning, surpassing CoCoOp by 1.6% average Top-1 accuracy on the standard benchmark. Remarkably, it even surpasses the original CLIP model in terms of generalization to new classes. The implementation code will be released.
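For concreteness, the sketch below illustrates the kind of variational prompt learner the abstract describes: the learnable context tokens of a CLIP-style prompt are modeled as a diagonal Gaussian, prompt samples are drawn via the reparameterization trick, and a KL term regularizes the posterior toward a standard-normal prior. This is a minimal illustration, not the authors' released implementation; the names (`VariationalPromptLearner`, `n_ctx`, `ctx_dim`, `clip_text_loss`, `kl_weight`) and the choice of prior are assumptions.

```python
# Minimal sketch of variational prompt tuning (illustrative, not the
# authors' code). Learnable context tokens are modeled as a diagonal
# Gaussian; prompts are sampled with the reparameterization trick and a
# KL term pulls the posterior toward a standard-normal prior.

import torch
import torch.nn as nn


class VariationalPromptLearner(nn.Module):
    def __init__(self, n_ctx: int = 16, ctx_dim: int = 512):
        super().__init__()
        # Posterior parameters over the context tokens.
        self.mu = nn.Parameter(torch.zeros(n_ctx, ctx_dim))
        self.log_var = nn.Parameter(torch.full((n_ctx, ctx_dim), -4.0))

    def forward(self, n_samples: int = 4) -> torch.Tensor:
        # Reparameterization trick: ctx = mu + sigma * eps, eps ~ N(0, I),
        # so gradients flow through mu and log_var.
        std = torch.exp(0.5 * self.log_var)
        eps = torch.randn(n_samples, *self.mu.shape, device=self.mu.device)
        return self.mu.unsqueeze(0) + std.unsqueeze(0) * eps  # (S, n_ctx, d)

    def kl_divergence(self) -> torch.Tensor:
        # Closed-form KL(q(ctx) || N(0, I)) for a diagonal Gaussian.
        return 0.5 * torch.sum(
            self.mu.pow(2) + self.log_var.exp() - 1.0 - self.log_var
        )


def training_step(learner, clip_text_loss, kl_weight: float = 1e-3):
    # Monte Carlo estimate of the task loss over sampled prompts, plus the
    # weighted KL regularizer. `clip_text_loss` stands in for the frozen
    # CLIP contrastive objective and is an assumed callable.
    ctx_samples = learner(n_samples=4)
    task_loss = torch.stack([clip_text_loss(c) for c in ctx_samples]).mean()
    return task_loss + kl_weight * learner.kl_divergence()
```

Averaging the loss over several sampled prompts is what transfers a distribution over prompts, rather than a single point estimate, to the downstream task, which is the mechanism the abstract credits for the improved generalization.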
