

Poster in Workshop: Mathematical and Empirical Understanding of Foundation Models (ME-FoMo)

Coordinating Multiple Vision-Language Models for Visual Reasoning

Liangyu Chen · Bo Li · Sheng Shen · Jingkang Yang · Chunyuan Li · Kurt Keutzer · Trevor Darrell · Ziwei Liu

Keywords: [ instruction-based learning ] [ finetuning large language models ] [ visual reasoning ]


Abstract:

Visual reasoning demands multimodal perception and commonsense cognition of the world. Multiple vision-language models (VLMs) have recently been proposed with excellent commonsense reasoning ability in various domains. However, how to harness the collective power of these complementary VLMs is rarely explored. Existing methods, such as ensembling, still struggle to combine these models with the desired higher-order communication. In this work, we propose COLA (code is available at https://anonymous.4open.science/r/visualreasoning), a novel paradigm that coordinates multiple VLMs for visual reasoning. Our key insight is that a language model (LM) can serve as an efficient coordinator to leverage the distinct and complementary capabilities of multiple VLMs. Extensive experiments demonstrate that our finetuning variant, COLA-FT, achieves state-of-the-art performance on outside-knowledge VQA, visual entailment, and visual-spatial reasoning tasks. Through systematic ablation studies and visualizations, we validate that a coordinator LM comprehends the instruction prompts and the separate functionalities of the VLMs, and then coordinates them to enable impressive visual reasoning capabilities.
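The following is a minimal Python sketch of the coordination pattern the abstract describes: each VLM is queried independently, its output is placed into an instruction prompt, and a coordinator LM produces the final answer. All function names, prompt wording, and the stub models are hypothetical illustrations, not the authors' implementation (see the linked repository for COLA itself); the COLA-FT variant would additionally finetune the LM on such prompts, which this sketch does not show.

```python
# Sketch of an LM coordinating multiple VLMs for visual reasoning.
# The VLM and LM calls are stand-in stubs; swap in real model calls.

from typing import Callable, List


def build_coordinator_prompt(question: str, vlm_outputs: List[str]) -> str:
    """Assemble an instruction prompt exposing each VLM's output to the LM."""
    lines = [f"Question: {question}"]
    for i, output in enumerate(vlm_outputs, start=1):
        lines.append(f"VLM {i} says: {output}")
    lines.append("Considering the VLM outputs above, answer the question.")
    return "\n".join(lines)


def coordinate(question: str,
               vlms: List[Callable[[str], str]],
               lm: Callable[[str], str]) -> str:
    """Query each VLM independently, then let the coordinator LM answer."""
    vlm_outputs = [vlm(question) for vlm in vlms]
    prompt = build_coordinator_prompt(question, vlm_outputs)
    return lm(prompt)


# Illustrative stubs only (hypothetical outputs, not real model calls).
vlm_captioner = lambda q: "A man is holding an umbrella in the rain."
vlm_qa = lambda q: "Plausible answer: to stay dry."
lm_coordinator = lambda p: "He is holding the umbrella to stay dry in the rain."

print(coordinate("Why is the man holding an umbrella?",
                 [vlm_captioner, vlm_qa],
                 lm_coordinator))
```

The design choice this illustrates is that the LM never sees the image directly; it reasons only over the complementary textual evidence the VLMs provide, which is what lets a single coordinator combine models with different strengths.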
