Poster

Rephrase, Augment, Reason: Visual Grounding of Questions for Vision-Language Models

Archiki Prasad · Elias Stengel-Eskin · Mohit Bansal

Halle B
Tue 7 May 1:45 a.m. PDT — 3:45 a.m. PDT

Abstract:

An increasing number of vision-language tasks can be handled with little to no training (i.e., in a zero- and few-shot manner) by marrying large language models (LLMs) to vision encoders, resulting in large vision-language models (LVLMs). While this has huge upsides (e.g., not requiring training data or custom architectures), how an input is presented to an LVLM can have a major impact on zero-shot model performance. In particular, inputs phrased in an underspecified way can result in incorrect answers due to factors like missing visual information, complex implicit reasoning, or linguistic ambiguity. Therefore, adding visually-grounded information should improve model performance by reducing underspecification, e.g., by localizing objects and disambiguating references. To this end, we present Rephrase, Augment and Reason (RepARe), a gradient-free framework that extracts salient details about the image using the underlying LVLM as a captioner and reasoner, in order to propose modifications to the original question. We then use the LVLM's confidence over a generated answer as an unsupervised scoring function to select the rephrased question most likely to improve zero-shot performance. Focusing on two visual question answering tasks, we show that RepARe can result in a 3.85 percentage point (absolute) increase in zero-shot performance on VQAv2 and a 6.41 point increase on A-OKVQA. Additionally, we find that using gold answers for oracle selection of question candidates yields a gain in VQA accuracy of up to 14.41 percentage points. Through extensive analysis, we demonstrate that outputs from RepARe increase syntactic complexity and better utilize the frozen language model in LVLMs.
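The abstract describes a three-step, gradient-free pipeline: use the LVLM to extract salient visual details, generate rephrased/augmented question candidates grounded in those details, and select the candidate whose generated answer the LVLM is most confident in. The sketch below illustrates that selection loop only; the function names (`caption`, `rephrase`, `answer_with_confidence`) and their signatures are hypothetical stand-ins, not the authors' released API, and the real method prompts a frozen LVLM for all three calls.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces standing in for queries to the underlying LVLM.
CaptionFn = Callable[[bytes], str]                        # image -> salient visual details
RephraseFn = Callable[[str, str, int], List[str]]         # (question, details, n) -> candidate rephrasings
ConfidenceFn = Callable[[bytes, str], Tuple[str, float]]  # (image, question) -> (answer, confidence score)


def repare_style_select(
    image: bytes,
    question: str,
    caption: CaptionFn,
    rephrase: RephraseFn,
    answer_with_confidence: ConfidenceFn,
    n_candidates: int = 5,
) -> Tuple[str, str]:
    """Candidate selection in the spirit of RepARe (a sketch, not the reference code):
    1. extract salient details about the image with the LVLM itself,
    2. propose rephrased/augmented questions grounded in those details,
    3. keep the candidate whose generated answer receives the highest model confidence.
    """
    details = caption(image)
    # Keep the original question in the pool so the method can fall back to it.
    candidates = [question] + rephrase(question, details, n_candidates)

    best_question, best_answer, best_score = question, "", float("-inf")
    for cand in candidates:
        answer, score = answer_with_confidence(image, cand)
        if score > best_score:
            best_question, best_answer, best_score = cand, answer, score
    return best_question, best_answer
```

With gold answers available, the same loop can instead pick the candidate whose answer matches the reference, which corresponds to the oracle-selection upper bound reported in the abstract.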
