ICLR Poster Beyond task performance: evaluating and reducing the flaws of large multimodal models with in-context-learning

Poster

Beyond task performance: evaluating and reducing the flaws of large multimodal models with in-context-learning

Mustafa Shukor · Alexandre Rame · Corentin Dancette · MATTHIEU CORD

Halle B

[ Abstract ]

[ OpenReview]

Abstract:

Following the success of Large Language Models (LLMs), Large Multimodal Models (LMMs), such as the Flamingo model and its subsequent competitors, have started to emerge as natural step towards generalist agents. However, interacting with recent LMMs reveals major limitations that are hardly captured by the current evaluation benchmarks. Indeed, task performances (e.g., VQA accuracy) alone do not provide enough clues to understand their real capabilities, limitations, and to which extent such models are aligned to human expectations. To refine our understanding on those flaws, we deviate from the current evaluation paradigm, and (1) evaluate 8 recent open-source LMMs (based on the Flamingo architecture such as OpenFlamingo and IDEFICS) on 5 different axes; hallucinations, abstention, compositionality, explainability and instruction following. Our evaluation on these axes reveals major flaws in LMMs. To efficiently address these problems, and inspired by the success of In-Context Learning (ICL) in LLMs, (2) we explore ICL as a solution, and study how it affects these limitations. Based on our ICL study, (3) we push ICL further, and propose new multimodal ICL approaches such as; Multitask-ICL, Chain-of-Hindsight-ICL and Self-Correcting-ICL. Our findings are as follows. (1) Despite their success, LMMs have flaws that remain unsolved with scaling alone. (2) The effect of ICL on LMMs flaws is nuanced; despite its effectiveness for improved explainability, abstention and instruction following, ICL does not improve compositional abilities, and actually even amplifies hallucinations. (3) The proposed ICL variants are promising as post-hoc approaches to efficiently tackle some of those flaws. The code will be made public.

Chat is not available.