

Poster in Workshop: Multimodal Representation Learning (MRL): Perks and Pitfalls

Classifier-free guidance makes image captioning models more descriptive

Simon Kornblith · Lala Li · Zirui Wang · Thao Nguyen

Keywords: [ CIDEr ] [ CLIP ] [ prompt inversion ] [ image captioning ] [ classifier-free guidance ] [ CLIPScore ]


Abstract: Image captioning is conventionally formulated as the task of generating captions that are similar to a set of human-generated reference captions, as measured using evaluation metrics such as CIDEr, ROUGE, and BLEU. Recent work has also explored reference-free captioning metrics based on the distance between generated captions and the corresponding images in the embedding space of a contrastively trained image-text model such as CLIP. Here, we show that it is possible to trade off between reference-free and reference-based captioning metrics by decoding from a single autoregressive captioning model using classifier-free guidance (Ho & Salimans, 2021). Compared to standard greedy decoding, decoding from the same model with a guidance scale of 2 substantially improves caption$\to$image retrieval performance when captions and images are embedded using CLIP (recall@1 44.3% vs. 26.6%) and marginally improves CLIPScore (0.786 vs. 0.761), but greatly worsens standard reference-based captioning metrics (e.g., CIDEr 79.9 vs. 126.3). Manual inspection reveals that higher guidance scales produce more descriptive but less grammatical captions.
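To make the decoding procedure concrete, below is a minimal sketch of greedy decoding with classifier-free guidance for an autoregressive captioner. It assumes a hypothetical `model` whose forward pass takes image features (or `None` for the unconditional branch) together with the tokens generated so far and returns next-token logits; the exact parameterization of the guidance scale is also an assumption, chosen so that a scale of 1 recovers standard conditional decoding.

```python
import torch

def cfg_greedy_decode(model, image_features, bos_id, eos_id,
                      guidance_scale=2.0, max_len=64):
    """Greedy decoding with classifier-free guidance (Ho & Salimans, 2021).

    A minimal sketch, not the authors' implementation. `model` is a
    hypothetical autoregressive captioner: model(image_features, tokens)
    returns logits of shape (batch, seq_len, vocab_size); passing
    image_features=None yields the unconditional (image-dropped) branch.
    """
    tokens = torch.tensor([[bos_id]])
    for _ in range(max_len):
        # Conditional logits: the model conditioned on the image.
        logits_cond = model(image_features, tokens)[:, -1, :]
        # Unconditional logits: the same model with the image input
        # dropped, as in classifier-free guidance training.
        logits_uncond = model(None, tokens)[:, -1, :]
        # Guided logits: push away from the unconditional distribution.
        # guidance_scale = 1 recovers standard conditional decoding;
        # the abstract's results use guidance_scale = 2.
        logits = logits_uncond + guidance_scale * (logits_cond - logits_uncond)
        next_token = logits.argmax(dim=-1, keepdim=True)  # greedy step
        tokens = torch.cat([tokens, next_token], dim=-1)
        if next_token.item() == eos_id:
            break
    return tokens
```

Under this parameterization, larger guidance scales upweight tokens whose probability depends strongly on the image, which is consistent with the reported trade-off: captions become more image-specific (better CLIP retrieval and CLIPScore) while drifting away from the distribution of human reference captions (lower CIDEr).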
