Poster
Improved Probabilistic Image-Text Representations
Sanghyuk Chun
Halle B
The Image-Text Matching (ITM) task, a fundamental vision-language (VL) task, suffers from inherent ambiguity caused by multiplicity and imperfect annotations. Deterministic functions are not sufficiently powerful to capture this ambiguity, prompting a focus on probabilistic embeddings to tackle the challenge. However, the existing probabilistic ITM approach has two drawbacks: heavy computation due to expensive Monte Carlo approximation, and loss saturation under abundant false negatives. To overcome these issues, we propose improved Probabilistic Cross-Modal Embeddings (PCME++), which introduce a new probabilistic distance with a closed-form solution. We also propose two optimization techniques to enhance PCME++ further: first, we incorporate pseudo-positives to prevent loss saturation under massive false negatives; second, we apply mixed sample data augmentation for probabilistic matching. Our experiments on MS-COCO Caption and two extended benchmarks, CxC and ECCV Caption, demonstrate the effectiveness of PCME++ compared to state-of-the-art ITM methods. We also evaluate the robustness of PCME++ under noisy image-text correspondences. In addition, we show the potential applicability of PCME++ in automatic prompt tuning for zero-shot classification. Code is available at https://github.com/naver-ai/pcmepp
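
To illustrate why a closed-form distance avoids the cost of Monte Carlo approximation, here is a minimal sketch. The abstract does not state the distance used by PCME++, so the sketch assumes one standard closed-form quantity: the expected squared Euclidean distance between two independent Gaussians with diagonal covariance, which equals the squared distance between the means plus the sum of all variances. The function names and example values are hypothetical and are not taken from the linked repository.

```python
import numpy as np


def expected_sq_distance(mu_a, var_a, mu_b, var_b):
    """Closed-form E||Za - Zb||^2 for independent diagonal Gaussians.

    Za ~ N(mu_a, diag(var_a)), Zb ~ N(mu_b, diag(var_b)).
    Assumed illustration only, not necessarily the PCME++ distance.
    """
    mu_a, var_a, mu_b, var_b = (
        np.asarray(x, dtype=float) for x in (mu_a, var_a, mu_b, var_b)
    )
    return float(np.sum((mu_a - mu_b) ** 2) + np.sum(var_a + var_b))


def monte_carlo_sq_distance(mu_a, var_a, mu_b, var_b, n_samples=200_000, seed=0):
    """Sampling-based estimate of the same expectation, for comparison."""
    mu_a, var_a, mu_b, var_b = (
        np.asarray(x, dtype=float) for x in (mu_a, var_a, mu_b, var_b)
    )
    rng = np.random.default_rng(seed)
    shape = (n_samples, mu_a.shape[0])
    za = rng.normal(mu_a, np.sqrt(var_a), size=shape)
    zb = rng.normal(mu_b, np.sqrt(var_b), size=shape)
    return float(np.mean(np.sum((za - zb) ** 2, axis=1)))


# Hypothetical 2-D "image" and "text" embeddings with per-dimension variances.
mu_img, var_img = [1.0, 0.0], [0.5, 0.5]
mu_txt, var_txt = [0.0, 2.0], [0.25, 1.0]

closed = expected_sq_distance(mu_img, var_img, mu_txt, var_txt)
sampled = monte_carlo_sq_distance(mu_img, var_img, mu_txt, var_txt)
print(closed)   # 7.25
print(sampled)  # close to 7.25, at the cost of drawing 400,000 samples
```

The closed form needs a single pass over the embedding dimensions per pair, while the sampled estimate scales with the number of samples and is still only approximate.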