Poster
in
Workshop: Multimodal Representation Learning (MRL): Perks and Pitfalls

Hyperbolic Image-Text Representations

Karan Desai · Maximilian Nickel · Tanmay Rajpurohit · Justin Johnson · Shanmukha Ramakrishna Vedantam

Keywords: [ transformers ] [ representation learning ] [ vision and language ] [ riemannian geometry ]


Abstract:

Visual and linguistic concepts naturally organize themselves in a hierarchy, where a textual concept like 'dog' entails all images that contain dogs. Although this hierarchy is intuitive, current large-scale vision and language models such as CLIP do not explicitly capture it. We propose MERU, a contrastive model that yields hyperbolic representations of images and text. Hyperbolic manifolds have suitable geometric properties for embedding tree-like data, so MERU can better capture the underlying hierarchy in image-text datasets. Our results show that MERU learns a highly interpretable and structured representation space while maintaining (or improving) CLIP's performance on standard transfer tasks like zero-shot classification, retrieval, and resource-constrained deployment.
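The abstract states that MERU embeds images and text in hyperbolic space and trains contrastively. As an illustrative sketch only (the specific choice of the Lorentz model, the exponential map at the origin, and the geodesic distance below are assumptions drawn from the general hyperbolic-embedding literature, not details given in this abstract), one can lift Euclidean encoder outputs onto the hyperboloid and compare them with the hyperbolic distance:

```python
import numpy as np

def lorentz_inner(x, y):
    # Lorentzian inner product: -x0*y0 + <x_1.., y_1..>
    return -x[0] * y[0] + np.dot(x[1:], y[1:])

def exp_map_origin(v_space):
    # Lift a tangent vector at the hyperboloid origin (1, 0, ..., 0)
    # onto the Lorentz model of hyperbolic space (curvature -1).
    norm = np.linalg.norm(v_space)
    if norm == 0.0:
        return np.concatenate(([1.0], v_space))
    return np.concatenate(
        ([np.cosh(norm)], np.sinh(norm) * v_space / norm)
    )

def hyperbolic_distance(x, y):
    # Geodesic distance between two points on the hyperboloid;
    # a contrastive loss would use negative distances as logits.
    return np.arccosh(np.clip(-lorentz_inner(x, y), 1.0, None))

# Hypothetical encoder outputs (tangent vectors at the origin).
image_feat = exp_map_origin(np.array([0.3, -0.2]))
text_feat = exp_map_origin(np.array([0.1, 0.4]))
print(hyperbolic_distance(image_feat, text_feat))
```

In such a setup, general concepts tend to settle near the origin of the hyperboloid while specific ones move outward, which is how tree-like entailment structure can be reflected geometrically.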
