Poster in Workshop: Multimodal Representation Learning (MRL): Perks and Pitfalls
Hyperbolic Image-Text Representations
Karan Desai · Maximilian Nickel · Tanmay Rajpurohit · Justin Johnson · Shanmukha Ramakrishna Vedantam
Keywords: [ transformers ] [ representation learning ] [ vision and language ] [ riemannian geometry ]
Visual and linguistic concepts naturally organize themselves in a hierarchy, where a textual concept 'dog' entails all images that contain dogs. Despite this intuitive structure, current large-scale vision and language models such as CLIP do not explicitly capture such a hierarchy. We propose MERU, a contrastive model that yields hyperbolic representations of images and text. Hyperbolic manifolds have suitable geometric properties for embedding tree-like data, so MERU can better capture the underlying hierarchy in image-text datasets. Our results show that MERU learns a highly interpretable and structured representation space while matching or improving CLIP's performance on standard transfer tasks such as zero-shot classification and retrieval, and while remaining suitable for resource-constrained deployment.
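To make the idea of hyperbolic image-text representations concrete, below is a minimal sketch (not the authors' code) of one common construction: encoder outputs are lifted onto the Lorentz (hyperboloid) model of hyperbolic space, and the negative geodesic distance is used as the similarity score in a CLIP-style contrastive loss. The curvature value, temperature, and the lifting step are illustrative assumptions, not details taken from the abstract.

```python
# Sketch: contrastive matching of image/text embeddings in hyperbolic space,
# using the Lorentz model. All specific choices (curvature, temperature,
# lifting via the time-like coordinate) are assumptions for illustration.
import torch

CURV = 1.0  # assumed curvature magnitude c; the manifold has curvature -c


def lift_to_hyperboloid(v: torch.Tensor) -> torch.Tensor:
    """Treat encoder outputs v (batch, d) as the space-like part of a point on
    the hyperboloid {x : <x, x>_L = -1/c} and solve for the time-like part."""
    time = torch.sqrt(1.0 / CURV + (v * v).sum(dim=-1, keepdim=True))
    return torch.cat([time, v], dim=-1)  # shape (batch, d + 1)


def lorentz_inner(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Pairwise Lorentzian inner product <x, y>_L = -x0*y0 + <x_space, y_space>."""
    space = x[:, 1:] @ y[:, 1:].T
    time = x[:, :1] @ y[:, :1].T
    return space - time


def hyperbolic_distance(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Geodesic distance on the hyperboloid of curvature -c."""
    inner = torch.clamp(-CURV * lorentz_inner(x, y), min=1.0 + 1e-6)
    return torch.acosh(inner) / CURV ** 0.5


# Toy usage: negative distances act as logits for a CLIP-style contrastive loss,
# pairing the i-th image with the i-th caption in the batch.
img = lift_to_hyperboloid(torch.randn(4, 16))
txt = lift_to_hyperboloid(torch.randn(4, 16))
logits = -hyperbolic_distance(img, txt) / 0.07  # 0.07: assumed temperature
loss = torch.nn.functional.cross_entropy(logits, torch.arange(4))
print(loss.item())
```

In such a setup, points near the origin of the hyperboloid can act like generic, high-level concepts while points farther out act like specific ones, which is what makes hyperbolic geometry a natural fit for the visual-semantic hierarchy the abstract describes.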