Skip to yearly menu bar Skip to main content


Poster

LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment

Bin Zhu · Bin Lin · Munan Ning · YANG YAN · Jiaxi Cui · WANG HongFa · Yatian Pang · Wenhao Jiang · Junwu Zhang · Zongwei Li · Cai Zhang · Zhifeng Li · Wei Liu · Yuan Li

Halle B
[ ]
Wed 8 May 7:30 a.m. PDT — 9:30 a.m. PDT

Abstract: The video-language (VL) pretraining has achieved remarkable improvement in multiple downstream tasks. However, the current VL pretraining framework is hard to extend to multiple modalities (N modalities, $N\geq3$) beyond vision and language. We thus propose LanguageBind, taking the language as the bind across different modalities because the language modality is well-explored and contains rich semantics. Specifically, we freeze the language encoder acquired by VL pretraining, then train encoders for other modalities with contrastive learning. As a result, all modalities are mapped to a shared feature space, implementing multi-modal semantic alignment. While LanguageBind ensures that we can extend VL modalities to N modalities, we also need a high-quality dataset with alignment data pairs centered on language. We thus propose VIDAL-10M with Video, Infrared, Depth, Audio and their corresponding Language, naming as VIDAL-10M. In our VIDAL-10M, all videos are from short video platforms with complete semantics rather than truncated segments from long videos, and all the video, depth, infrared, and audio modalities are aligned to their textual descriptions. After pretraining on VIDAL-10M, we outperform ImageBind by 1.2% R@1 on the MSR-VTT dataset with only 15% of the parameters in the zero-shot video-text retrieval, validating the high quality of our dataset. Beyond this, our LanguageBind has achieved great improvement in the zero-shot video, audio, depth, and infrared understanding tasks. For instance, on the LLVIP and NYU-D datasets, LanguageBind outperforms ImageBind-huge with 23.8% and 11.1% top-1 accuracy.

Chat is not available.