Spotlight
Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages
Jinyi Hu · Yuan Yao · Chongyi Wang · Shan Wang · Yinxu Pan · Qianyu Chen · Tianyu Yu · Hanghao Wu · Yue Zhao · Haoye Zhang · Xu Han · Yankai Lin · Jiao Xue · Dahai Li · Zhiyuan Liu · Maosong Sun
Recently, there has been a significant surge in multimodal learning, spanning both image-to-text and text-to-image generation. However, this success has largely been limited to English, leaving other languages behind. Building a competitive counterpart in other languages is highly challenging due to the low-resource nature of non-English multimodal data (i.e., the lack of large-scale, high-quality image-text data). In this work, we propose MPM, an effective training paradigm for training large multimodal models in low-resource languages. MPM demonstrates that Multilingual language models can Pivot zero-shot Multimodal learning across languages. Specifically, based on a strong multilingual large language model, multimodal models pretrained on English-only image-text data generalize well to other languages in a zero-shot manner, even surpassing models trained on image-text data in native languages. Taking Chinese as a practice case for MPM, we build the large multimodal models VisCPM for image-to-text and text-to-image generation, which achieve state-of-the-art (open-source) performance in Chinese. To facilitate future research, we open-source the code and model weights at https://anonymous.4open.science/r/VisCPM-8E13.
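As a rough illustration of the pivoting idea described above, the sketch below shows visual features projected into the embedding space of a multilingual LLM, so that grounding learned from English-only image-text pairs can transfer to other languages through the shared language backbone. The class, dimensions, and parameter names are hypothetical and do not reflect the actual VisCPM implementation.

```python
import torch
import torch.nn as nn

class PivotMultimodalSketch(nn.Module):
    """Minimal sketch of the pivot idea, not the VisCPM architecture.

    A projection maps frozen vision-encoder features into the embedding
    space of a multilingual LLM; the LLM attends over visual and text
    tokens jointly, so multimodal alignment learned from English data
    can transfer zero-shot to other languages.
    """

    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        # Projects visual features into the LLM's token-embedding space.
        self.visual_proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, image_features, text_embeddings):
        # image_features: (batch, num_visual_tokens, vision_dim)
        # text_embeddings: (batch, seq_len, llm_dim) from the LLM's embedding layer
        visual_tokens = self.visual_proj(image_features)
        # Prepend visual tokens to the text sequence before feeding the LLM.
        return torch.cat([visual_tokens, text_embeddings], dim=1)

# Example usage with random tensors standing in for real encoder outputs:
model = PivotMultimodalSketch()
img = torch.randn(2, 32, 1024)
txt = torch.randn(2, 16, 4096)
fused = model(img, txt)  # shape (2, 48, 4096), passed on to the multilingual LLM
```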