Workshop
Multimodal Representation Learning (MRL): Perks and Pitfalls
Adrián Javaloy · Miguel Vasco · Imant Daunhawer · Petra Poklukar · Yuge Shi · Danica Kragic · Isabel Valera
Virtual
Fri 5 May, midnight PDT
Driven by advances in deep learning, multimodal machine learning has made steady progress and become ubiquitous in many domains. Learning representations from multiple modalities can be beneficial, since different perceptual modalities can inform each other and ground abstract phenomena in a more robust, generalisable way. However, the complexity of different modalities can hinder the training process, requiring careful model design in order to learn meaningful representations. In light of these seemingly conflicting aspects of multimodal learning, we must improve our understanding of what makes each modality different, how modalities interact, and what the desiderata of multimodal representations are. With this workshop, we aim to bring the multimodal community together, promoting work on multimodal representation learning that provides systematic insights into the nature of the learned representations, as well as ways to improve and understand the training of multimodal models, from both a theoretical and an empirical point of view.

In particular, we focus on the following questions:

(Representation) How do we identify useful properties of multimodal representations?

(Training) How can we promote useful properties of multimodal representations?

(Modalities) What makes a modality different? How can we improve their interactions?

The MRL workshop aims to bring together experts from the multimodal learning community to advance these fundamental questions and discuss the future of the field. We invite submissions that analyse the properties of multimodal representations, offer insights into interactions across modalities, and present novel applications regarding the nature and number of modalities employed.
Schedule
Fri 12:00 a.m. - 12:10 a.m. | Introduction and Opening Remarks (Intro)
Fri 12:10 a.m. - 12:40 a.m. | Foundations of Multimodal Machine Learning: Principles, Challenges, and Open Questions (Invited Talk) | Paul Pu Liang
Fri 12:45 a.m. - 12:55 a.m. | Analyzing Multimodal Objectives Through the Lens of Generative Diffusion Guidance (Poster) | Chaerin Kong · Nojun Kwak
Fri 12:55 a.m. - 1:00 a.m. | Q&A | Chaerin Kong · Nojun Kwak
Fri 1:00 a.m. - 1:10 a.m. | Hyperbolic Image-Text Representations (Poster) | Karan Desai · Maximilian Nickel · Tanmay Rajpurohit · Justin Johnson · Shanmukha Ramakrishna Vedantam
Fri 1:11 a.m. - 1:20 a.m. | Coffee Break
Fri 1:20 a.m. - 2:00 a.m. | Compositionality and Abstraction in Multimodal Learning (Invited Talk) | Zeynep Akata
Fri 2:00 a.m. - 2:03 a.m. | Interpreting Multimodal Video Transformers Using Brain Recordings (Poster) | Tianai Dong · Mariya Toneva
Fri 2:03 a.m. - 2:06 a.m. | A Picture is Worth a Thousand Words: Language Models Plan from Pixels (Poster) | Anthony Z Liu · Lajanugen Logeswaran · Sungryull Sohn · Honglak Lee
Fri 2:06 a.m. - 2:09 a.m. | Dynamic Pretraining of Vision-Language Models (Poster) | AJ Piergiovanni · Weicheng Kuo · Wei Li · Anelia Angelova
Fri 2:09 a.m. - 2:12 a.m. | CHiLS: Zero-shot Image Classification with Hierarchical Label Sets (Poster) | Zachary Novack · Saurabh Garg · Julian McAuley · Zachary Lipton
Fri 2:12 a.m. - 2:15 a.m. | Towards understanding the modality gap in CLIP (Poster) | Peiyang Shi · Michael Welle · Mårten Björkman · Danica Kragic
Fri 2:18 a.m. - 3:00 a.m. | Poster Session
Fri 3:00 a.m. - 4:30 a.m. | Lunch Break
Fri 4:30 a.m. - 5:10 a.m. | Learning Visual Features Enriched by Audio or Language (Invited Talk) | Kristen Grauman
Fri 5:10 a.m. - 5:13 a.m. | Using Multimodal DNNs to Localize Vision-Language Integration in the Brain (Poster) | Vighnesh Subramaniam · Colin Conwell · Christopher Wang · Gabriel Kreiman · Boris Katz · Ignacio Cases · Andrei Barbu
Fri 5:13 a.m. - 5:16 a.m. | The Role of Pre-training Data in Transfer Learning (Poster) | Rahim Entezari · Mitchell Wortsman · Olga Saukh · Moein Shariatnia · Hanie Sedghi · Ludwig Schmidt
Fri 5:16 a.m. - 5:19 a.m. | Multimodal Subtask Graph Generation from Instructional Videos (Poster) | Yunseok Jang · Sungryull Sohn · Tiange Luo · Lajanugen Logeswaran · Moontae Lee · Honglak Lee
Fri 5:19 a.m. - 5:22 a.m. | Exploiting Category Names for Few-Shot Classification with Vision-Language Models (Poster) | Taihong Xiao · Zirui Wang · Liangliang Cao · Jiahui Yu · Shengyang Dai · Ming-Hsuan Yang
Fri 5:22 a.m. - 5:25 a.m. | Classifier-free guidance makes image captioning models more descriptive (Poster) | Simon Kornblith · Lala Li · Zirui Wang · Thao Nguyen
Fri 5:25 a.m. - 5:28 a.m. | Impossibility of Collective Intelligence (Poster) | Krikamol Muandet
Fri 5:28 a.m. - 6:10 a.m. | Poster Session
Fri 6:10 a.m. - 6:20 a.m. | Instruction-Finetuned Foundation Models for Multimodal Web Navigation (Poster) | Hiroki Furuta · Ofir Nachum · Kuang-Huei Lee · Yutaka Matsuo · Shixiang Gu · Izzeddin Gur
Fri 6:20 a.m. - 6:25 a.m. | Q&A | Hiroki Furuta · Ofir Nachum · Kuang-Huei Lee · Yutaka Matsuo · Shixiang Gu · Izzeddin Gur
Fri 6:25 a.m. - 6:35 a.m. | SemDeDup: Data-efficient learning at web-scale through semantic deduplication (Poster) | Amro Kamal · Kushal Tirumala · Daniel Simig · Surya Ganguli · Ari Morcos
Fri 6:35 a.m. - 6:40 a.m. | Q&A | Amro Kamal · Kushal Tirumala · Daniel Simig · Surya Ganguli · Ari Morcos
Fri 6:43 a.m. - 6:45 a.m. | Coffee Break
Fri 6:45 a.m. - 7:15 a.m. | Injecting large models with new modalities for Video Understanding (Invited Talk) | Arsha Nagrani
Fri 7:20 a.m. - 7:50 a.m. | Towards Structured Multimodal Representations (Invited Talk) | Siddharth N
Fri 7:50 a.m. - 8:00 a.m. | Coffee Break
Fri 8:00 a.m. - 8:45 a.m. | The Perks and Pitfalls of MRL (Panel) | Arsha Nagrani · Luca Moschella · Paul Pu Liang · Siddharth N · Valentino Maiorca
Fri 8:45 a.m. - 9:00 a.m. | Closing Remarks (Closing)
- | Text-to-Image Diffusion Models are Zero-Shot Classifiers (Poster) | Kevin Clark · Priyank Jaini