Invited Talk in Workshop: Multimodal Representation Learning (MRL): Perks and Pitfalls
Injecting large models with new modalities for Video Understanding
Arsha Nagrani
Abstract:
Large models have had an 'explosion' moment recently, achieving state-of-the-art results across a wide range of benchmarks and tasks. In this talk we discuss how they can be adapted to novel vision and audio inputs for multimodal tasks, either by influencing model design or by serving as frozen components in multimodal architectures. We focus on multimodal video captioning tasks such as ASR and automatic audio description (AD) for movies, and cover several papers recently accepted at CVPR 2023.