Abstract
In this paper we give a narrative review of multi-modal video-language (VidL) models. We introduce the current landscape of VidL models and benchmarks, and draw inspiration from neuroscience and cognitive science to propose avenues for future research in VidL models in particular and artificial intelligence (AI) in general. We argue that iterative feedback loops between AI, neuroscience, and cognitive science are essential to spur progress across these disciplines. We motivate why we focus specifically on VidL models and their benchmarks as a promising type of model to bring improvements in AI and categorise current VidL efforts across multiple ‘cognitive relevance axioms’. Finally, we provide suggestions on how to effectively incorporate this interdisciplinary viewpoint into research on VidL models in particular and AI in general. In doing so, we hope to create awareness of the potential of VidL models to narrow the gap between neuroscience, cognitive science, and AI.
| Original language | English |
|---|---|
| Title of host publication | What is next in Multimodal Foundational Models? Proceedings of the ICCV Workshop. |
| Place of Publication | Paris, France |
| Publisher | IEEE |
| Pages | 1-14 |
| Number of pages | 14 |
| Publication status | Published - 2023 |
Keywords
- vision and language