Video-and-Language (VidL) models and their cognitive relevance

A Zonneveld, A Gatt, I Calixto

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Academic › peer-review

Abstract

In this paper we give a narrative review of multi-modal video-language (VidL) models. We introduce the current landscape of VidL models and benchmarks, and draw inspiration from neuroscience and cognitive science to propose avenues for future research on VidL models in particular and artificial intelligence (AI) in general. We argue that iterative feedback loops between AI, neuroscience, and cognitive science are essential to spur progress across these disciplines. We motivate why we focus specifically on VidL models and their benchmarks as a promising type of model to bring improvements in AI, and categorise current VidL efforts across multiple 'cognitive relevance axioms'. Finally, we provide suggestions on how to effectively incorporate this interdisciplinary viewpoint into research on VidL models in particular and AI in general. In doing so, we hope to create awareness of the potential of VidL models to narrow the gap between neuroscience, cognitive science, and AI.
Original language: English
Title of host publication: What is next in Multimodal Foundational Models? Proceedings of the ICCV Workshop
Place of publication: Paris, France
Publisher: IEEE
Pages: 1-14
Number of pages: 14
Publication status: Published - 2023

Keywords

  • vision and language
