Abstract
Various benchmarks have been proposed to test linguistic understanding in pre-trained vision & language (VL) models. Here we build on the existence task from the VALSE benchmark (Parcalabescu et al., 2022) which we use to test models' understanding of negation, a particularly interesting issue for multimodal models. However, while such VL benchmarks are useful for measuring model performance, they do not reveal anything about the internal processes through which these models arrive at their outputs in such visio-linguistic tasks. We take inspiration from the growing literature on model interpretability to explain the behaviour of VL models on the understanding of negation. Specifically, we approach these questions through an in-depth analysis of the text encoder in CLIP (Radford et al., 2021), a highly influential VL model. We localise parts of the encoder that process negation and analyse the role of attention heads in this task. Our contributions are threefold. We demonstrate how methods from the language model interpretability literature (such as causal tracing) can be translated to multimodal models and tasks; we provide concrete insights into how CLIP processes negation on the VALSE existence task; and we highlight inherent limitations in the VALSE dataset as a benchmark for linguistic understanding.
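As a minimal illustration of the evaluation setting the abstract describes (not code from the paper), the sketch below scores a caption against a negated foil with CLIP, in the style of the VALSE existence task. It assumes the public openai/clip-vit-base-patch32 checkpoint via Hugging Face transformers; the image file and caption pair are hypothetical placeholders.

```python
# Sketch of a VALSE-existence-style negation probe for CLIP.
# Assumptions: transformers and Pillow are installed; "scene.jpg" is a
# stand-in for a real image (e.g. a photo that contains animals).
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("scene.jpg")                  # hypothetical example image
caption = "There are animals in the picture."    # true caption
foil = "There are no animals in the picture."    # negated foil

inputs = processor(text=[caption, foil], images=image,
                   return_tensors="pt", padding=True)
# logits_per_image has shape (1, 2): image-text similarity per candidate.
logits = model(**inputs).logits_per_image
# A model that handles negation should put more probability on the caption.
print(logits.softmax(dim=-1))
```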
Original language | English |
---|---|
Title of host publication | ALVR 2024 |
Publisher | Association for Computational Linguistics |
Pages | 59-72 |
Number of pages | 14 |
Publication status | Published - Aug 2024 |
Event | Advances in Language and Vision Research (ALVR), Bangkok, Thailand. Duration: 16 Aug 2024 → 16 Aug 2024. Conference number: 3. https://alvr-workshop.github.io/ |
Workshop
Workshop | Advances in Language and Vision Research (ALVR) |
---|---|
Abbreviated title | ALVR |
Country/Territory | Thailand |
City | Bangkok |
Period | 16/08/24 → 16/08/24 |
Internet address | https://alvr-workshop.github.io/ |
Bibliographical note
Publisher Copyright: © 2024 Association for Computational Linguistics.