Abstract
This dissertation is dedicated to image captioning, the task of automatically generating a natural language description of a given image. Most modern automatic caption generators are trained to produce a straightforward visual description of what can be directly seen in the image. By contrast, a human-written caption may also include information that cannot be inferred from the image alone: references to image-external world knowledge. Exploring ways to enrich automatic image captioning with contextually relevant external knowledge is the main focus of this dissertation.
The general approach we develop begins with the identification and extraction of relevant external knowledge. This step relies on a contextualization anchor: an element of image-related data used to determine which part of the world knowledge available in external resources would be useful for captioning a given image. Through the contextualization anchor, we identify real-world entities that are relevant to the image, which make up an entity context. We further retrieve various facts about these entities, creating an informative knowledge context. We integrate both entity and knowledge contexts into a neural encoder-decoder captioning pipeline as extra sources of information for generating the caption. The goal of the resulting “knowledge-aware” captioning model is to generate captions that are influenced by the relevant external knowledge and possibly include explicit references to it. During evaluation, we pay special attention to measuring factual accuracy, the veridicality of image-external knowledge in the automatically generated captions.
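The sketch below illustrates this data flow in rough, schematic Python. All names, lookup structures and the placeholder caption generator are hypothetical; the actual models are neural encoder-decoders rather than the template-based stand-in shown here.

```python
# Schematic data flow of knowledge-aware captioning (illustrative only, not the dissertation's code).
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Entity:
    name: str       # real-world entity relevant to the image, e.g. "Tower Bridge"
    etype: str      # coarse type, e.g. "bridge"

@dataclass
class Fact:
    subject: str    # entity the fact is about
    statement: str  # short encyclopedic statement, e.g. "was completed in 1894"

def build_entity_context(anchor: Dict) -> List[Entity]:
    # Hypothetical stand-in: in the dissertation, the anchor (e.g. GPS metadata
    # or an accompanying article) is queried against external resources.
    return [Entity(name, etype) for name, etype in anchor.get("entities", [])]

def build_knowledge_context(entities: List[Entity],
                            knowledge_base: Dict[str, List[str]]) -> List[Fact]:
    # Retrieve facts about each relevant entity from an external resource.
    return [Fact(e.name, s) for e in entities for s in knowledge_base.get(e.name, [])]

def generate_caption(image, entities: List[Entity], facts: List[Fact]) -> str:
    # Placeholder for the neural encoder-decoder: both contexts are extra encoder
    # inputs, and entity / fact tokens extend the decoder's output vocabulary.
    head = f"A view of {entities[0].name}" if entities else "A view"
    tail = f", which {facts[0].statement}" if facts else ""
    return head + tail + "."

anchor = {"entities": [("Tower Bridge", "bridge")]}
kb = {"Tower Bridge": ["was completed in 1894"]}
ents = build_entity_context(anchor)
print(generate_caption(None, ents, build_knowledge_context(ents, kb)))
# -> "A view of Tower Bridge, which was completed in 1894."
```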
Based on this approach, we develop three image captioning models. Their training data, which includes two new datasets we compile, contains naturally produced captions with abundant references to external knowledge.
The first model focuses on geographic knowledge in particular. It uses image location metadata as a contextualization anchor to identify geographic entities in and around the image. These entities make up the geographic entity context, which provides extra input for the encoder and an additional vocabulary for the decoder, allowing it to generate entity names in the captions. The evaluation shows a substantial improvement over the standard baseline models, particularly in the ability to correctly produce specific geographic references.
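A minimal sketch of how location metadata could yield a geographic entity context, assuming a small in-memory gazetteer with hypothetical entries; the dissertation's actual geographic resources and distance thresholds may differ.

```python
# Toy geographic contextualization: gazetteer entries near the image's GPS coordinates (illustrative).
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

# Hypothetical gazetteer: (name, type, latitude, longitude).
GAZETTEER = [
    ("Tower Bridge", "bridge", 51.5055, -0.0754),
    ("Borough Market", "market", 51.5055, -0.0910),
    ("Stonehenge", "monument", 51.1789, -1.8262),
]

def geographic_entity_context(img_lat, img_lon, radius_km=2.0):
    """Entities in and around the image location, nearest first."""
    scored = [(haversine_km(img_lat, img_lon, lat, lon), name, etype)
              for name, etype, lat, lon in GAZETTEER]
    return [(name, etype, round(dist, 2)) for dist, name, etype in sorted(scored)
            if dist <= radius_km]

print(geographic_entity_context(51.5079, -0.0877))
# Roughly: Borough Market (~0.4 km) first, then Tower Bridge (~0.9 km); Stonehenge is filtered out.
```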
The second model additionally includes the knowledge context, which consists of diverse encyclopedic facts about the relevant entities. The knowledge context serves as another input to the encoder; in the decoder, it provides extra contextualization for generating regular words and a separate vocabulary for generating fact-related tokens. In our experiments, this model clearly outperforms various baseline models on standard captioning metrics and, importantly, in the accuracy of the generated facts.
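The interface between the knowledge context and the model can be pictured roughly as below: facts are linearised into a token sequence for the encoder, and those tokens also define an extra output vocabulary from which the decoder can produce fact-related words. This is a schematic of the interface only, not the dissertation's implementation.

```python
# Schematic interface between the knowledge context and the encoder/decoder (illustrative).
facts = [
    ("Tower Bridge", "completed", "1894"),
    ("Tower Bridge", "crosses", "River Thames"),
]

def linearise_facts(triples):
    """Flatten (entity, property, value) facts into one token sequence for the encoder."""
    tokens = []
    for subj, prop, value in triples:
        tokens += subj.split() + [prop] + value.split() + ["<fact_sep>"]
    return tokens

def fact_vocabulary(triples, base_vocab_size):
    """Extra decoder vocabulary: fact-related tokens indexed after the regular word vocabulary."""
    extra = sorted({tok for triple in triples for field in triple for tok in field.split()})
    return {tok: base_vocab_size + i for i, tok in enumerate(extra)}

print(linearise_facts(facts))
print(fact_vocabulary(facts, base_vocab_size=10_000))
```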
The third model extends beyond the geographic domain and applies our approach to qualitatively different data: images from newspaper articles. Here, the article itself serves as the contextualization anchor; the entity context is built from named entities of various types (not only geographic) collected from the article text; and the knowledge context includes encyclopedic facts about these entities. The resulting model is able to generate contextualized captions that incorporate information from both the article and an external knowledge base.
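For the news-article setting, constructing the entity context can be sketched as off-the-shelf named entity recognition over the article text; the snippet below uses spaCy purely as an example tool and makes no claim about the pipeline actually used in the dissertation.

```python
# Illustrative multi-type entity context extracted from an article (spaCy used only as an example NER tool).
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

article = ("The mayor of Amsterdam opened a new wing of the Van Gogh Museum "
           "on Tuesday, ahead of the King's Day celebrations.")

doc = nlp(article)
entity_context = [(ent.text, ent.label_) for ent in doc.ents]
print(entity_context)
# e.g. [('Amsterdam', 'GPE'), ('the Van Gogh Museum', 'ORG'), ('Tuesday', 'DATE'), ...]
```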
Field | Value |
---|---|
Original language | English |
Qualification | Doctor of Philosophy |
Awarding Institution | |
Supervisors/Advisors | |
Award date | 5 Apr 2023 |
Publisher | |
Print ISBNs | 978-94-6093-426-1 |
DOIs | |
Publication status | Published - 5 Apr 2023 |
Keywords
- image captioning
- caption generation
- natural language generation
- contextualization
- knowledge integration
- multimodality
- language grounding
- geographic information
- encyclopedic knowledge