Gradations of Error Severity in Automatic Image Descriptions

Emiel van Miltenburg, Wei-Ting Lu, Emiel Krahmer, Albert Gatt, Guanyi Chen, Lin Li, Kees van Deemter

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Academic › peer-reviewed

    Abstract

    Earlier research has shown that evaluation metrics based on textual similarity (e.g., BLEU, CIDEr, Meteor) do not correlate well with human evaluation scores for automatically generated text. We carried out an experiment with Chinese speakers, where we systematically manipulated image descriptions to contain different kinds of errors. Because our manipulated descriptions form minimal pairs with the reference descriptions, we are able to assess the impact of different kinds of errors on the perceived quality of the descriptions. Our results show that different kinds of errors elicit significantly different evaluation scores, even though all erroneous descriptions differ in only one character from the reference descriptions. Evaluation metrics based solely on textual similarity are unable to capture these differences, which (at least partially) explains their poor correlation with human judgments. Our work provides the foundations for future work, where we aim to understand why different errors are seen as more or less severe.
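
    To illustrate why metrics based purely on string overlap cannot separate error types, consider the following minimal sketch (not taken from the paper; the Chinese sentences and error labels are hypothetical stand-ins for the paper's manipulated descriptions). It computes character-level BLEU with NLTK for two candidates that each differ from the reference in a single character at the same position: one introduces a meaning-changing object error, the other a simple typo. Both receive exactly the same BLEU score, even though human raters would likely judge them very differently.

        from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

        # Hypothetical reference: "a brown dog running on the grass"
        reference = list("一只棕色的狗在草地上奔跑")

        # Two manipulated candidates, each one character away from the reference
        cand_object = list("一只棕色的猫在草地上奔跑")  # meaning-changing error: dog -> cat
        cand_typo   = list("一只棕色的够在草地上奔跑")  # typo: wrong character at the same position

        smooth = SmoothingFunction().method1
        for label, cand in [("object error", cand_object), ("typo", cand_typo)]:
            score = sentence_bleu([reference], cand, smoothing_function=smooth)
            print(label, round(score, 4))  # identical scores for both error types

    Because both candidates break exactly the same n-grams, any purely text-similarity-based metric assigns them the same score, which is the behaviour the human evaluation in the paper shows to be inadequate.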
    Original language: English
    Title of host publication: Proceedings of the 13th International Conference on Natural Language Generation
    Editors: Brian Davis, Yvette Graham, John Kelleher, Yaji Sripada
    Place of publication: Dublin, Ireland
    Publisher: Association for Computational Linguistics
    Pages: 398-411
    Number of pages: 14
    Publication status: Published - 1 Dec 2020
