Statistical Methods for Annotation Analysis

Silviu Paun, Ron Artstein, Massimo Poesio

Research output: Chapter in Book/Report/Conference proceeding › Chapter › Academic › peer-review

Abstract

Labelling data is one of the most fundamental activities in science. It has underpinned practice, particularly in medicine, for decades, as well as research in corpus linguistics since at least the development of the Brown corpus. With the shift towards Machine Learning in Artificial Intelligence (AI), the creation of datasets for training and evaluating AI systems, also known in AI as corpora, has become a central activity in the field as well. Early AI datasets were created on an ad-hoc basis to tackle specific problems. As larger and more reusable datasets were created, requiring greater investment, the need for a more systematic approach to dataset creation arose to ensure increased quality. A range of statistical methods was adopted, often but not exclusively from the medical sciences, to ensure that the labels used were not subjective, or to choose among the different labels provided by the coders. A wide variety of such methods is now in regular use. This book provides a survey of the most widely used of these statistical methods supporting annotation practice. As far as the authors know, this is the first book attempting to cover the two families of methods in widest use. The first family is concerned with the development of labelling schemes and, in particular, with ensuring that such schemes allow sufficient agreement to be observed among the coders. The second family includes methods developed to analyze the output of coders once the scheme has been agreed upon, particularly although not exclusively to identify the most likely label for an item among those provided by the coders. The focus of this book is primarily on Natural Language Processing, the area of AI devoted to the development of models of language interpretation and production, but many if not most of the methods discussed here are also applicable to other areas of AI, or indeed, to other areas of Data Science.
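As a minimal illustration of the two families of methods surveyed in the book, the sketch below computes Cohen's kappa (a chance-corrected coefficient of agreement between two coders, representative of the first family) and a simple majority-vote label aggregation (a baseline instance of the second family) on hypothetical toy data. The function names and the example labels are assumptions for illustration, not code from the book.

```python
from collections import Counter

def cohens_kappa(coder1, coder2):
    """Chance-corrected agreement between two coders over the same items."""
    n = len(coder1)
    # Observed agreement: fraction of items on which the coders match
    observed = sum(a == b for a, b in zip(coder1, coder2)) / n
    # Expected (chance) agreement from each coder's marginal label distribution
    c1, c2 = Counter(coder1), Counter(coder2)
    labels = set(coder1) | set(coder2)
    expected = sum((c1[l] / n) * (c2[l] / n) for l in labels)
    return (observed - expected) / (1 - expected)

def majority_label(judgements):
    """Aggregate multiple coders' labels for one item by majority vote."""
    return Counter(judgements).most_common(1)[0][0]

# Toy example: two coders labelling six items with labels A/B
coder1 = ["A", "A", "B", "B", "A", "B"]
coder2 = ["A", "B", "B", "B", "A", "A"]
print(round(cohens_kappa(coder1, coder2), 4))  # 0.3333
print(majority_label(["A", "A", "B"]))         # A
```

Probabilistic annotation models such as those covered in the book's second part generalise the majority-vote step by weighting coders according to estimated reliability; the vote above is only the simplest baseline.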

Original language: English
Title of host publication: Synthesis Lectures on Human Language Technologies
Subtitle of host publication: Lecture #54
Publisher: Morgan and Claypool Publishers
Pages: 1-217
Number of pages: 217
Edition: 1
DOIs
Publication status: Published - 2022
Externally published: Yes

Publication series

Name: Synthesis Lectures on Human Language Technologies
Number: 1
Volume: 15
ISSN (Print): 1947-4040
ISSN (Electronic): 1947-4059

Bibliographical note

Publisher Copyright:
Copyright © 2022 by Morgan & Claypool.

Funding

Ron Artstein was sponsored by the U.S. Army Research Laboratory (ARL) under contract number W911NF-14-D-0005. Statements and opinions expressed and content included do not necessarily reflect the position or the policy of the government, and no official endorsement should be inferred. Silviu Paun and Massimo Poesio were supported by the DALI project, ERC Advanced Grant 695662 to Massimo Poesio.

Funders: Funder number
United States Army Research Laboratory: W911NF-14-D-0005
European Research Council: 695662

Keywords

• agreement
• coefficients of agreement
• corpus annotation
• latent models
• neural models for learning from the crowd
• probabilistic annotation models
• statistics
• variational autoencoders
