
Phrase Detectives Corpus 1.0: crowdsourced anaphoric coreference

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; Academic; peer-review

Abstract

Natural Language Engineering tasks require large and complex annotated datasets to build more advanced models of language. Corpora are typically annotated by several experts to create a gold standard; however, there are now compelling reasons to use a non-expert crowd to annotate text, driven by cost, speed and scalability. Phrase Detectives Corpus 1.0 is an anaphorically annotated corpus of encyclopedic and narrative text that contains a gold standard created by multiple experts, as well as a set of annotations created by a large non-expert crowd. Analysis shows very good inter-expert agreement (κ = .88–.93) but a more variable baseline crowd agreement (κ = .52–.96). Encyclopedic texts show less agreement (and by implication are harder to annotate) than narrative texts. The release of this corpus is intended to encourage research into the use of crowds for text annotation and the development of more advanced, probabilistic language models, in particular for anaphoric coreference.

Original language: English
Title of host publication: Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016
Editors: Nicoletta Calzolari, Khalid Choukri, Helene Mazo, Asuncion Moreno, Thierry Declerck, Sara Goggi, Marko Grobelnik, Jan Odijk, Stelios Piperidis, Bente Maegaard, Joseph Mariani
Publisher: European Language Resources Association (ELRA)
Pages: 2039-2046
Number of pages: 8
ISBN (Electronic): 9782951740891
Publication status: Published - 2016
Externally published: Yes
Event: 10th International Conference on Language Resources and Evaluation, LREC 2016 - Portoroz, Slovenia
Duration: 23 May 2016 to 28 May 2016

Publication series

Name: Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016

Conference

Conference: 10th International Conference on Language Resources and Evaluation, LREC 2016
Country/Territory: Slovenia
City: Portoroz
Period: 23/05/16 to 28/05/16

Funding

The creation of the original game was funded by EPSRC project AnaWiki, EP/F00575X/1. Dr Chamberlain would also like to acknowledge the EPSRC Doctoral Training Award that enabled the analysis of the corpus.

Funder: Engineering and Physical Sciences Research Council
Funder number: EP/F00575X/1

    Keywords

    • Anaphora
    • Anaphoric coreference
    • Annotation
    • Corpora
    • Crowdsourcing
    • Games-with-a-purpose
    • GWAP
    • Phrase Detectives
