Introducing CAD: the Contextual Abuse Dataset

Bertie Vidgen, Dong Nguyen, Helen Margetts, Patricia Rossini, Rebekah Tromble

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Academic › Peer-reviewed

    Abstract

    Online abuse can inflict harm on users and communities, making online spaces unsafe and toxic. Progress in automatically detecting and classifying abusive content is often held back by the lack of high-quality, detailed datasets. We introduce a new dataset of primarily English Reddit entries which addresses several limitations of prior work. It (1) contains six conceptually distinct primary categories as well as secondary categories, (2) has labels annotated in the context of the conversation thread, (3) contains rationales, and (4) uses an expert-driven group-adjudication process for high-quality annotations. We report several baseline models to benchmark the work of future researchers. The annotated dataset, annotation guidelines, models and code are freely available.
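    The abstract mentions baseline models for this multi-class task. As a minimal sketch only (the real CAD labels, data, and baselines are not reproduced here; the category names and toy examples below are hypothetical), a majority-class baseline for a six-way labelling scheme could look like:

    ```python
    from collections import Counter

    # Toy labelled entries; these hypothetical category names only
    # illustrate a multi-class abuse-detection setup, not the real dataset.
    train_labels = [
        "neutral", "neutral", "neutral",
        "identity-directed", "person-directed", "affiliation-directed",
    ]
    test_labels = ["neutral", "identity-directed", "neutral"]

    # Majority-class baseline: always predict the most frequent training label.
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    predictions = [majority_label for _ in test_labels]

    # Accuracy of the constant prediction against the toy test labels.
    accuracy = sum(p == t for p, t in zip(predictions, test_labels)) / len(test_labels)
    print(majority_label, round(accuracy, 2))
    ```

    Such a constant-prediction baseline is a common floor against which learned classifiers are compared.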
    Original language: English
    Title of host publication: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
    Editors: Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, Yichao Zhou
    Publisher: Association for Computational Linguistics
    Pages: 2289-2303
    Number of pages: 15
    DOIs
    Publication status: Published - Jun 2021
