Challenges in Reproducing Human Evaluation Results for Role-Oriented Dialogue Summarization

Takumi Ito, Qixiang Fang, Pablo Mosteiro Romero*, Albert Gatt, Kees van Deemter

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Academic › peer-review

Abstract

There is growing concern about the reproducibility of human evaluation studies in NLP. As part of the ReproHum campaign, we assessed the reproducibility of a recent human evaluation study in NLP. Specifically, we attempted to reproduce a human evaluation of a novel approach to enhancing Role-Oriented Dialogue Summarization by considering the influence of role interactions. Despite our best efforts to adhere to the reported setup, we were unable to reproduce the statistical results presented in the original paper. While we found no contradictory evidence, our study raises questions about the validity of the reported statistical significance results and/or the comprehensiveness with which the original study was reported. In this paper, we provide a comprehensive account of our reproduction study, detailing the methodology employed and the data collection and analysis procedures. We discuss the implications of our findings for the broader issue of reproducibility in NLP research. Our findings serve as a cautionary reminder of the challenges of conducting reproducible human evaluations and should prompt further discussion within the NLP community.
Original language: English
Title of host publication: The 3rd Workshop on Human Evaluation of NLP Systems (HumEval’23)
Publisher: Association for Computational Linguistics
Number of pages: 27
Publication status: Published - 15 Aug 2023

