Investigating De-Identification Methodologies in Dutch Medical Texts: A Replication Study of Deduce and Deidentify

Pablo Mosteiro Romero*, Ruilin Wang, F.E. Scheepers, Marco Spruit

*Corresponding author for this work

Research output: Contribution to journal › Article › Academic › peer-review

Abstract

Deidentifying sensitive information in electronic health records (EHRs) is increasingly important as legal obligations around data privacy evolve along with the need to protect patient and institutional confidentiality. This study aims to comparatively evaluate the performance of two state-of-the-art deidentification systems, Deduce and Deidentify, on both real-world and synthetic Dutch medical texts, thereby providing insights into their relative strengths and limitations in preserving privacy while maintaining data utility. We employ a replication-extension research design, utilizing two distinct datasets: (1) the Annotation-Based Dataset from the Utrecht University Medical Center (UMC Utrecht), comprising manually annotated patient records spanning 1987 to 2021, and (2) the Synthetic Dataset, generated using a two-step process involving OpenAI’s GPT-4 model. Utilizing precision, recall, and $F_1$ scores as evaluation metrics, we uncover the relative strengths and limitations of the two methods. Our findings indicate that both techniques show variable performance across the different entity types to be deidentified. Deduce outperforms Deidentify in overall accuracy by a margin of 0.42 on the synthetic dataset. On the real-world annotation-based dataset, the generalization ability of Deidentify is lower than that of Deduce by 0.2. However, the performance of both techniques is affected by the limitations of the datasets. In conclusion, this study provides valuable insights into the comparative performance of Deduce and Deidentify for deidentifying Dutch EHRs, contributing to the development of more effective privacy preservation techniques in the healthcare domain.
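
As an illustration of the evaluation metrics named in the abstract, the following minimal sketch computes precision, recall, and F1 from gold and predicted entity annotations. It assumes exact-match scoring on (start, end, label) triples; the abstract does not state the matching criteria used in the study, and the function name and example spans are hypothetical.

```python
from typing import Iterable, Tuple

# Assumed representation: (start offset, end offset, entity label).
Span = Tuple[int, int, str]


def precision_recall_f1(
    gold: Iterable[Span], predicted: Iterable[Span]
) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 over annotated spans."""
    gold_set, pred_set = set(gold), set(predicted)
    # A prediction counts as correct only if offsets and label all match.
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical annotations for a single document.
gold = [(0, 10, "PERSON"), (25, 35, "DATE"), (50, 57, "LOCATION")]
predicted = [(0, 10, "PERSON"), (25, 35, "DATE"), (60, 66, "PERSON"), (70, 75, "DATE")]

p, r, f = precision_recall_f1(gold, predicted)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

In a deidentification setting recall is usually the more safety-critical number, since a missed entity leaves identifying information in the text, whereas a false positive only removes non-sensitive content.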

Original language: English
Article number: 1636
Journal: Electronics (Switzerland)
Volume: 14
Issue number: 8
Publication status: Published - 18 Apr 2025

Bibliographical note

Publisher Copyright:
© 2025 by the authors.

Keywords

  • Dutch medical records
  • deep learning methods
  • machine learning
  • named entity recognition
  • natural language processing
  • privacy information
