Abstract
Deidentifying sensitive information in electronic health records (EHRs) is increasingly important as legal data-privacy obligations evolve alongside the need to protect patient and institutional confidentiality. This study compares the performance of two state-of-the-art deidentification systems, Deduce and Deidentify, on both real-world and synthetic Dutch medical texts, providing insight into their relative strengths and limitations in preserving privacy while maintaining data utility. We employ a replication-extension research design with two distinct datasets: (1) the Annotation-Based Dataset from the Utrecht University Medical Center (UMC Utrecht), comprising manually annotated patient records spanning 1987 to 2021, and (2) the Synthetic Dataset, generated using a two-step process involving OpenAI’s GPT-4 model. Precision, recall, and F1 scores serve as the evaluation metrics. Our findings indicate that the performance of both techniques varies across the entity types to be deidentified. Deduce outperforms Deidentify in overall accuracy by a margin of 0.42 on the synthetic dataset. On the real-world annotation-based dataset, the generalization ability of Deidentify is lower than that of Deduce by 0.2. However, the performance of both techniques is affected by the limitations of the datasets. In conclusion, this study provides a comparative assessment of Deduce and Deidentify for deidentifying Dutch EHRs, contributing to the development of more effective privacy preservation techniques in the healthcare domain.
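The evaluation metrics named in the abstract can be illustrated with a minimal sketch. It assumes exact-match, entity-level scoring over `(start, end, label)` spans; the abstract does not state whether the study scores at entity or token level, and the function name `precision_recall_f1` and the example spans are hypothetical.

```python
def precision_recall_f1(gold, predicted):
    """Entity-level precision, recall, and F1 under exact span matching.

    Assumption: each annotation is a (start, end, label) tuple, and a
    prediction counts as correct only if all three fields match a gold span.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)

    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0

    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations for one record (character offsets and labels).
gold_spans = {(0, 10, "PERSON"), (25, 35, "DATE"), (50, 58, "LOCATION")}
predicted_spans = {(0, 10, "PERSON"), (25, 35, "DATE"),
                   (60, 65, "PERSON"), (70, 80, "DATE")}

precision, recall, f1 = precision_recall_f1(gold_spans, predicted_spans)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

For deidentification, recall is usually the more critical of the two: a missed entity leaves identifying information in the text, whereas a false positive only removes some non-sensitive content.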
Original language | English |
---|---|
Article number | 1636 |
Journal | Electronics (Switzerland) |
Volume | 14 |
Issue number | 8 |
DOIs | |
Publication status | Published - 18 Apr 2025 |
Bibliographical note
Publisher Copyright: © 2025 by the authors.
Keywords
- Dutch medical records
- deep learning methods
- machine learning
- named entity recognition
- natural language processing
- privacy information