Auto-Configuring Entity Resolution Pipelines

  • Konstantinos Nikoletos*
  • Vasilis Efthymiou
  • George Papadakis
  • Kostas Stefanidis

*Corresponding author for this work

National and Kapodistrian University of Athens

Research output: Contribution to journal › Article › Academic › peer-review

Abstract

The same real-world entity (e.g., a movie, a restaurant, a person) may be described in various ways on different datasets. Entity Resolution (ER) aims to find such different descriptions of the same entity, this way improving data quality and, therefore, data value. However, an ER pipeline typically involves several steps (e.g., blocking, similarity estimation, clustering), with each step requiring its own configurations and tuning. The choice of the best configuration, among a vast number of possible combinations, is a dataset-specific and labor-intensive task both for novice and expert users, while it often requires some ground truth knowledge of real matches. In this work, we examine ways of automatically configuring a state-of-the-art end-to-end ER pipeline based on pre-trained language models under two settings: i) When ground truth is available. In this case, sampling strategies that are typically used for hyperparameter optimization can significantly restrict the search of the configuration space. We experimentally compare their relative effectiveness and time efficiency, applying them to ER pipelines for the first time. ii) When no ground truth is available. In this case, labelled data extracted from other datasets with available ground truth can be used to train a regression model that predicts the relative effectiveness of parameter configurations. Experimenting with 11 ER benchmark datasets, we evaluate the relative performance of existing techniques that address each problem, but have not been applied to ER before.
Original language: English
Pages (from-to): 155367-155384
Number of pages: 18
Journal: IEEE Access
Volume: 13
Publication status: Published - 2025
Externally published: Yes

Bibliographical note

Publisher Copyright:
© 2013 IEEE.

Funding

This work was supported in part by the EU Project STELAR (Horizon Europe) under Grant 101070122.

Funders                                Funder number
European Commission
HORIZON EUROPE Framework Programme     101070122

    Keywords

    • AutoML
    • Data management
    • entity resolution
