Progressive Entity Matching: A Design Space Exploration

Jakub Maciejewski, Konstantinos Nikoletos, George Papadakis, Yannis Velegrakis

Research output: Contribution to journal, Article, Academic, peer-review

Abstract

Entity Resolution (ER) is typically implemented as a batch task that processes all available data before identifying duplicate records. However, applications with time or computational constraints, e.g., those running in the cloud, require a progressive approach that produces results in a pay-as-you-go fashion. Numerous algorithms for Progressive ER have been proposed in the literature. In this work, we propose a novel framework for Progressive Entity Matching that organizes the relevant techniques into four consecutive steps: (i) filtering, which reduces the search space to the most likely candidate matches, (ii) weighting, which associates every pair of candidate matches with a similarity score, (iii) scheduling, which prioritizes the execution of the candidate matches so that the real duplicates precede the non-matching pairs, and (iv) matching, which applies a complex matching function to the pairs in the order defined by the previous step. We associate each step with existing and novel techniques, illustrating that our framework generates a superset of the main existing works in the field. We select the most representative combinations resulting from our framework and fine-tune them over 10 established datasets for Record Linkage and 8 for Deduplication. Our results indicate that our taxonomy yields a wide range of high-performing progressive techniques in terms of both effectiveness and time efficiency.
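The four steps named in the abstract can be illustrated with a toy pipeline. The following is only a minimal sketch under stated assumptions, not the authors' implementation: token blocking stands in for filtering, Jaccard similarity for weighting, a descending sort for scheduling, and a character-level similarity from Python's `difflib` for the expensive matching function. All function names, the `budget` parameter, the `threshold` value, and the sample records are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def tokens(text):
    """Lowercased whitespace tokens of a record (assumed representation)."""
    return set(text.lower().split())


def filter_candidates(records):
    """Filtering: keep only pairs that share at least one token."""
    index = {}
    for rid, text in records.items():
        for tok in tokens(text):
            index.setdefault(tok, set()).add(rid)
    pairs = set()
    for ids in index.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def weight(records, pairs):
    """Weighting: score each candidate pair with Jaccard similarity."""
    scored = {}
    for a, b in pairs:
        ta, tb = tokens(records[a]), tokens(records[b])
        scored[(a, b)] = len(ta & tb) / len(ta | tb)
    return scored


def schedule(scored):
    """Scheduling: most promising pairs first, ties broken by pair id."""
    return sorted(scored, key=lambda pair: (-scored[pair], pair))


def match(records, ordered, budget, threshold=0.8):
    """Matching: run the costly comparison on the first `budget` pairs only."""
    for a, b in ordered[:budget]:
        sim = SequenceMatcher(None, records[a].lower(), records[b].lower()).ratio()
        if sim >= threshold:
            yield (a, b)


def progressive_matching(records, budget):
    """Chain the four steps and return the matches found within the budget."""
    pairs = filter_candidates(records)
    scored = weight(records, pairs)
    ordered = schedule(scored)
    return list(match(records, ordered, budget))


# Hypothetical sample data for a Deduplication-style run.
records = {
    1: "Apple iPhone 13 128GB",
    2: "apple iphone 13 128gb",
    3: "Samsung Galaxy S21",
    4: "Samsung Galaxy S21 5G",
    5: "Dell XPS 13 laptop",
}
```

Calling `progressive_matching(records, budget=2)` spends the matching budget on the two highest-weighted pairs, which is the pay-as-you-go behaviour the abstract describes: a larger budget examines more of the schedule, while a smaller one still reaches the likeliest duplicates first.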
Original language: English
Article number: 65
Number of pages: 25
Journal: Proceedings of the ACM on Management of Data
Volume: 3
Issue number: 1
Publication status: Published - 10 Feb 2025
