Abstract
Existing work on data quality has so far focused on computing many data characteristics as a means of quantifying different quality dimensions, such as freshness, consistency, accuracy, or completeness, all of which are defined with respect to some ideal (clean) dataset. We claim that this approach falls short of providing a full specification of the quality of the data, since it takes into consideration neither the task for which the data is to be used nor any future instances of the dataset. We argue that, apart from the difference from the clean dataset, it is equally important to know the degree to which that difference affects the results of the task at hand. Thus, we extend the existing data quality definition to include that degree. Our approach not only allows data quality to be considered in the context of the intended task, but can also provide useful information even in the absence of the clean dataset and offer an understanding of the effect of data quality in future dataset instances. We describe a system, and its implementation, that computes this extended form of data quality through a principled approach of systematic noise generation and task result evaluation. We perform numerous experiments illustrating the effectiveness of the approach and how it allows traditional data quality measures to be contextualized.
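The idea of systematic noise generation followed by task result evaluation can be illustrated with a minimal sketch. It is not the system described in the paper: the noise model (randomly dropping values), the task (a mean), and the names `inject_missing`, `mean_task`, and `task_effect` are all assumptions chosen for illustration. The sketch injects increasing amounts of noise into a dataset and records how far the task result moves from the result on the unmodified data.

```python
import random
import statistics


def inject_missing(values, rate, rng):
    """Return a copy of `values` with round(rate * len(values)) entries set to None.

    Assumed noise model: values go missing uniformly at random.
    """
    k = round(rate * len(values))
    dropped = set(rng.sample(range(len(values)), k))
    return [None if i in dropped else v for i, v in enumerate(values)]


def mean_task(values):
    """Hypothetical task: the mean of the non-missing values.

    Raises statistics.StatisticsError if every value is missing.
    """
    present = [v for v in values if v is not None]
    return statistics.fmean(present)


def task_effect(values, task, rate, trials=50, seed=0):
    """Average absolute change in the task result at a given noise rate."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same numbers
    baseline = task(values)
    diffs = [
        abs(task(inject_missing(values, rate, rng)) - baseline)
        for _ in range(trials)
    ]
    return statistics.fmean(diffs)


# Toy dataset; it stands in for whatever instance is available, clean or not.
data = [float(x) for x in range(1, 101)]
effects = {rate: task_effect(data, mean_task, rate) for rate in (0.0, 0.1, 0.3, 0.5)}

for rate, effect in effects.items():
    print(f"noise rate {rate:.1f}: mean shifts by {effect:.3f} on average")
```

Under these assumptions, the mapping from noise rate to result deviation is the task-specific part of the quality description: a task whose result barely moves as noise grows tolerates a dataset that a traditional completeness measure would rate poorly.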
| Original language | English |
|---|---|
| Title of host publication | 2021 IEEE 37th International Conference on Data Engineering (ICDE) |
| Publisher | IEEE |
| Pages | 1913-1918 |
| Number of pages | 6 |
| ISBN (Electronic) | 9781728191843 |
| Publication status | Published - Apr 2021 |
| Event | 37th IEEE International Conference on Data Engineering, ICDE 2021 - Virtual, Chania, Greece. Duration: 19 Apr 2021 → 22 Apr 2021 |
Publication series
| Name | Proceedings - International Conference on Data Engineering |
|---|---|
| Volume | 2021-April |
| ISSN (Print) | 1084-4627 |
Conference
| Conference | 37th IEEE International Conference on Data Engineering, ICDE 2021 |
|---|---|
| Country/Territory | Greece |
| City | Virtual, Chania |
| Period | 19/04/21 → 22/04/21 |
Bibliographical note
Publisher Copyright: © 2021 IEEE.
Keywords
- Data cleaning
- Data mining
- Data quality