Estimating the extent of the effects of data quality through observations

Daniele Foroni, Matteo Lissandrini, Yannis Velegrakis

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > Academic > peer-review

Abstract

Existing data quality works have so far focused on the computation of many data characteristics as a means of quantifying different quality dimensions, such as freshness, consistency, accuracy, or completeness, all of which are defined with respect to some ideal (clean) dataset. We claim that this approach falls short of providing a full specification of the quality of the data, since it takes into consideration neither the task for which the data is to be used nor any future instances of the dataset. We argue that, apart from the difference from the clean dataset, it is equally important to know the degree to which such difference affects the results of the task at hand. Thus, we extend the existing data quality definition to include that degree. Our approach not only allows data quality to be considered in the context of the intended task, but can also provide useful information even in the absence of the clean dataset, and offer an understanding of the effect of data quality on future dataset instances. We describe a system and its implementation that computes this extended form of data quality through a principled approach of systematic noise generation and task result evaluation. We perform numerous experiments illustrating the effectiveness of the approach and how it allows contextualizing traditional data quality measures.
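
The idea of systematic noise generation followed by task result evaluation can be illustrated with a minimal sketch. This is not the authors' system: the noise type (randomly missing values), the task (a mean over the observed values), and all names (`inject_missing`, `task_mean`, `effect_of_noise`) are assumptions chosen only for illustration. The sketch injects noise at increasing rates into the data as given and records how far the task result moves, which requires no clean reference dataset.

```python
import random
import statistics


def inject_missing(values, rate, rng):
    """Return a copy of `values` where each entry becomes None with probability `rate`."""
    return [None if rng.random() < rate else v for v in values]


def task_mean(values):
    """Hypothetical task: mean of the observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    # If every value was dropped, the task has no defined result.
    return statistics.fmean(observed) if observed else float("nan")


def effect_of_noise(values, rates, trials=50, seed=0):
    """For each noise rate, average the absolute change in the task result
    relative to the result on the data as given (no clean dataset needed)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    baseline = task_mean(values)
    effects = {}
    for rate in rates:
        deviations = []
        for _ in range(trials):
            noisy = inject_missing(values, rate, rng)
            deviations.append(abs(task_mean(noisy) - baseline))
        effects[rate] = statistics.fmean(deviations)
    return effects


# Illustrative data only: the numbers 1..100 as floats.
data = [float(x) for x in range(1, 101)]
effects = effect_of_noise(data, rates=[0.0, 0.1, 0.3, 0.5])
for rate, effect in effects.items():
    print(f"missing rate {rate:.1f}: mean absolute change in task result = {effect:.3f}")
```

Under these assumptions, the mapping from noise rate to change in the task result plays the role of the "degree to which such difference affects the results" described in the abstract; a real setting would substitute the actual task and the noise types relevant to the quality dimension of interest.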

Original language: English
Title of host publication: 2021 IEEE 37th International Conference on Data Engineering (ICDE)
Publisher: IEEE Computer Society
Pages: 1913-1918
Number of pages: 6
ISBN (Electronic): 9781728191843
Publication status: Published - Apr 2021
Event: 37th IEEE International Conference on Data Engineering, ICDE 2021 - Virtual, Chania, Greece
Duration: 19 Apr 2021 to 22 Apr 2021

Publication series

Name: Proceedings - International Conference on Data Engineering
Volume: 2021-April
ISSN (Print): 1084-4627

Conference

Conference: 37th IEEE International Conference on Data Engineering, ICDE 2021
Country/Territory: Greece
City: Virtual, Chania
Period: 19/04/21 to 22/04/21

Keywords

  • Data cleaning
  • Data mining
  • Data quality
