SynODC: Utilizing the Syntactic Structure for Outlier Detection in Categorical Attributes

Arthur Zylinski, Abdulhakim A. Qahtan*

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding, Chapter, Academic, peer-review

Abstract

Outlier detection is a long-standing problem, and outliers significantly degrade data quality. Machine learning models trained on low-quality data tend to produce inaccurate decisions and poor predictions. While detecting outliers in numerical data has been extensively studied, few attempts have been made to detect outliers in attributes with categorical values. In this paper, we introduce SynODC for detecting categorical outliers in relational (tabular) datasets by utilizing the syntactic structure of the values. For a given attribute, SynODC identifies a set of patterns that represent the majority of the values as dominating patterns. Data values that do not match (i.e., cannot be generated by) one of the dominating patterns are declared outliers. Our target is to construct, for each attribute, a minimal set of dominating patterns that is expressive enough to represent the different formats of the values in the attribute. To do that, we define a new distance metric that generalizes the Levenshtein distance to measure the distance between patterns. Using the new distance metric, SynODC combines similar patterns to maintain compact representations of the attributes. Experimental results on multiple real-world datasets demonstrate the effectiveness of SynODC in detecting syntactic outliers that cannot be detected by other data cleaning tools.
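
The idea of dominating patterns can be illustrated with a minimal Python sketch. This is not the SynODC implementation: the pattern alphabet (`U`, `L`, `D`), the `min_support` threshold, and the function names `to_pattern`, `levenshtein`, and `find_outliers` are assumptions made for illustration, and the distance shown is the plain Levenshtein distance rather than the generalized metric defined in the paper. The sketch also omits the step that combines similar patterns.

```python
from collections import Counter


def to_pattern(value: str) -> str:
    """Map a value to a coarse syntactic pattern (hypothetical encoding).

    U = uppercase letter, L = lowercase letter, D = digit; any other
    character is kept as-is. Runs of the same symbol are collapsed.
    """
    symbols = []
    for ch in value:
        if ch.isdigit():
            sym = "D"
        elif ch.isupper():
            sym = "U"
        elif ch.islower():
            sym = "L"
        else:
            sym = ch
        # Collapse consecutive repeats so "AB" and "ABC" share pattern "U".
        if not symbols or symbols[-1] != sym:
            symbols.append(sym)
    return "".join(symbols)


def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance between two pattern strings."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]


def find_outliers(values, min_support=0.3):
    """Return values whose pattern is not dominating.

    A pattern is treated as dominating when it covers at least
    `min_support` of the values (an assumed, simplified criterion).
    """
    patterns = [to_pattern(v) for v in values]
    counts = Counter(patterns)
    dominating = {p for p, c in counts.items() if c / len(values) >= min_support}
    return [v for v, p in zip(values, patterns) if p not in dominating]


codes = ["AB-123", "CD-456", "EF-789", "GH-012", "ab123"]
print(find_outliers(codes))
print(levenshtein(to_pattern("AB-123"), to_pattern("ab123")))
```

Here four of the five values share one pattern, so the remaining value is flagged; the distance between the two patterns indicates how far the flagged format is from the dominating one.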
Original language: English
Title of host publication: ECML PKDD 2024: Machine Learning and Knowledge Discovery in Databases
Subtitle of host publication: Research Track
Publisher: Springer
Pages: 213-229
ISBN (Electronic): 978-3-031-70359-1
ISBN (Print): 978-3-031-70358-4
Publication status: Published - 22 Aug 2024

Publication series

Name: Lecture Notes in Computer Science
Volume: 14944
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349
