TY - CHAP
T1 - SynODC: Utilizing the Syntactic Structure for Outlier Detection in Categorical Attributes
AU - Zylinski, Arthur
AU - Qahtan, Abdulhakim A.
PY - 2024/8/22
Y1 - 2024/8/22
N2 - Outlier detection is a long-standing problem, as outliers significantly degrade data quality. Machine learning models trained on low-quality data tend to produce inaccurate decisions and poor predictions. While detecting outliers in numerical data has been extensively studied, few attempts have been made to detect outliers in attributes with categorical values. In this paper, we introduce SynODC for detecting categorical outliers in relational (tabular) datasets by utilizing the syntactic structure of the values. For a given attribute, SynODC identifies a set of patterns that represent the majority of the values as dominating patterns. Data values that do not match (i.e., cannot be generated by) one of the dominating patterns are declared as outliers. Our target is to construct, for each attribute, a minimal set of dominating patterns that is expressive enough to represent the different formats of the values in the attribute. To do that, we define a new distance metric that generalizes the Levenshtein distance to measure the distance between patterns. Using the new distance metric, SynODC combines similar patterns to maintain compact representations of the attributes. The experimental results on multiple real-world datasets demonstrate the effectiveness of SynODC in detecting syntactic outliers that cannot be detected by other data cleaning tools.
AB - Outlier detection is a long-standing problem, as outliers significantly degrade data quality. Machine learning models trained on low-quality data tend to produce inaccurate decisions and poor predictions. While detecting outliers in numerical data has been extensively studied, few attempts have been made to detect outliers in attributes with categorical values. In this paper, we introduce SynODC for detecting categorical outliers in relational (tabular) datasets by utilizing the syntactic structure of the values. For a given attribute, SynODC identifies a set of patterns that represent the majority of the values as dominating patterns. Data values that do not match (i.e., cannot be generated by) one of the dominating patterns are declared as outliers. Our target is to construct, for each attribute, a minimal set of dominating patterns that is expressive enough to represent the different formats of the values in the attribute. To do that, we define a new distance metric that generalizes the Levenshtein distance to measure the distance between patterns. Using the new distance metric, SynODC combines similar patterns to maintain compact representations of the attributes. The experimental results on multiple real-world datasets demonstrate the effectiveness of SynODC in detecting syntactic outliers that cannot be detected by other data cleaning tools.
U2 - 10.1007/978-3-031-70359-1_13
DO - 10.1007/978-3-031-70359-1_13
M3 - Chapter
SN - 978-3-031-70358-4
T3 - Lecture Notes in Computer Science
SP - 213
EP - 229
BT - ECML PKDD 2024: Machine Learning and Knowledge Discovery in Databases
PB - Springer
ER -