Abstract
This paper proposes a robust method that identifies sets of points that collectively deviate from typical patterns in a dataset, which it calls “outlier sets”, while excluding individual points from detection. This new methodology, Outlier Set Two-step Identification (OSTI) employs a two-step approach to detect and label these outlier sets. First, it uses Gaussian Mixture Models for probabilistic clustering, identifying candidate outlier sets based on cluster weights below a hyperparameter threshold. Second, OSTI measures the Inter-cluster Mahalanobis distance between each candidate outlier set's centroid and the overall dataset mean. OSTI then tests the null hypothesis that this distance does not significantly differ from its theoretical chi-square distribution, enabling the formal detection of outlier sets. We test OSTI systematically on 8000 synthetic 2D datasets across various inlier configurations and thousands of possible outlier set characteristics. Results show OSTI robustly and consistently detects outlier sets with an average F1 score of 0.92 and an average purity (the degree to which outlier sets identified correspond to those generated synthetically, i.e., our ground truth) of 98.58 %. We also compare OSTI with state-of-the-art outlier detection methods, to illuminate how OSTI fills a gap as a tool for the exclusive detection of outlier sets.
| Original language | English |
|---|---|
| Article number | 114274 |
| Number of pages | 18 |
| Journal | Knowledge-Based Systems |
| Volume | 329 |
| Early online date | 15 Aug 2025 |
| DOIs | |
| Publication status | Published - 4 Nov 2025 |
Bibliographical note
Publisher Copyright:© 2025 Elsevier B.V.
Keywords
- Gaussian mixture models
- Inter-cluster Mahalanobis distance
- Outlier set two-step identification (OSTI)
- Outlier sets