Abstract
Background: One-class modelling (CM) is popular among chemometricians but not well known among omics scientists in general. One issue is that typical CM approaches, including soft independent modelling of class analogy (SIMCA), often give unsatisfactory results owing to, for example, large variation, centring and scaling issues, sparsity, outliers, and non-linearities in typical omics data. These effects can inflate the decision boundary of the target class, so that many non-target cases are accepted as false positives. Tree-based techniques are by nature resistant to these challenges. In this study we explore tree-based SIMCA variants in omics scenarios and compare them with existing strategies.

Results: We present a non-linear form of SIMCA that uses sample proximities obtained through Unsupervised Random Forest and Isolation Forest (termed URF-SIMCA and IF-SIMCA, respectively). We compare the accuracy of these algorithms with that of traditional SIMCA, one-class support vector machines, and isolation forest. The comparison was based on five previously published clinical omics datasets and the wine dataset. URF-SIMCA showed superior behaviour. Using pseudo-sampling principles, the features important for separating the target and non-target classes could be interpreted. Using the wine dataset, we show empirically that these relate directly to information obtained through two-class algorithms. Moreover, feature trajectories in the score and orthogonal distance spaces further aid interpretation of the model.

Significance: URF-SIMCA offers an easy-to-use extension of SIMCA that deflates the variance of the target class, allowing for better separation. The increased modelling performance comes at the cost of feature interpretation, but this can be addressed using the pseudo-sampling principle.
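The abstract does not spell out how the sample proximities are computed, so the following is only a minimal sketch of one common way to obtain them from an unsupervised random forest. It assumes the usual synthetic-data construction (a forest trained to separate real target samples from a copy with independently permuted columns) and uses scikit-learn. The function name `urf_proximity`, its parameters, and the toy data are hypothetical and are not taken from the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def urf_proximity(X_target, X_new=None, n_trees=200, seed=0):
    """Proximities from an unsupervised random forest (illustrative sketch).

    Assumption: the forest is trained to tell real target samples from a
    synthetic copy in which every column is permuted independently.
    """
    rng = np.random.default_rng(seed)
    # Synthetic class: permute each feature separately to break correlations.
    X_synth = np.column_stack([rng.permutation(col) for col in X_target.T])
    X_all = np.vstack([X_target, X_synth])
    y_all = np.r_[np.ones(len(X_target)), np.zeros(len(X_synth))]

    forest = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    forest.fit(X_all, y_all)

    # apply() returns the leaf index of every sample in every tree.
    leaves_target = forest.apply(X_target)  # shape (n_target, n_trees)
    leaves_new = leaves_target if X_new is None else forest.apply(X_new)

    # Proximity = fraction of trees in which two samples share a leaf.
    same_leaf = leaves_new[:, None, :] == leaves_target[None, :, :]
    return same_leaf.mean(axis=2)


# Toy data (hypothetical): 60 target samples and 10 shifted samples.
toy_rng = np.random.default_rng(1)
X_target = toy_rng.normal(size=(60, 5))
X_shifted = toy_rng.normal(loc=6.0, size=(10, 5))

P_train = urf_proximity(X_target)             # target vs. target
P_new = urf_proximity(X_target, X_shifted)    # new samples vs. target
```

A SIMCA-type model would then be built on a decomposition of the target-class proximities, with score and orthogonal distances computed for new samples. That step is not shown here because the abstract does not specify it.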
| Original language | English |
|---|---|
| Article number | 344889 |
| Journal | Analytica Chimica Acta |
| Volume | 1383 |
| DOIs | |
| Publication status | Published - 15 Jan 2026 |
Bibliographical note
Publisher Copyright: © 2025
Keywords
- Isolation forest
- Non-linear
- Omics
- Pseudosampling
- Soft independent modelling of class analogy
- Unsupervised random forest
- URF-SIMCA
Fingerprint
Dive into the research topics of 'Tree-based SIMCA for dealing with heterogeneous and sparse data'. Together they form a unique fingerprint.