Tree-based SIMCA for dealing with heterogeneous and sparse data

  • Robert van Vorstenbosch*
  • Frederik Jan van Schooten
  • Zlatan Mujagic
  • Agnieszka Smolinska

*Corresponding author for this work

Research output: Contribution to journal › Article › Academic › peer-review

Abstract

Background: One-class modelling (CM) is popular among chemometricians but not well known among omics scientists in general. One issue is that typical CM approaches, including SIMCA, often perform unsatisfactorily on typical omics data because of, for example, large variation, centring and scaling issues, sparsity, outliers, and non-linearities. These effects can inflate the decision boundary of the target class, so that many non-target cases are accepted as false positives. Tree-based techniques are by nature resistant to these challenges. In this study we explore tree-based SIMCA variants in omics scenarios and compare them with existing strategies.

Results: We present a non-linear form of SIMCA that makes use of sample proximities obtained through Unsupervised Random Forest and Isolation Forest (termed URF-SIMCA and IF-SIMCA). We compare the accuracy of these algorithms with that of (traditional) SIMCA, one-class support vector machines, and isolation forest. This comparison was based on five previously published clinical omics datasets and the wine dataset. URF-SIMCA showed superior behaviour. Using pseudo-sampling principles, the features that are important for separating the target and non-target classes could be interpreted. Using the wine dataset, we show empirically that these interpretations relate directly to information obtained through two-class algorithms. Moreover, feature trajectories in the score and orthogonal distance spaces further enable interpretability of the model.

Significance: URF-SIMCA offers an easy-to-use extension of SIMCA that deflates the variance of the target class, allowing for better separation. The increased modelling performance comes at the cost of feature interpretation, but this can be tackled using the pseudo-sampling principle.
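The abstract gives no implementation details, so the following is only a rough sketch of the general idea behind a proximity-based SIMCA, not the authors' method. It assumes a common construction of an unsupervised random forest (real samples versus column-permuted copies), defines proximity as the fraction of trees in which two samples share a leaf, and fits a PCA on the target-class proximities to obtain score and orthogonal distances. The function names (`fit_urf`, `proximities`, `distances`, `combined`), the number of components, the training split, and the empirical 95% cut-off are all illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier


def fit_urf(X_target, n_trees=200, seed=0):
    """Assumed URF construction: classify real rows vs. column-permuted copies."""
    rng = np.random.default_rng(seed)
    # Permuting each column independently destroys the dependence between features.
    X_synth = np.column_stack([rng.permutation(col) for col in X_target.T])
    X_all = np.vstack([X_target, X_synth])
    y_all = np.r_[np.ones(len(X_target), dtype=int), np.zeros(len(X_synth), dtype=int)]
    return RandomForestClassifier(n_estimators=n_trees, random_state=seed).fit(X_all, y_all)


def proximities(forest, X, X_ref):
    """Fraction of trees in which each row of X shares a leaf with each row of X_ref."""
    leaves = forest.apply(X)          # shape (n_samples, n_trees)
    leaves_ref = forest.apply(X_ref)  # shape (n_ref, n_trees)
    return (leaves[:, None, :] == leaves_ref[None, :, :]).mean(axis=2)


def distances(pca, P):
    """Score distance (inside the PCA model) and orthogonal distance (residual)."""
    T = pca.transform(P)
    sd = np.sum(T**2 / pca.explained_variance_, axis=1)
    od = np.sum((P - pca.inverse_transform(T)) ** 2, axis=1)
    return sd, od


# Wine data: cultivar 0 plays the target class, the other cultivars are non-target.
X, y = load_wine(return_X_y=True)
target, non_target = X[y == 0], X[y != 0]
idx = np.random.default_rng(0).permutation(len(target))
train, held_out = target[idx[:40]], target[idx[40:]]

forest = fit_urf(train)
P_train = proximities(forest, train, train)
pca = PCA(n_components=2).fit(P_train)  # two components is an arbitrary choice

sd_tr, od_tr = distances(pca, P_train)


def combined(sd, od):
    """Sum of both distances, each scaled by its training mean (an assumption)."""
    return sd / sd_tr.mean() + od / od_tr.mean()


# Empirical 95% cut-off on the training scores; an assumption made for brevity.
cutoff = np.quantile(combined(sd_tr, od_tr), 0.95)

P_held = proximities(forest, held_out, train)
P_non = proximities(forest, non_target, train)
accept_held = combined(*distances(pca, P_held)) <= cutoff
accept_non = combined(*distances(pca, P_non)) <= cutoff

print("held-out target accepted:", accept_held.mean())
print("non-target accepted:", accept_non.mean())
```

One known simplification: training samples have a self-proximity of 1 that new samples can never have, so training distances are somewhat optimistic. A more careful version would estimate the cut-off from out-of-bag or cross-validated proximities.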

Original language: English
Article number: 344889
Journal: Analytica Chimica Acta
Volume: 1383
Publication status: Published - 15 Jan 2026

Bibliographical note

Publisher Copyright:
© 2025

Keywords

  • Isolation forest
  • Non-linear
  • Omics
  • Pseudosampling
  • Soft independent modelling of class analogy
  • Unsupervised random forest
  • URF-SIMCA
