Pseudo labeling and classification of high-dimensional data using visual analytics

Bárbara Caroline Benato

Research output: Thesis, Doctoral thesis 1 (Research UU / Graduation UU)

Abstract

Machine learning (ML) works with data consisting of tens up to tens of thousands of measurements (dimensions) per sample. As the number of dimensions and/or samples grows, so does the difficulty of understanding such data and, relatedly, of understanding how to design ML pipelines that effectively process such data for tasks such as classification. Visualization, and in particular Visual Analytics (VA), has emerged as one of the key approaches that help practitioners understand high-dimensional data and carry out ML engineering tasks. This thesis studies several novel approaches by which VA can help ML (and conversely), as follows. Our work focuses on a visualization technique called dimensionality reduction, or projection, which efficiently and effectively handles large amounts of high-dimensional data. On the ML side, we consider the task of training a typical classifier in the challenging context where only a small number of ground-truth labels is available. We first propose a pseudo-labeling approach that explores the ability of projections to generate a reduced feature space with enough information to improve feature learning and classifier performance over iterations. We show that the 2D space generated by projections can capture the data structure present in high dimensions well enough to support the design of high-performance feature and classifier learning models. Secondly, we link data separation (DS), visual separation (VS), and classifier performance (CP) by pseudo-labeling and projections. We use feature spaces with high DS as input to compute high-VS projections. We use these projections to perform pseudo-labeling with high propagation accuracies. We then use such labels to train classifiers with a high CP. We show that the high-DS, high-VS, high-CP implication holds for several types of projection techniques. Hence, such projection techniques are suitable for the task of classifier engineering.
Thirdly, we exploit the aforementioned observation that high VS and high CP are correlated to propose a metric to assess the VS of labeled 2D scatterplots produced by projection techniques. Our metric computes the accuracy of label propagation in the projection space, which is simple and fast to compute. We show that high propagation accuracies match a high VS as assessed by human subjects. Finally, we join all our contributions to incorporate the user in the ML engineering process. We propose an interactive VA tool that assists users in manually labeling samples by providing additional information in terms of classifier decision boundary maps, projection errors, and inverse projection errors. Our results show that this approach enables users to quickly generate labeled samples that lead to higher classification performance after a few labeling iterations. This contribution shows that both algorithms and humans can exploit projections to build better classifiers.
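
The idea of scoring a labeled 2D scatterplot by label-propagation accuracy can be illustrated with a minimal sketch. It is not the thesis implementation: the abstract does not name the propagation algorithm, so nearest-seed propagation is used here as a simple stand-in. The function names (`propagate_labels`, `propagation_accuracy`, `make_blobs`) and the Gaussian toy data are likewise assumptions made for illustration only.

```python
import math
import random


def propagate_labels(points, seeds):
    """Assign each 2D point the label of its nearest labeled seed.

    points: list of (x, y) tuples in the projection space.
    seeds:  dict mapping point index -> known label.
    Nearest-seed propagation is an assumed stand-in, not the thesis method.
    """
    labels = []
    for i, p in enumerate(points):
        if i in seeds:
            labels.append(seeds[i])
            continue
        nearest = min(seeds, key=lambda j: math.dist(p, points[j]))
        labels.append(seeds[nearest])
    return labels


def propagation_accuracy(points, true_labels, seeds):
    """Fraction of unlabeled points whose propagated label matches the truth."""
    predicted = propagate_labels(points, seeds)
    unlabeled = [i for i in range(len(points)) if i not in seeds]
    correct = sum(predicted[i] == true_labels[i] for i in unlabeled)
    return correct / len(unlabeled)


def make_blobs(centers, n_per_class, spread, rng):
    """Hypothetical toy stand-in for a 2D projection: one Gaussian blob per class."""
    points, labels = [], []
    for label, (cx, cy) in enumerate(centers):
        for _ in range(n_per_class):
            points.append((rng.gauss(cx, spread), rng.gauss(cy, spread)))
            labels.append(label)
    return points, labels


rng = random.Random(0)
# Well-separated blobs mimic high visual separation; overlapping blobs mimic low.
separated, y = make_blobs([(0, 0), (20, 20)], 50, 1.0, rng)
overlapping, _ = make_blobs([(0, 0), (0.5, 0.5)], 50, 1.0, rng)
seeds = {0: 0, 1: 0, 50: 1, 51: 1}  # two labeled samples per class

acc_separated = propagation_accuracy(separated, y, seeds)
acc_overlapping = propagation_accuracy(overlapping, y, seeds)
```

Under these assumptions, the well-separated scatterplot should obtain a higher propagation accuracy than the overlapping one, which is the sense in which propagation accuracy can act as a proxy for visual separation.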
Original language: English
Qualification: Doctor of Philosophy
Awarding Institution
  • Utrecht University
Supervisors/Advisors
  • Telea, Alex, Supervisor
  • Falcão, A.X., Supervisor, External person
Award date: 9 Jul 2024
Place of Publication: Utrecht
Publication status: Published - 9 Jul 2024

Keywords

  • machine learning
  • pseudo labeling
  • classifier performance
  • dimensionality reduction
  • multi-dimensional projections
  • high-dimensional data
  • quality of projections
  • manual labeling
  • active learning
  • visual analytics
