Revealing Subgroups That Differ in Common and Distinctive Variation in Multi-Block Data: Clusterwise Sparse Simultaneous Component Analysis

S. Yuan, K. de Roover, M. Dufner, J.J.A. Denissen, K. van Deun

Research output: Contribution to journal > Article > Academic > peer-review

Abstract

Social and behavioral studies more and more often yield multi-block data, which consist of novel blocks of data (e.g., data from wearable devices) and traditional blocks of data (e.g., survey data) collected from the same sample. Multi-block data offer researchers valuable insights into complex social mechanisms, where several influences act together. Yet such mechanisms are likely to differ among subgroups. Hence, fully revealing the composite mechanisms underlying multi-block data is challenging, since proper clustering analysis of such data requires methods that simultaneously detect the covariation of variables underlying all data blocks and the group differences therein. Additionally, the methods should be able to handle high-dimensional datasets, which might include many irrelevant variables. Here, we present Clusterwise Sparse Simultaneous Component Analysis (CSSCA), a method that groups the subjects that are driven by the same mechanisms and, at the same time, extracts cluster-specific components that model these mechanisms. By imposing structure constraints, CSSCA further distinguishes common mechanisms that underlie all data blocks from distinctive mechanisms that only underlie one or a few data blocks. In extensive simulations, CSSCA delivered convincing results in recovering the clusters and their associated component structures across various conditions. More importantly, CSSCA showed a clear advantage over existing methods when substantial cluster differences in the component structure were present. We demonstrated the usefulness of CSSCA in an application to data stemming from a study on personality.
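The distinction the abstract draws between common and distinctive mechanisms can be made concrete with a small sketch. It is not the CSSCA algorithm and not part of the ClusterSSCA package. It only illustrates the terminology, assuming a sparse loading matrix has already been estimated for one cluster. The function name `classify_components`, the example loadings, and the block layout are all hypothetical.

```python
from typing import List, Sequence


def classify_components(loadings: Sequence[Sequence[float]],
                        block_sizes: Sequence[int],
                        tol: float = 1e-12) -> List[str]:
    """Label each component as common or distinctive (illustration only).

    loadings: rows are variables (data blocks stacked in order),
              columns are components; zeros encode sparsity.
    block_sizes: number of variables in each block, in the same order.
    """
    if sum(block_sizes) != len(loadings):
        raise ValueError("block_sizes must sum to the number of variables")

    n_components = len(loadings[0])
    labels = []
    for k in range(n_components):
        # Collect the blocks in which component k has a nonzero loading.
        active_blocks = []
        start = 0
        for b, size in enumerate(block_sizes):
            rows = loadings[start:start + size]
            if any(abs(row[k]) > tol for row in rows):
                active_blocks.append(b)
            start += size

        if len(active_blocks) == len(block_sizes):
            labels.append("common")  # underlies all data blocks
        elif active_blocks:
            blocks = ", ".join(str(b) for b in active_blocks)
            labels.append("distinctive (blocks " + blocks + ")")
        else:
            labels.append("empty")  # no nonzero loading anywhere
    return labels


# Hypothetical example: block 0 has 2 survey variables,
# block 1 has 3 wearable-device variables; 3 components.
example_loadings = [
    [0.8, 0.5, 0.0],
    [0.6, 0.0, 0.0],
    [0.7, 0.0, 0.9],
    [0.0, 0.0, 0.4],
    [0.5, 0.0, 0.0],
]
labels = classify_components(example_loadings, block_sizes=[2, 3])
print(labels)
```

In this toy layout the first component loads on variables from both blocks and is labeled common, while the second and third each load on a single block and are labeled distinctive.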

Original language: English
Pages (from-to): 802-820
Number of pages: 19
Journal: Social Science Computer Review
Volume: 39
Issue number: 5
Early online date: 2019
Publication status: Published - 1 Oct 2021

Bibliographical note

Author Information:
Shuai Yuan is a PhD student working at the Department of Methodology and Statistics, Tilburg University. His doctoral project aims to develop new big data analytical methods for social and behavioral sciences.
Kim De Roover works as an assistant professor at the Department of Methodology and Statistics, Tilburg University. In her research, she combines component or factor analysis with clustering techniques to obtain hybrid methods for capturing heterogeneity in underlying covariance structure or measurement models of variables. She can be reached at [email protected]
Michael Dufner is a personality psychologist working at Medical School Berlin. His research examines topics such as self-perception, implicit personality, and social relations. He can be reached at [email protected]
Jaap J. A. Denissen works as a full professor at the Department of Developmental Psychology of Tilburg University. His broad research interests lie in various areas of personality psychology. He can be reached at [email protected]
Katrijn Van Deun works as an associate professor at the Department of Methodology and Statistics, Tilburg University. Her research focuses on the development of novel methods for exploration and prediction with high-dimensional multi-block data. She can be reached at [email protected]

Affiliations:
1 Tilburg University, Tilburg, The Netherlands
2 University of Leipzig, Germany

Corresponding Author:
Shuai Yuan, Tilburg University, Warandelaan 2, Tilburg, The Netherlands. Email: [email protected]

Special Issue:
This article is part of the SSCR special issue on “Big Data in the Behavioral and Social Sciences”, guest edited by Michael Bosnjak (Leibniz Institute for Psychology Information, Trier, Germany).

License:
© The Author(s) 2019. SAGE Publications (0894439319888449). This article is distributed under the terms of the Creative Commons Attribution-NonCommercial 4.0 License (http://www.creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial use, reproduction and distribution of the work without further permission provided the original work is attributed as specified on the SAGE and Open Access pages (https://us.sagepub.com/en-us/nam/open-access-at-sage).

Authors' Note:
The authors thank the editor and the anonymous reviewers for providing helpful comments on earlier drafts of the article. Michael Dufner is now affiliated with Medical School Berlin, Germany.

Data Availability:
The data used in the simulation can be reproduced by running the simulation R script that is available at https://github.com/syuanuvt/CSSCA under the section Simulation. The application data (i.e., personality data) are available on request from Jaap J. A. Denissen ([email protected]).

Declaration of Conflicting Interests:
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Software Information:
The simulation and the empirical analysis were conducted using the R software for statistical computing. The scripts of the analysis are available at https://github.com/syuanuvt/CSSCA. There, users can also freely download the R package ClusterSSCA, which implements the CSSCA algorithm.

Supplemental Material:
The online supplement to the article is available on PsychArchives at the following address: http://dx.doi.org/10.23668/psycharchives.2601

Funding Information:
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was funded by a personal grant from The Netherlands Organization for Scientific Research [NWO-Research Talent 406.17.526] awarded to Shuai Yuan.

Publisher Copyright:
© The Author(s) 2019.

Keywords

  • clustering
  • data integration
  • high-dimensional data analysis
