A Large-Scale Sensitivity Analysis on Latent Embeddings and Dimensionality Reductions for Text Spatializations

Daniel Atzberger*, Tim Cech, Willy Scheibel, Jürgen Döllner, Michael Behrisch, Tobias Schreck

*Corresponding author for this work

Research output: Contribution to journal › Article › Academic › peer-review

Abstract

The semantic similarity between documents of a text corpus can be visualized using map-like metaphors based on two-dimensional scatterplot layouts. These layouts result from a dimensionality reduction on the document-term matrix or a representation within a latent embedding, including topic models. The resulting layout therefore depends on the input data and the hyperparameters of the dimensionality reduction, and changes in either affect it. However, such changes to the layout require additional cognitive effort from the user. In this work, we present a sensitivity study that analyzes the stability of these layouts concerning (1) changes in the text corpora, (2) changes in the hyperparameters, and (3) randomness in the initialization. Our approach has two stages: data measurement and data analysis. First, we derived layouts for the combination of three text corpora and six text embeddings and a grid-search-inspired hyperparameter selection of the dimensionality reductions. Afterward, we quantified the similarity of the layouts through ten metrics, concerning local and global structures and class separation. Second, we analyzed the resulting 42 817 tabular data points in a descriptive statistical analysis. From this, we derived guidelines for informed decisions on the layout algorithm and highlight specific hyperparameter settings. We provide our implementation as a Git repository at hpicgs/Topic-Models-and-DimensionalityReduction-Sensitivity-Study and results as a Zenodo archive at DOI:10.5281/zenodo.12772898.
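
To make the notion of layout stability more concrete, the following is a minimal sketch of one possible local-structure comparison between two layouts of the same documents: the mean Jaccard overlap of each point's k nearest neighbors. This is an illustrative assumption, not necessarily one of the ten metrics used in the study, and the function names `knn_indices` and `mean_knn_jaccard` as well as the synthetic layouts are hypothetical.

```python
import numpy as np


def knn_indices(layout: np.ndarray, k: int) -> np.ndarray:
    """Return, for every point, the indices of its k nearest neighbors (self excluded)."""
    diffs = layout[:, None, :] - layout[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    return np.argsort(dists, axis=1)[:, :k]


def mean_knn_jaccard(layout_a: np.ndarray, layout_b: np.ndarray, k: int = 5) -> float:
    """Mean Jaccard overlap of k-nearest-neighbor sets between two layouts.

    Both layouts must place the same documents in the same row order.
    1.0 means identical local neighborhoods; values near 0 mean unrelated ones.
    """
    nn_a = knn_indices(layout_a, k)
    nn_b = knn_indices(layout_b, k)
    scores = []
    for row_a, row_b in zip(nn_a, nn_b):
        set_a, set_b = set(row_a.tolist()), set(row_b.tolist())
        scores.append(len(set_a & set_b) / len(set_a | set_b))
    return float(np.mean(scores))


# Synthetic stand-ins for two-dimensional layouts (assumed data, not from the study).
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))

# A rotated copy looks different on screen but keeps all pairwise distances.
theta = np.pi / 3
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta), np.cos(theta)]])
rotated = base @ rotation.T

# An independent random layout shares no structure with the first one.
unrelated = rng.normal(size=(200, 2))

same = mean_knn_jaccard(base, rotated)
different = mean_knn_jaccard(base, unrelated)
print(f"rotated copy: {same:.3f}, unrelated layout: {different:.3f}")
```

A neighborhood-based score like this is invariant to rotation and translation of the layout, so it isolates changes in local structure from changes in orientation; global structure and class separation would need separate measures.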

Original language: English
Pages (from-to): 305-315
Journal: IEEE Transactions on Visualization and Computer Graphics
Volume: 31
Issue number: 1
Early online date: 17 Sept 2024
Publication status: Published - Jan 2025

Bibliographical note

Publisher Copyright:
© 1995-2012 IEEE.

Funding

We thank the reviewers for their valuable feedback. This work was partially funded by the Federal Ministry of Education and Research, Germany through grant 01IS22062 and project 16KN086467 funded by the Federal Ministry for Economic Affairs and Climate Action of Germany. The work of Tobias Schreck was partially funded by the Austrian Research Promotion Agency (FFG) within the framework of the flagship project ICT of the Future PRESENT, grant FO999899544.

Funders and funder numbers
Bundesministerium für Wirtschaft und Klimaschutz: 16KN086467
Bundesministerium für Bildung und Forschung: 01IS22062
Österreichische Forschungsförderungsgesellschaft: FO999899544

    Keywords

    • benchmarking
    • dimensionality reductions
    • stability
    • text embeddings
    • text spatializations
    • topic modeling
