Abstract
The semantic similarity between documents of a text corpus can be visualized using map-like metaphors based on two-dimensional scatterplot layouts. These layouts result from a dimensionality reduction on the document-term matrix or a representation within a latent embedding, including topic models. The resulting layout depends on the input data and the hyperparameters of the dimensionality reduction and is therefore affected by changes in them. Such changes to the layout require additional cognitive effort from the user. In this work, we present a sensitivity study that analyzes the stability of these layouts concerning (1) changes in the text corpora, (2) changes in the hyperparameters, and (3) randomness in the initialization. Our approach has two stages: data measurement and data analysis. First, we derived layouts for the combinations of three text corpora, six text embeddings, and a grid-search-inspired hyperparameter selection of the dimensionality reductions. Afterward, we quantified the similarity of the layouts through ten metrics covering local and global structures and class separation. Second, we analyzed the resulting 42 817 tabular data points in a descriptive statistical analysis. From this, we derived guidelines for informed decisions on the layout algorithm and highlighted specific hyperparameter settings. We provide our implementation as a Git repository at hpicgs/Topic-Models-and-DimensionalityReduction-Sensitivity-Study and our results as a Zenodo archive at DOI:10.5281/zenodo.12772898.
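As a rough illustration of what quantifying layout similarity on local structure can look like, the following sketch compares the k-nearest-neighbor sets of two two-dimensional layouts of the same documents. The choice of measure (mean Jaccard overlap of neighbor sets), the function names, and the toy data are assumptions made for illustration; they are not the ten metrics or the implementation used in the study.

```python
import math
import random


def knn_sets(layout, k):
    """Return, for each point, the set of indices of its k nearest neighbors."""
    neighbors = []
    for i, p in enumerate(layout):
        # Sort all other points by Euclidean distance to point i.
        others = sorted(
            (j for j in range(len(layout)) if j != i),
            key=lambda j: math.dist(p, layout[j]),
        )
        neighbors.append(set(others[:k]))
    return neighbors


def neighborhood_overlap(layout_a, layout_b, k=5):
    """Mean Jaccard overlap of k-NN sets; 1.0 means identical neighborhoods."""
    if len(layout_a) != len(layout_b):
        raise ValueError("layouts must contain the same documents")
    sets_a = knn_sets(layout_a, k)
    sets_b = knn_sets(layout_b, k)
    scores = [len(a & b) / len(a | b) for a, b in zip(sets_a, sets_b)]
    return sum(scores) / len(scores)


def rotate(layout, angle):
    """Rotate every point of a layout around the origin."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y) for x, y in layout]


# Toy data (hypothetical): 50 random "documents" in the plane.
rng = random.Random(0)
base = [(rng.random(), rng.random()) for _ in range(50)]

# A rotated copy keeps all pairwise distances, so neighborhoods are preserved.
rotated = rotate(base, math.pi / 3)

# Reassigning positions to documents at random destroys the neighborhoods.
shuffled = base[:]
rng.shuffle(shuffled)

same = neighborhood_overlap(base, rotated)
different = neighborhood_overlap(base, shuffled)
```

In this sketch, a rotated layout scores close to 1 while a layout with randomly reassigned positions scores much lower. A neighbor-based measure of this kind ignores rotation and translation of the layout and only reflects changes in which documents end up close to each other.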
Original language | English |
---|---|
Pages (from-to) | 305-315 |
Journal | IEEE Transactions on Visualization and Computer Graphics |
Volume | 31 |
Issue number | 1 |
Early online date | 17 Sept 2024 |
Publication status | Published - Jan 2025 |
Bibliographical note
Publisher Copyright: © 1995-2012 IEEE.
Funding
We thank the reviewers for their valuable feedback. This work was partially funded by the Federal Ministry of Education and Research, Germany, through grant 01IS22062 and by project 16KN086467, funded by the Federal Ministry for Economic Affairs and Climate Action of Germany. The work of Tobias Schreck was partially funded by the Austrian Research Promotion Agency (FFG) within the framework of the flagship project ICT of the Future PRESENT, grant FO999899544.
Funders | Funder number |
---|---|
Bundesministerium für Wirtschaft und Klimaschutz | 16KN086467 |
Bundesministerium für Bildung und Forschung | 01IS22062 |
Österreichische Forschungsförderungsgesellschaft | FO999899544 |
Keywords
- benchmarking
- dimensionality reductions
- stability
- text embeddings
- text spatializations
- topic modeling