Some studies in correspondence analysis of texts

Research output: Thesis › Doctoral thesis 1 (Research UU / Graduation UU)

Abstract

In text mining and natural language processing (NLP) applications, a vector representation of text data is key to designing an effective machine learning algorithm. Document-term and word-context matrices are two important matrices for representing texts as vectors. These two matrices are usually sparse and high-dimensional. The process of creating low-dimensional representations of texts is referred to as dimensionality reduction. Because dimensionality reduction determines the representation of text data, it is very important. In the machine learning literature, little to no attention has been paid to a popular statistical technique, correspondence analysis (CA), whereas other dimensionality reduction methods, such as latent semantic analysis (LSA), receive much more attention. This project studies whether CA is a good dimensionality reduction technique in text mining and NLP. Chapter 2 theoretically compares CA and LSA of a document-term matrix. In addition, the performance of CA is compared to that of different versions of LSA in the context of text categorization and authorship attribution, using mainly a measure of accuracy as the criterion. From a theoretical point of view, CA appears to have more attractive properties than LSA. For example, in LSA the effect of the margins as well as the dependence between documents and terms is part of the matrix that is analyzed, while CA eliminates the effect of the margins so that the solution displays only the dependence. The results for four empirical datasets show that CA can obtain higher accuracies on text categorization and authorship attribution than the different versions of LSA. Chapter 3 also studies the performance of CA and LSA in the context of document-term matrices. CA and LSA are empirically compared in information retrieval by calculating the mean average precision. An attempt is made to improve CA by applying the two kinds of weighting that are also used in LSA.
These are weighting schemes for the elements of the document-term matrix and the adjustment of the singular value weighting exponent. The results for four empirical datasets show that CA always performs better than LSA. Weighting the elements of the raw data matrix can improve CA; however, the improvement is data dependent and small. Adjusting the singular value weighting exponent often improves the performance of CA, but the extent of the improvement depends on the dataset and the number of dimensions. Chapter 4 compares CA with PPMI-SVD, GloVe, and SGNS. Theoretically, CA, like PPMI-SVD, GloVe, and SGNS, can be linked to the factorization of the PMI matrix. An attempt is made to improve CA by making use of weighting schemes for the elements of the word-context matrix. An empirical comparison on word similarity tasks shows that the overall results for CA with the two weighting schemes are slightly better than those of PPMI-SVD, GloVe, and SGNS. CA is susceptible to outliers. In Chapter 5, the so-called reconstitution algorithm is introduced to cope with outlying cells. This algorithm reduces the contribution of the outlying cells in CA. The reconstitution algorithm is compared with two alternative methods for handling outliers: the supplementary points method and MacroPCA. It is shown that the proposed strategy works well. Summarizing, we have shown that CA matches or outperforms techniques that are now commonly used in computing science. We think that the performance of CA in the studies of this dissertation shows that CA deserves more attention in this field.
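The contrast drawn above between LSA and CA can be sketched in a few lines of NumPy. This is a toy illustration with hypothetical counts, not an example or result from the dissertation: LSA applies a truncated SVD to the document-term matrix itself (so the margins remain part of what is analyzed), while CA applies the SVD to the matrix of standardized residuals, from which the effect of the row and column margins has been removed.

```python
import numpy as np

# Toy document-term count matrix (4 documents x 5 terms); counts are hypothetical.
N = np.array([
    [3, 0, 1, 0, 2],
    [1, 4, 0, 2, 0],
    [0, 1, 5, 1, 1],
    [2, 0, 1, 3, 0],
], dtype=float)

# LSA: truncated SVD of the document-term matrix itself.
U, s, Vt = np.linalg.svd(N, full_matrices=False)
lsa_docs = U[:, :2] * s[:2]           # 2-dimensional document representation

# CA: SVD of the standardized residuals, which removes the margin effects
# so that only the dependence between documents and terms is analyzed.
P = N / N.sum()                       # correspondence matrix
r = P.sum(axis=1)                     # row masses (document margins)
c = P.sum(axis=0)                     # column masses (term margins)
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
Uc, sc, Vct = np.linalg.svd(S, full_matrices=False)
ca_docs = (Uc[:, :2] * sc[:2]) / np.sqrt(r)[:, None]  # principal row coordinates

print(lsa_docs.shape, ca_docs.shape)
```

Because the margins are eliminated, the CA residual matrix loses one dimension (its smallest singular value is zero), and the row coordinates are centered at the weighted origin; neither property holds for the plain SVD that LSA uses.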
Original language: English
Qualification: Doctor of Philosophy
Awarding Institution
  • Utrecht University
Supervisors/Advisors
  • van der Heijden, Peter, Supervisor
  • Oberski, Daniel, Supervisor
  • Hessen, David, Co-supervisor
Award date: 18 Oct 2024
Publisher
Print ISBNs: 978-90-393-7739-0
DOIs
Publication status: Published - 18 Oct 2024

Keywords

  • Statistical method
  • Text mining
  • Natural language processing
  • Correspondence analysis
  • Latent semantic analysis
  • Dimensionality reduction
  • Word2Vec
  • GloVe
  • Outliers
