Decoding 16th-Century Letters: From Topic Models to GPT-Based Keyword Mapping

Phillip Ströbel*, Stefan Aderhold, Ramona Roller

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Chapter › Academic › peer-review

Abstract

Probabilistic topic models for categorising or exploring large text corpora are notoriously difficult to interpret. Making sense of them has thus justifiably been compared to “reading tea leaves.” Involving humans in labelling topics consisting of words is feasible but time-consuming, especially if one infers many topics from a text collection. Moreover, it is a cognitively demanding task, and domain knowledge might be required depending on the text corpus. We thus examine how using a Large Language Model (LLM) offers support in text classification. We compare how the LLM summarises topics produced by Latent Dirichlet Allocation, Non-negative Matrix Factorisation and BERTopic. We investigate which topic modelling technique provides the best representations by applying these models to a 16th-century correspondence corpus in Latin and Early New High German and inferring keywords from the topics in a low-resource setting. We experiment with including domain knowledge in the form of already existing keyword lists. Our main findings are that the LLM alone already provides usable topics. However, guiding the LLM towards what is expected improves interpretability. We further highlight that using only nouns and proper nouns makes for good topic representations.
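To illustrate one of the topic modelling techniques the abstract names, the sketch below runs Non-negative Matrix Factorisation on a toy document-term matrix and prints the top words per topic. This is not the authors' code: the vocabulary, counts, and the plain multiplicative-update implementation (rather than a library such as scikit-learn) are assumptions chosen to keep the example self-contained.

```python
import numpy as np

def nmf_topics(doc_term, n_topics=2, n_iter=200, seed=0, eps=1e-9):
    """Factorise a document-term matrix V ≈ W @ H using Lee-Seung
    multiplicative updates. Rows of H are topics over the vocabulary."""
    rng = np.random.default_rng(seed)
    n_docs, n_terms = doc_term.shape
    W = rng.random((n_docs, n_topics))   # document-topic weights
    H = rng.random((n_topics, n_terms))  # topic-term weights
    for _ in range(n_iter):
        H *= (W.T @ doc_term) / (W.T @ W @ H + eps)
        W *= (doc_term @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy corpus with two rough themes (travel vs. theology); the Latin
# vocabulary and the counts are invented purely for illustration.
vocab = ["iter", "equus", "via", "deus", "fides", "gratia"]
V = np.array([
    [3, 2, 4, 0, 0, 1],
    [2, 3, 3, 1, 0, 0],
    [0, 0, 1, 4, 3, 2],
    [1, 0, 0, 3, 4, 3],
], dtype=float)

W, H = nmf_topics(V, n_topics=2)
for k, row in enumerate(H):
    top = [vocab[i] for i in np.argsort(row)[::-1][:3]]
    print(f"topic {k}: {', '.join(top)}")
```

In the paper's setting, each topic's top words would then be passed to an LLM for summarisation into a keyword, optionally guided by an existing keyword list.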
Original language: English
Title of host publication: 20th Conference on Natural Language Processing (KONVENS 2024)
Editors: Pedro Henrique Luz de Araujo, Andreas Baumann, Dagmar Gromann, Brigitte Krenn, Benjamin Roth, Michael Wiegand
Place of publication: Vienna
Publisher: Association for Computational Linguistics (ACL)
Pages: 209–221
Number of pages: 11
Publication status: Published - 10 Sept 2024
