
Say the Same but Differently: Computational Approaches to Stylistic Variation and Paraphrasing

    Research output: Thesis, Doctoral thesis 1 (Research UU / Graduation UU)

    Abstract

    Consider two Dutch sentences: "Ik ben een Utrechter" and "Ik ben een Utrechtenaar". Even though their surface-level presentation is different, a translation tool like DeepL might translate both of these sentences to "I am an Utrecht resident". This translation is perfectly reasonable, as both "Utrechter" and "Utrechtenaar" refer to an inhabitant of the city of Utrecht. In this case, DeepL can be said to be robust to language variation: it treats both statements equally. However, there are also many cases in which NLP models benefit from being sensitive to language variation. The Utrecht example illustrates this: historically, "Utrechtenaar" was the more common term. However, it has now been largely replaced by "Utrechter" in everyday language, as "Utrechtenaar" has been associated with gay men since the Utrecht sodomy trials (around 1730). Today, when someone uses "Utrechtenaar" rather than "Utrechter" to refer to themselves, we might learn more about them: for example, that they are more likely to be part of the local queer community. Imagine a newspaper article in which two people refer to themselves as "Utrechter" and "Utrechtenaar", respectively: translating both terms as "resident of Utrecht" could obscure subtle differences in background and social identity, potentially leading to confusion or a loss of narrative nuance. In this dissertation, I develop methods to make language models both more sensitive and more robust to language variation. In Chapter 3, I examine tokenizers, a fundamental building block of language models, with respect to their sensitivity and robustness to language variation. I show that it is important to take language variation into account at all stages of language model development. In Chapters 4 and 5, I develop vector representations that are sensitive to one particular aspect of language variation: the style of a text.
In Chapter 4, I propose the STyle Evaluation Framework (STEL), the first systematic method for evaluating how sensitive NLP methods are to stylistic variation in text. In Chapter 5, I train neural text representations that, unlike previous approaches, capture linguistic style independently of content and achieve strong results on STEL. The resulting vector representations have already found a wide range of applications in the NLP community. In Chapter 6, I introduce a novel task: the detection of cross-speaker paraphrases in dialogue. For this, I train crowdworkers using my own iterative procedure for classifying paraphrases. My results show that both humans and NLP models face considerable challenges in robustly recognizing utterances that vary linguistically but have the same content. I hope that this work will encourage the NLP community to take language variation into account more when developing NLP methods.
    Original language: English
    Qualification: Doctor of Philosophy
    Awarding Institution
    • Utrecht University
    Supervisors/Advisors
    • van Deemter, Kees, Supervisor
    • Nguyen, Dong, Co-supervisor
    Award date: 13 Oct 2025
    Publisher
    Print ISBNs: 978-90-393-7935-6
    DOIs
    Publication status: Published - 13 Oct 2025

    Keywords

    • natural language processing
    • language variation
    • paraphrases in dialog
    • linguistic style
    • style embedding
    • tokenizer
    • evaluation
