Topic Discovery from Textual Data: Machine Learning and Natural Language Processing for Knowledge Discovery in the Fisheries Domain

S Syed

    Research output: Thesis, Doctoral thesis 1 (Research UU / Graduation UU)

    Abstract

    It is estimated that the world’s data will increase to roughly 160 billion terabytes by 2025, with most of that data occurring in an unstructured form. Today, we have already reached the point where more data is being produced than can be physically stored. To ingest all this data and to construct valuable knowledge from it, new computational tools and algorithms are needed, especially since manual probing of the data is slow, expensive, and subjective.
    For unstructured data, such as text in documents, probabilistic topic models are an active field of research. Topic models are techniques to automatically uncover the hidden or latent topics present within a collection of documents. Topic models can infer the topical content of thousands or millions of documents without prior labeling or annotation. This unsupervised nature makes probabilistic topic models a useful tool for applied data scientists to interpret and examine large volumes of documents for extracting new and valuable knowledge.
    This dissertation investigates how to apply topic models to large collections of documents, and how to interpret the resulting topics, optimally and efficiently. Specifically, it shows how different types of textual data, pre-processing steps, and hyper-parameter settings can affect the quality of the derived latent topics. The results presented in this dissertation provide a starting point for researchers who want to apply topic models with scientific rigor to scientific publications.
    Original language: English
    Awarding Institution
    • Utrecht University
    Supervisors/Advisors
    • Brinkkemper, Sjaak, Primary supervisor
    • Spruit, M.R., Co-supervisor
    Award date: 20 Mar 2019
    Publisher
    Print ISBNs: 978-90-393-7086-5
    Publication status: Published - 20 Mar 2019

    Keywords

    • Topic modeling
    • Latent Dirichlet Allocation
    • machine learning
    • fisheries science
