Efficient and General Text Classification: An Active Learning Approach Using Active Learning and NLP to Aid Processes Such as Journalistic Investigations And document Analysis

Micha van Grinsven*, Matthieu Brinkhuis, Georg Krempl, Joop Snijder

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; Academic; peer-review

Abstract

Active Learning (AL) has shown advantages over Passive Learning in domains where labeled data is costly to obtain. Nevertheless, it is relatively underused in real-world applications for textual data. In this research, AL is applied to two unbalanced, real-world datasets: one on the now-defunct energy company Enron and one on the Dutch oil company Shell. The Enron data is labeled on the presence of information on logistics in documents, whereas the Shell dataset is part of a current investigation by the journalism bureau Follow the Money. In this paper, we attempt to aid such a journalistic investigation with an Active Machine Learning approach. This approach assists the investigator (the oracle) in identifying documents that belong to a storyline in the dataset. Documents are classified using only the textual data in these datasets. As an initial test, the public Enron dataset with its large number of labels is used. Subsequently, the method is applied to a real-world case with the Shell dataset. During testing, the highest F1-score of Passive Learning is matched by an Active Learning approach that uses only 42% of the data necessary for Passive Learning. Furthermore, combining Active Learning and Natural Language Processing on the Shell data yields an F1-score of 0.87 and an accuracy of 0.91 using only 5% of labeled data with a logistic regression model. This shows that Active Learning can aid a journalistic investigation and the development of storylines. ASReview is used to facilitate this research. The setup presented in this research could be applied to almost any textual data classification problem.
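
As a rough illustration of the pool-based Active Learning loop summarized in the abstract, the sketch below trains a small logistic regression on a few labeled documents, asks an oracle for the label of the document the model is least sure about, and repeats. It is not the authors' pipeline and does not use the ASReview API. The toy documents, the bag-of-words features, the uncertainty-sampling query strategy, and all function names are assumptions made purely for illustration.

```python
import math

# Hypothetical toy corpus: 1 = mentions logistics, 0 = does not.
DOCS = [
    ("shipment schedule for the pipeline delivery", 1),
    ("truck delivery and shipment routing", 1),
    ("warehouse shipment delayed again", 1),
    ("lunch menu for friday", 0),
    ("holiday party invitation", 0),
    ("meeting notes about the budget", 0),
    ("quarterly budget review meeting", 0),
    ("friday party at the office", 0),
    ("delivery schedule update for trucks", 1),
    ("office lunch and holiday plans", 0),
]

def featurize(text, vocab):
    """Binary bag-of-words vector over a fixed vocabulary."""
    words = set(text.lower().split())
    return [1.0 if w in words else 0.0 for w in vocab]

def predict_proba(weights, bias, x):
    """Probability of the positive class under a logistic model."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(X, y, epochs=200, lr=0.5):
    """Plain stochastic gradient descent on the logistic loss."""
    weights, bias = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, label in zip(X, y):
            error = predict_proba(weights, bias, x) - label
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias

def active_learning(X, oracle_labels, seed_idx, budget):
    """Query `budget` documents, one per round, by uncertainty sampling."""
    labeled = list(seed_idx)
    pool = [i for i in range(len(X)) if i not in labeled]
    for _ in range(budget):
        w, b = train_logreg([X[i] for i in labeled],
                            [oracle_labels[i] for i in labeled])
        # Assumed query strategy: pick the document closest to p = 0.5.
        query = min(pool, key=lambda i: abs(predict_proba(w, b, X[i]) - 0.5))
        labeled.append(query)  # the oracle reveals oracle_labels[query]
        pool.remove(query)
    w, b = train_logreg([X[i] for i in labeled],
                        [oracle_labels[i] for i in labeled])
    return w, b, labeled

def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

vocab = sorted({w for text, _ in DOCS for w in text.lower().split()})
X = [featurize(text, vocab) for text, _ in DOCS]
y = [label for _, label in DOCS]

# Start from one positive and one negative seed document, then query 3 more.
weights, bias, labeled = active_learning(X, y, seed_idx=[0, 3], budget=3)
preds = [1 if predict_proba(weights, bias, x) >= 0.5 else 0 for x in X]
score = f1_score(y, preds)
print(f"labeled {len(labeled)} of {len(X)} documents, F1 = {score:.2f}")
```

The F1-score is reported alongside accuracy because, on unbalanced data, a model that always predicts the majority class can reach high accuracy while finding none of the relevant documents.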
Original language: English
Title of host publication: Machine Learning and Principles and Practice of Knowledge Discovery in Databases - International Workshops of ECML PKDD 2023, Revised Selected Papers
Editors: Rosa Meo, Fabrizio Silvestri
Publisher: Springer Nature
Pages: 105-120
Number of pages: 16
ISBN (Electronic): 978-3-031-74627-7
ISBN (Print): 978-3-031-74626-0
Publication status: Published - 1 Jan 2025
Event: International Workshops of ECML PKDD 2023 - Turin, Italy
Duration: 18 Sept 2023 to 22 Sept 2023

Publication series

Name: Communications in Computer and Information Science
Volume: 2134 CCIS
ISSN (Print): 1865-0929
ISSN (Electronic): 1865-0937

Conference

Conference: International Workshops of ECML PKDD 2023
Country/Territory: Italy
City: Turin
Period: 18/09/23 to 22/09/23

Bibliographical note

Publisher Copyright:
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2025.

Keywords

  • ASReview
  • Active Learning
  • Journalistic Investigations
  • Natural Language Processing
  • Pipeline
  • Text classification
  • Text mining
