Abstract
Active Learning (AL) has shown advantages over Passive Learning in domains where labeled data is costly to obtain. Nevertheless, it is relatively underused in real-world applications involving textual data. In this research, AL is applied to two unbalanced, real-world datasets: one on the now-defunct energy company Enron and one on the Dutch oil company Shell. The Enron data is labeled according to whether a document contains information on logistics, whereas the Shell dataset is part of a current investigation by Follow the Money, a journalism bureau. In this paper, we attempt to aid such a journalistic investigation with an Active Machine Learning approach. This approach helps the investigator (the oracle) identify documents that belong to a storyline in the dataset. Documents are classified using only the textual data in these datasets. As an initial test, the public Enron dataset, with its large number of labels, is used. Subsequently, the method is applied to a real-world case with the Shell dataset. During testing, the highest F1-score of Passive Learning is matched by an Active Learning approach that uses only 42% of the data necessary for Passive Learning. Furthermore, by combining Active Learning and Natural Language Processing on the Shell data, an F1-score of 0.87 together with an accuracy of 0.91 can be achieved using only 5% of labeled data with a logistic regression model. This shows that Active Learning can aid in a journalistic investigation and the development of storylines. ASReview is used to facilitate this research. The setup presented in this research could be applied to almost any text classification problem.
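The loop described in the abstract, in which a model repeatedly asks the investigator (the oracle) to label the document it is least sure about, can be illustrated with a minimal sketch. This is not the ASReview API and not the authors' pipeline: the function names (`featurize`, `train_logreg`, `active_learning`), the bag-of-words features, the uncertainty-sampling query rule, and the toy documents are all assumptions chosen to keep the example self-contained with only the Python standard library.

```python
import math
from collections import Counter

def featurize(text, vocab):
    """Bag-of-words count vector over a fixed vocabulary (illustrative only)."""
    counts = Counter(text.lower().split())
    return [float(counts[w]) for w in vocab]

def predict_proba(weights, bias, x):
    """Logistic regression probability that a document is relevant."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(X, y, epochs=200, lr=0.5):
    """Fit logistic regression with plain batch gradient descent."""
    weights, bias, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for x, label in zip(X, y):
            err = predict_proba(weights, bias, x) - label
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def active_learning(docs, oracle, seed_idx, budget):
    """Pool-based active learning with uncertainty sampling (hypothetical helper).

    Starting from a few seed labels, retrain on everything labeled so far and
    ask the oracle about the unlabeled document whose predicted probability
    is closest to 0.5, until the labeling budget is spent.
    """
    vocab = sorted({w for d in docs for w in d.lower().split()})
    X = [featurize(d, vocab) for d in docs]
    labeled = {i: oracle(i) for i in seed_idx}
    for _ in range(budget):
        idx = sorted(labeled)
        weights, bias = train_logreg([X[i] for i in idx], [labeled[i] for i in idx])
        pool = [i for i in range(len(docs)) if i not in labeled]
        if not pool:
            break
        query = min(pool, key=lambda i: abs(predict_proba(weights, bias, X[i]) - 0.5))
        labeled[query] = oracle(query)
    idx = sorted(labeled)
    weights, bias = train_logreg([X[i] for i in idx], [labeled[i] for i in idx])
    return weights, bias, vocab, labeled

# Invented toy documents; 1 = mentions logistics, 0 = does not.
docs = [
    "pipeline shipment schedule for gas delivery",
    "lunch plans for friday",
    "tanker shipment delayed at port",
    "holiday party invitation",
    "delivery schedule for crude shipment",
    "friday party lunch menu",
]
truth = [1, 0, 1, 0, 1, 0]

weights, bias, vocab, labeled = active_learning(
    docs, oracle=lambda i: truth[i], seed_idx=[0, 1], budget=2
)
print(f"labeled {len(labeled)} of {len(docs)} documents")
```

In a real investigation the oracle would be a person reading the queried document, and the features and model would come from a dedicated tool such as ASReview rather than this hand-written trainer; the sketch only shows why the number of labels requested can stay well below the size of the dataset.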
Original language | English |
---|---|
Title of host publication | Machine Learning and Principles and Practice of Knowledge Discovery in Databases - International Workshops of ECML PKDD 2023, Revised Selected Papers |
Editors | Rosa Meo, Fabrizio Silvestri |
Publisher | Springer Nature |
Pages | 105-120 |
Number of pages | 16 |
ISBN (Electronic) | 978-3-031-74627-7 |
ISBN (Print) | 978-3-031-74626-0 |
Publication status | Published - 1 Jan 2025 |
Event | International Workshops of ECML PKDD 2023 - Turin, Italy; Duration: 18 Sept 2023 → 22 Sept 2023 |
Publication series
Name | Communications in Computer and Information Science |
---|---|
Volume | 2134 CCIS |
ISSN (Print) | 1865-0929 |
ISSN (Electronic) | 1865-0937 |
Conference
Conference | International Workshops of ECML PKDD 2023 |
---|---|
Country/Territory | Italy |
City | Turin |
Period | 18/09/23 → 22/09/23 |
Bibliographical note
Publisher Copyright: © The Author(s), under exclusive license to Springer Nature Switzerland AG 2025.
Keywords
- ASReview
- Active Learning
- Journalistic Investigations
- Natural Language Processing
- Pipeline
- Text classification
- Text mining