The more the better? The effect of domain-specific dataset on entity extraction from Dutch criminal records

Amber Norder, Gizem Sogancioglu, Heysem Kaya

Research output: Contribution to conference, Paper, Academic

Abstract

The Dutch police force generates a very large number of documents, such as transcripts of interrogations, evidence findings, and statements of people involved, all of which need to be read and processed by analysts. Automating entity extraction from these documents would greatly help the police force. Neural network-based approaches using contextual word embeddings are considered the current state of the art for named entity recognition (NER) in Dutch. Domain-independent NER datasets are available in the literature, as well as pre-trained NER models. However, earlier studies show that domain-independent models do not work well for domain-specific tasks. As annotation is highly costly, in this study we train a set of NER models based on BERTje embeddings, adding varying amounts of police data to the domain-independent set to observe the effect of the domain-specific dataset on training. We follow a training, validation, and test split to ensure a proper experimental protocol. We observe that the performance gain diminishes as the number of target-domain documents in the training set increases, and that performance on the validation set stabilizes at around 250-300 documents. The NER system performs better on the held-out test set (85% macro-average F1 score over five entity categories) than on the validation set, showing the generalization power of the investigated framework.
Original language: English
Number of pages: 6
Publication status: Published - 2021
Event: The 31st Meeting of Computational Linguistics in The Netherlands - (online), Ghent, Belgium
Duration: 9 Jul 2021 - 9 Jul 2021
https://www.clin31.ugent.be/

Conference

Conference: The 31st Meeting of Computational Linguistics in The Netherlands
Abbreviated title: CLIN 31
Country/Territory: Belgium
City: Ghent
Period: 9/07/21 - 9/07/21
Keywords

  • Natural Language Processing
  • Named Entity Recognition
  • Coreference Resolution
  • Dutch NLP
