Abstract
The Dutch police force generates a very large volume of documents, such as transcripts of interrogations, evidence findings, and statements of people involved, all of which need to be read and processed by analysts. Automating entity extraction from these documents would greatly help the police force. Neural network-based approaches using contextual word embeddings are considered the current state of the art for named entity recognition (NER) in Dutch. Domain-independent NER datasets as well as pre-trained NER models are available in the literature. However, earlier studies show that domain-independent models do not work well for domain-specific tasks. As annotation is highly costly, in this study we train a set of NER models based on BERTje embeddings, varying the size of the police dataset added to the domain-independent set, to observe the effect of domain-specific data in training. We follow a training, validation, and test split to ensure a proper experimental protocol. We observe that the performance gain diminishes as the number of target-domain documents in the training set grows, and that performance on the validation set stabilizes at around 250-300 documents. The NER system performs better on the held-out test set (85% macro-average F1 score over five entity categories) than on the validation set, showing the generalization power of the investigated framework.
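The reported metric, macro-average F1, is the unweighted mean of the per-category F1 scores, so each entity category counts equally regardless of how often it occurs. The following is a minimal sketch of that computation, not the evaluation code used in the study. The function names, the category labels, and the counts are all hypothetical and chosen only for illustration; they are not results from the paper.

```python
from typing import Dict, Tuple


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 for one category from true-positive, false-positive and false-negative counts."""
    denominator = 2 * tp + fp + fn
    # By convention here, a category with no predictions and no gold entities scores 0.
    return 2 * tp / denominator if denominator else 0.0


def macro_f1(counts: Dict[str, Tuple[int, int, int]]) -> float:
    """Unweighted mean of the per-category F1 scores."""
    if not counts:
        raise ValueError("at least one category is required")
    return sum(f1_score(*c) for c in counts.values()) / len(counts)


# Hypothetical (tp, fp, fn) counts for two illustrative categories.
example_counts = {
    "PERSON": (8, 2, 2),    # F1 = 16 / 20 = 0.8
    "LOCATION": (5, 5, 5),  # F1 = 10 / 20 = 0.5
}

print(macro_f1(example_counts))
```

Because the mean is unweighted, a rare category with a low score pulls the macro average down as much as a frequent one would, which is why it is a common choice when all entity categories matter equally.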
Original language | English |
---|---|
Number of pages | 6 |
Publication status | Published - 2021 |
Event | The 31st Meeting of Computational Linguistics in The Netherlands (online), Ghent, Belgium. Duration: 9 Jul 2021 → 9 Jul 2021. https://www.clin31.ugent.be/ |
Conference
Conference | The 31st Meeting of Computational Linguistics in The Netherlands |
---|---|
Abbreviated title | CLIN 31 |
Country/Territory | Belgium |
City | Ghent |
Period | 9/07/21 → 9/07/21 |
Internet address | https://www.clin31.ugent.be/ |
Keywords
- Natural Language Processing
- Named Entity Recognition
- Coreference Resolution
- Dutch NLP