Abstract
Creating meaningful text embeddings with BERT-based language models involves pre-training on large amounts of data. For domain-specific use cases where data is scarce (e.g., the law enforcement domain), it might not be feasible to pre-train a whole new language model. In this paper, we examine how extending BERT-based tokenizers and further pre-training BERT-based models can benefit downstream classification tasks. As a proxy for domain-specific data, we use the European Court of Human Rights (ECtHR) dataset. We find that for downstream tasks, further pre-training a language model on a small domain dataset can rival models that are completely retrained on large domain datasets. This indicates that completely retraining a language model may not be necessary to improve downstream task performance. Instead, small adaptations to existing state-of-the-art language models like BERT may suffice.
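The paper itself is not reproduced on this page, but as a rough sketch of the approach the abstract describes (extending a BERT tokenizer and further pre-training the model), the snippet below uses the Hugging Face Transformers and Datasets libraries to add domain tokens to a BERT vocabulary and continue masked-language-model pre-training on a small plain-text corpus. The token list and corpus file are hypothetical placeholders, not the paper's actual setup.

```python
# Minimal sketch: extend a BERT tokenizer with domain-specific tokens,
# then continue masked-language-model (MLM) pre-training on a small
# domain corpus. Token list and corpus path are illustrative only.
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Hypothetical legal-domain terms the base vocabulary would split poorly.
new_tokens = ["non-refoulement", "ratione", "ECtHR"]
tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))  # grow the embedding matrix

# Hypothetical plain-text domain corpus, one document per line.
dataset = load_dataset("text", data_files={"train": "domain_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# Standard 15% token masking for continued MLM pre-training.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-further-pretrained", num_train_epochs=1),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```

The further pre-trained checkpoint would then be fine-tuned on the downstream classification task in the usual way (e.g., via `AutoModelForSequenceClassification`), which is where the abstract's comparison against fully retrained domain models applies.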
Original language | English |
---|---|
Title of host publication | Proceedings of the 6th Workshop on Automated Semantic Analysis of Information in Legal Text co-located with the 19th International Conference on Artificial Intelligence and Law (ICAIL 2023), Braga, Portugal, 23rd June, 2023 |
Editors | Francesca Lagioia, Jack Mumford, Daphne Odekerken, Hannes Westermann |
Publisher | CEUR WS |
Pages | 13-18 |
Number of pages | 6 |
Volume | 3441 |
Publication status | Published - 2023 |
Event | 19th International Conference on Artificial Intelligence and Law, University of Minho, Braga, Portugal. Duration: 19 Jun 2023 → 23 Jun 2023. Conference number: 19. https://icail2023.di.uminho.pt/ |
Publication series
Name | CEUR Workshop Proceedings |
---|---|
ISSN (Print) | 1613-0073 |
Conference
Conference | 19th International Conference on Artificial Intelligence and Law |
---|---|
Abbreviated title | ICAIL |
Country/Territory | Portugal |
City | Braga |
Period | 19/06/23 → 23/06/23 |
Internet address | https://icail2023.di.uminho.pt/ |
Keywords
- Transformers
- BERT
- Language Models
- Legal Text Classification
- ECtHR dataset
- Text Embeddings