Abstract
Creating meaningful text embeddings with BERT-based language models involves pre-training on large amounts of data. For domain-specific use cases where data is scarce (e.g., the law enforcement domain), pre-training an entirely new language model may not be feasible. In this paper, we examine how extending BERT-based tokenizers and further pre-training BERT-based models can benefit downstream classification tasks. As a proxy for domain-specific data, we use the European Court of Human Rights (ECtHR) dataset. We find that, for downstream tasks, further pre-training a language model on a small domain dataset can rival models that are retrained from scratch on large domain datasets. This indicates that completely retraining a language model may not be necessary to improve downstream task performance; instead, small adaptations to existing state-of-the-art language models such as BERT may suffice.
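The tokenizer-extension step summarised in the abstract can be sketched as follows. This is a generic illustration, not the authors' code: the toy vocabulary, the domain tokens, and the mean-plus-noise initialisation heuristic are assumptions for the example (in practice one would call `tokenizer.add_tokens` and `model.resize_token_embeddings` in the Hugging Face `transformers` API).

```python
import numpy as np

def extend_vocab_and_embeddings(vocab, emb, new_tokens, rng=None):
    """Add domain-specific tokens to a vocabulary and grow the
    embedding matrix to match.

    New rows are initialised with the mean of the existing embeddings
    plus small noise (a common heuristic when extending a pre-trained
    model), so that further pre-training can adapt them.
    """
    rng = rng or np.random.default_rng(0)
    vocab = dict(vocab)          # copy; leave the caller's vocab intact
    rows = [emb]
    for tok in new_tokens:
        if tok in vocab:
            continue             # token already known, nothing to add
        vocab[tok] = len(vocab)  # next free id
        new_row = emb.mean(axis=0) + 0.01 * rng.standard_normal(emb.shape[1])
        rows.append(new_row[None, :])
    return vocab, np.concatenate(rows, axis=0)

# Toy base vocabulary with 4-dimensional embeddings.
base_vocab = {"[PAD]": 0, "[UNK]": 1, "the": 2, "court": 3}
base_emb = np.random.default_rng(1).standard_normal((4, 4))

# Hypothetical domain tokens for a legal corpus; "court" is already known.
vocab, emb = extend_vocab_and_embeddings(
    base_vocab, base_emb, ["applicant", "convention", "court"]
)
print(len(vocab), emb.shape)  # 6 (6, 4)
```

After this step, the resized model would be further pre-trained (e.g., with masked language modelling) on the small domain corpus, which is the adaptation the paper compares against full retraining.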
| Original language | English |
|---|---|
| Title of host publication | Proceedings of the 6th Workshop on Automated Semantic Analysis of Information in Legal Text co-located with the 19th International Conference on Artificial Intelligence and Law (ICAIL 2023), Braga, Portugal, 23rd June, 2023 |
| Editors | Francesca Lagioia, Jack Mumford, Daphne Odekerken, Hannes Westermann |
| Publisher | CEUR WS |
| Pages | 13-18 |
| Number of pages | 6 |
| Volume | 3441 |
| Publication status | Published - 2023 |
| Event | 19th International Conference on Artificial Intelligence and Law, University of Minho, Braga, Portugal |
| Duration | 19 Jun 2023 → 23 Jun 2023 |
| Conference number | 19 |
| Internet address | https://icail2023.di.uminho.pt/ |
Publication series
| Name | CEUR Workshop Proceedings |
|---|---|
| ISSN (Print) | 1613-0073 |
Conference
| Conference | 19th International Conference on Artificial Intelligence and Law |
|---|---|
| Abbreviated title | ICAIL |
| Country/Territory | Portugal |
| City | Braga |
| Period | 19/06/23 → 23/06/23 |
| Internet address | https://icail2023.di.uminho.pt/ |
Bibliographical note
Publisher Copyright: © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
Keywords
- Transformers
- BERT
- Language Models
- Legal Text Classification
- ECtHR dataset
- Text Embeddings