Towards Meaningful Paragraph Embeddings for Data-Scarce Domains: A Case Study in the Legal Domain

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Academic › peer-review

Abstract

Creating meaningful text embeddings using BERT-based language models involves pre-training on large amounts of data. For domain-specific use cases where data is scarce (e.g., the law enforcement domain), it might not be feasible to pre-train a whole new language model. In this paper, we examine how extending BERT-based tokenizers and further pre-training BERT-based models can benefit downstream classification tasks. As a proxy for domain-specific data, we use the European Court of Human Rights (ECtHR) dataset. We find that for downstream tasks, further pre-training a language model on a small domain dataset can rival models that are completely retrained on large domain datasets. This indicates that completely retraining a language model may not be necessary to improve downstream task performance. Instead, small adaptations to existing state-of-the-art language models like BERT may suffice.
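The record does not include the paper's exact tokenizer-extension procedure, but the idea can be illustrated with a minimal sketch of BERT-style greedy longest-match-first WordPiece tokenization. The vocabularies and the legal term `nonrefoulement` below are illustrative assumptions, not taken from the paper: a base vocabulary fragments the domain term into subword pieces, while an extended vocabulary keeps it as a single token, which is what makes domain-adapted embeddings possible without retraining from scratch.

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first WordPiece, as used by BERT tokenizers."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-piece marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # shrink the match and retry
        if piece is None:
            return ["[UNK]"]  # no piece matched at all
        tokens.append(piece)
        start = end
    return tokens


# Illustrative vocabularies (not from the paper).
base_vocab = {"non", "##ref", "##oul", "##ement", "law"}
extended_vocab = base_vocab | {"nonrefoulement"}  # domain term added

print(wordpiece_tokenize("nonrefoulement", base_vocab))
# → ['non', '##ref', '##oul', '##ement']
print(wordpiece_tokenize("nonrefoulement", extended_vocab))
# → ['nonrefoulement']
```

In a real setup one would add such tokens to the pre-trained tokenizer, resize the model's token-embedding matrix accordingly, and then continue masked-language-model pre-training on the small domain corpus.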
Original language: English
Title of host publication: Proceedings of the 6th Workshop on Automated Semantic Analysis of Information in Legal Text co-located with the 19th International Conference on Artificial Intelligence and Law (ICAIL 2023), Braga, Portugal, 23rd September, 2023
Editors: Francesca Lagioia, Jack Mumford, Daphne Odekerken, Hannes Westermann
Publisher: CEUR WS
Pages: 13-18
Number of pages: 6
Volume: 3441
Publication status: Published - 2023
Event: 19th International Conference on Artificial Intelligence and Law - University of Minho, Braga, Portugal
Duration: 19 Jun 2023 - 23 Jun 2023
Conference number: 19
https://icail2023.di.uminho.pt/

Publication series

Name: CEUR Workshop Proceedings
ISSN (Print): 1613-0073

Conference

Conference: 19th International Conference on Artificial Intelligence and Law
Abbreviated title: ICAIL
Country/Territory: Portugal
City: Braga
Period: 19/06/23 - 23/06/23
Internet address: https://icail2023.di.uminho.pt/

Keywords

  • Transformers
  • BERT
  • Language Models
  • Legal Text Classification
  • ECtHR dataset
  • Text Embeddings
