Clustering Monolingual Vocabularies to Improve Cross-Lingual Generalization

Riccardo Bassani, Anders Søgaard

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, Academic, peer-review


Multilingual language models exhibit better performance for some languages than for others (Singh et al., 2019), and many languages do not seem to benefit from multilingual sharing at all, presumably as a result of poor multilingual segmentation (Pyysalo et al., 2020). This work explores the idea of learning multilingual language models based on clustering of monolingual segments. We show significant improvements over standard multilingual segmentation and training across nine languages on a question answering task, both in a small model regime and for a model of the size of BERT-base.
Original language: English
Title of host publication: Proceedings of the Conference on Empirical Methods in Natural Language Processing
Subtitle of host publication: 1st Workshop on Multilingual Representation Learning
Publisher: Association for Computational Linguistics (ACL)
Number of pages: 8
Publication status: Published - Nov 2021

