Abstract
Developing effective spoken language processing systems for low-resource languages poses several challenges due to the lack of parallel data and limited resources for fine-tuning models. In this work, we aim to improve both text classification and translation of Nigerian Pidgin (Naija) by collecting a large-scale parallel English-Pidgin corpus, and we propose a cross-lingual adaptive training framework that includes both continual and task-adaptive training to adapt a base pre-trained model to low-resource languages. Our studies show that English pre-trained language models serve as a stronger prior than multilingual language models on English-Pidgin tasks, with improvements of up to 2.38 BLEU, and demonstrate that augmenting orthographic data and applying task-adaptive training with back-translation can significantly improve model performance.
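The back-translation step used for task-adaptive training can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the checkpoint names (`pidgin-to-en-model`, `en-to-pidgin-model`) and the monolingual Pidgin sentences are hypothetical placeholders, and the sketch assumes the Hugging Face `transformers` library. The idea is to translate monolingual Pidgin into synthetic English with a reverse model, then pair the synthetic English with the original Pidgin as extra training data for the forward model.

```python
# Minimal sketch of back-translation for task-adaptive training.
# NOTE: checkpoint names and the sentences below are hypothetical
# placeholders; this is not the paper's actual implementation.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Reverse model: translates monolingual Pidgin into synthetic English.
reverse_name = "pidgin-to-en-model"  # hypothetical checkpoint
tokenizer = AutoTokenizer.from_pretrained(reverse_name)
reverse_model = AutoModelForSeq2SeqLM.from_pretrained(reverse_name)

# Monolingual Pidgin sentences (placeholder data).
pidgin_sentences = [
    "How you dey?",
    "Wetin dey happen for market today?",
]

# Step 1: back-translate Pidgin -> synthetic English.
inputs = tokenizer(pidgin_sentences, return_tensors="pt", padding=True)
outputs = reverse_model.generate(**inputs, max_new_tokens=64)
synthetic_english = tokenizer.batch_decode(outputs, skip_special_tokens=True)

# Step 2: pair synthetic English (source) with original Pidgin (target).
# These pairs would be mixed with the gold parallel corpus and used to
# fine-tune the forward English->Pidgin model (task-adaptive training).
augmented_pairs = list(zip(synthetic_english, pidgin_sentences))
for src, tgt in augmented_pairs:
    print(f"{src}  ->  {tgt}")
```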
Original language | English
---|---
Title of host publication | Proceedings of the 24th INTERSPEECH Conference
Pages | 3954-3958
Number of pages | 5
Volume | 2023-August
DOIs |
Publication status | Published - Sept 2023
Publication series
Name | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
---|---
ISSN (Print) | 2308-457X
Bibliographical note
Publisher Copyright: © 2023 International Speech Communication Association. All rights reserved.
Funding
This work was supported by the Deutsche Forschungsgemeinschaft, Funder Id: http://dx.doi.org/10.13039/501100001659, Grant Number: SFB1102: Information Density and Linguistic Encoding.
Funders | Funder number
---|---
Deutsche Forschungsgemeinschaft | SFB1102
Keywords
- low-resource language
- low-resource machine translation
- spoken language understanding