Abstract
A recent trend in typology is to create semantic maps (Haspelmath 1997) not from intuitions and examples, but directly from data extracted from multilingual parallel corpora (Wälchli & Cysouw 2012). Our research continues in the same vein, but focuses on the level of grammar rather than the lexical domain. Specifically, we map the PERFECT across five European languages (Dutch, English, French, German, Spanish). We dub our method Translation Mining.
We first extracted present perfects from the EuroParl corpus (Tiedemann 2012) using a methodology that was presented at CLIN26 (van der Klis, Le Bruyn & de Swart 2015). A human annotator (using a web application designed for this purpose) then marked the corresponding verb phrases in the aligned fragments. Tenses of these verb phrases were then automatically or manually assigned, depending on the degree of detail of part-of-speech tags per language.
This process yielded five-tuples of aligned tense attributions. We designed a distance measure over these tuples to build a (dis)similarity matrix, which we then plotted using multidimensional scaling (MDS). On top of that, we created an interactive visualization that allows researchers to manipulate the dimensions of the MDS algorithm and to inspect the individual data points.
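As a rough illustration of the step from five-tuples to a dissimilarity matrix, the sketch below uses a normalized Hamming distance, i.e. the share of languages in which two tuples carry different tense labels. This measure is an assumption made for the example, since the abstract does not specify the distance measure actually used. The tense labels, the sample tuples, and the names `TUPLES`, `tuple_distance`, and `dissimilarity_matrix` are likewise hypothetical.

```python
from itertools import combinations

# Hypothetical annotated data: one tense label per language, in the order
# (Dutch, English, French, German, Spanish). Not taken from the actual corpus.
TUPLES = [
    ("perfect", "perfect", "perfect", "perfect", "perfect"),
    ("perfect", "past", "perfect", "perfect", "past"),
    ("perfect", "past", "perfect", "perfect", "perfect"),
    ("present", "perfect", "present", "present", "present"),
]


def tuple_distance(a, b):
    """Assumed measure: fraction of languages whose tense labels differ."""
    if len(a) != len(b):
        raise ValueError("tuples must cover the same languages")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def dissimilarity_matrix(tuples):
    """Build a symmetric matrix of pairwise distances with a zero diagonal."""
    n = len(tuples)
    matrix = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        matrix[i][j] = matrix[j][i] = tuple_distance(tuples[i], tuples[j])
    return matrix


if __name__ == "__main__":
    for row in dissimilarity_matrix(TUPLES):
        print(" ".join(f"{value:.1f}" for value in row))
```

A matrix of this shape can be passed to any MDS implementation that accepts precomputed dissimilarities. Under the assumed measure, identical tuples are at distance 0 and would collapse onto a single point in the resulting map.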
These interactive maps allowed us to reproduce earlier research (e.g. Portner 2003), but also to draw new conclusions about the tense/aspect role of the PERFECT across languages. We repeated the same method on the OpenSubtitles2016 corpus (Lison & Tiedemann 2016) to check for genre variation.
Original language | English |
---|---|
Publication status | Published - 2017 |
Event | Computational Linguistics in the Netherlands, Faculty of Arts (Erasmushuis), Leuven, Belgium. Duration: 10 Feb 2017 → 10 Feb 2017. Conference number: 27. http://www.ccl.kuleuven.be/CLIN27/ |
Conference
Conference | Computational Linguistics in the Netherlands |
---|---|
Abbreviated title | CLIN |
Country/Territory | Belgium |
City | Leuven |
Period | 10/02/17 → 10/02/17 |
Keywords
- semantic maps
- perfect
- tense-aspect
- multilingual parallel corpora
- multidimensional scaling