A filter for syntactically incomparable parallel sentences

Martin Kroon, Sjef Barbiers, Jan Odijk, Stéfanie van der Pas

Research output: Chapter in Book/Report/Conference proceeding; Chapter; Academic; peer-review

Abstract

Massive automatic comparison of languages in parallel corpora will greatly speed up and enhance comparative syntactic research. Automatically extracting and mining syntactic differences from parallel corpora requires a pre-processing step that filters out sentence pairs that cannot be compared syntactically, for example because they involve “free” translations. In this paper we explore four possible filters: the Damerau-Levenshtein distance between POS-tags, the sentence-length ratio, the graph-edit distance between dependency parses, and a combination of the three in a logistic regression model. Results suggest that the dependency-parse filter is the most stable across language pairs, while the combination filter achieves the best results.
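
As an illustration of two of the filters named in the abstract, the following is a minimal sketch, not the authors' implementation. It computes an edit distance with adjacent transpositions over POS-tag sequences (the optimal-string-alignment variant of Damerau-Levenshtein, chosen here for brevity) and a sentence-length ratio. The function names, the normalisation by the longer sequence, the threshold values, and the simple AND-combination in `keep_pair` are all assumptions made for this example; the paper combines its filters in a logistic regression model instead.

```python
from typing import Sequence


def osa_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Edit distance with adjacent transpositions (optimal string alignment,
    a restricted form of Damerau-Levenshtein) over two tag sequences."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Swap of two adjacent tags counts as a single edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def length_ratio(a: Sequence[str], b: Sequence[str]) -> float:
    """Shorter length divided by longer length; 0.0 if either side is empty."""
    if not a or not b:
        return 0.0
    return min(len(a), len(b)) / max(len(a), len(b))


def keep_pair(
    src_tags: Sequence[str],
    tgt_tags: Sequence[str],
    max_norm_dist: float = 0.5,  # hypothetical threshold, not from the paper
    min_ratio: float = 0.6,      # hypothetical threshold, not from the paper
) -> bool:
    """Keep a sentence pair only if both illustrative criteria are met."""
    longest = max(len(src_tags), len(tgt_tags))
    if longest == 0:
        return False
    norm_dist = osa_distance(src_tags, tgt_tags) / longest
    return norm_dist <= max_norm_dist and length_ratio(src_tags, tgt_tags) >= min_ratio
```

In practice the thresholds would have to be tuned per language pair on annotated data, and the tag sequences would come from whichever POS tagger is applied to both sides of the corpus.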
Original language: English
Title of host publication: Linguistics in the Netherlands 2019
Editors: Janine Berns, Elena Tribushinina
Place of Publication: Amsterdam
Publisher: John Benjamins
Pages: 147-161
Number of pages: 15
Publication status: Published - Dec 2019

Publication series

Name: AVT Publications
Publisher: John Benjamins
