Identification of rare & novel senses using translations in a parallel corpus

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Academic › peer-review

Abstract

The identification of rare and novel senses is a challenge in lexicography. In this paper, we present a new method for finding such senses using a word-aligned multilingual parallel corpus. We use the Europarl corpus and concentrate on French verbs. We represent each occurrence of a French verb as a high-dimensional term vector whose dimensions are the possible translations of the verb according to the underlying word alignment. Each dimension is weighted to reflect the significance of the corresponding translation. After collecting these vectors, we apply variants of the K-means algorithm to the resulting vector space to produce clusters of distinct senses, so that standard uses produce large homogeneous clusters while rare and novel uses appear in small or heterogeneous clusters. We show in a qualitative and quantitative evaluation that the method can successfully find rare and novel senses.
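
The pipeline outlined in the abstract (translation vectors, weighting, clustering, inspection of small clusters) can be illustrated with a minimal sketch. Everything specific here is an assumption rather than the paper's actual setup: the toy occurrences and their `lang:word` translation labels are invented, the weighting is a smoothed inverse-document-frequency formula standing in for the unspecified weighting scheme, the clustering is plain Lloyd-style K-means with a deterministic farthest-point initialisation, and only the "small cluster" criterion is shown (cluster heterogeneity is not measured).

```python
import math
from collections import Counter

# Invented toy data: each occurrence of one French verb is listed with the
# translations it was aligned to (hypothetical "lang:word" labels).
occurrences = [
    ["en:adopt", "de:annehmen"],
    ["en:adopt", "de:annehmen"],
    ["en:adopt", "de:annehmen"],
    ["en:adopt", "de:verabschieden"],
    ["en:pass", "de:verabschieden"],
    ["en:foster", "de:pflegen"],  # intended as the rare use
]


def build_vectors(occs):
    """Map each occurrence to a weighted vector over all observed translations."""
    dims = sorted({t for occ in occs for t in occ})
    doc_freq = Counter(t for occ in occs for t in set(occ))
    n = len(occs)
    # Assumed weighting: smoothed IDF, so frequent translations count less.
    weight = {t: math.log((1 + n) / (1 + doc_freq[t])) + 1.0 for t in dims}
    vectors = [[weight[t] if t in occ else 0.0 for t in dims] for occ in occs]
    return dims, vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, max_iter=50):
    """Plain K-means with farthest-point initialisation (deterministic)."""
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        far = max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        centroids.append(list(far))
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: sq_dist(v, centroids[j])) for v in vectors
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


dims, vectors = build_vectors(occurrences)
labels = kmeans(vectors, k=2)
sizes = Counter(labels)

# Occurrences in clusters below a (hypothetical) size threshold are
# candidates for rare or novel senses.
MIN_CLUSTER_SIZE = 2
rare_candidates = [i for i, lab in enumerate(labels) if sizes[lab] < MIN_CLUSTER_SIZE]
print(rare_candidates)
```

On realistic data the number of clusters and the size threshold would need tuning per verb, and the flagged occurrences would still require manual inspection by a lexicographer.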

Original language: English
Title of host publication: Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
Editors: Daniel Tapias, Irene Russo, Olivier Hamon, Stelios Piperidis, Nicoletta Calzolari, Khalid Choukri, Joseph Mariani, Helene Mazo, Bente Maegaard, Jan Odijk, Mike Rosner
Publisher: European Language Resources Association (ELRA)
Pages: 2249-2252
Number of pages: 4
ISBN (Electronic): 2951740867, 9782951740860
Publication status: Published - 2010
Externally published: Yes
Event: 7th International Conference on Language Resources and Evaluation, LREC 2010 - Valletta, Malta
Duration: 17 May 2010 to 23 May 2010

Conference

Conference: 7th International Conference on Language Resources and Evaluation, LREC 2010
Country/Territory: Malta
City: Valletta
Period: 17/05/10 to 23/05/10
