Abstract
In the domain of Dutch cultural heritage, various data sets describe different aspects of life during the Dutch Golden Age. These data sets, in the form of RDF graphs, use different standards and contain noise in the values of literal nodes, such as misspelled names and uncertain dates. The Golden Agents project aims to answer queries about the Dutch Golden Age using these distributed and independently maintained data sets. One of the many problems in this project is the identification of persons who occur in multiple data sets under different URIs. This paper aims to solve this specific problem by generating a linkset, i.e. a set of pairs of URIs that are judged to represent the same person. We use domain knowledge in applying an existing node context generation algorithm, whose output serves as input for GloVe, an algorithm originally designed for embedding words. The resulting embedding is then used to train a classifier on pairs of URIs that are known duplicates and non-duplicates. Using just the cosine similarity between URI pairs in embedding space for prediction, we obtain a simple classifier with an F12-score of around 0.85, even when very few training examples are provided. On larger training sets, more complex classifiers are shown to reach an F12-score of up to 0.88.
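The simplest classifier mentioned in the abstract, a decision based only on the cosine similarity between two URIs in embedding space, can be illustrated roughly as follows. This is a minimal sketch and not the authors' implementation: the URIs and three-dimensional vectors are made up, the helper names are hypothetical, and choosing the threshold by training accuracy is an assumption made here for brevity (the paper reports an F-score instead).

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def fit_threshold(sims, labels):
    """Pick the similarity cut-off that maximises accuracy on the training pairs.

    Assumption: accuracy is used only to keep the sketch short.
    """
    sims = np.asarray(sims, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    candidates = np.unique(sims)
    accuracies = [np.mean((sims >= t) == labels) for t in candidates]
    return float(candidates[int(np.argmax(accuracies))])

def predict_duplicate(embedding, uri_a, uri_b, threshold):
    """Judge two URIs to denote the same person if their vectors are similar enough."""
    return cosine_similarity(embedding[uri_a], embedding[uri_b]) >= threshold

# Toy embedding: hypothetical URIs with invented vectors.
embedding = {
    "ex:a1": np.array([1.0, 0.1, 0.0]),
    "ex:b1": np.array([0.9, 0.2, 0.0]),
    "ex:a2": np.array([0.0, 1.0, 0.2]),
    "ex:b2": np.array([0.1, 0.9, 0.3]),
}

# Known duplicate (True) and non-duplicate (False) pairs used for training.
train_pairs = [
    ("ex:a1", "ex:b1", True),
    ("ex:a2", "ex:b2", True),
    ("ex:a1", "ex:a2", False),
    ("ex:b1", "ex:b2", False),
]

sims = [cosine_similarity(embedding[a], embedding[b]) for a, b, _ in train_pairs]
labels = [is_dup for _, _, is_dup in train_pairs]
threshold = fit_threshold(sims, labels)

# The predicted linkset is the set of candidate pairs judged to be duplicates.
candidates = [("ex:a1", "ex:b1"), ("ex:a1", "ex:b2")]
linkset = [(a, b) for a, b in candidates
           if predict_duplicate(embedding, a, b, threshold)]
```

Because such a classifier has a single parameter, it is plausible that it can be fitted from only a handful of labelled pairs, which matches the abstract's remark about small training sets.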
Original language | English |
---|---|
Pages | 125--133 |
Number of pages | 9 |
Publication status | Published - 30 Nov 2020 |
Event | International Conference on Information Integration and Web-based Applications & Services, online due to Covid-19. Duration: 30 Nov 2020 → 2 Dec 2020. Conference number: 22. http://www.iiwas.org/conferences/iiwas2020/ |
Conference
Conference | International Conference on Information Integration and Web-based Applications & Services |
---|---|
Abbreviated title | iiWAS 2020 |
Period | 30/11/20 → 2/12/20 |
Keywords
- RDF
- Cultural Heritage
- Entity Alignment
- Embedding