TY - GEN
T1 - Finding Synonymous Attributes in Evolving Wikipedia Infoboxes
AU - Sottovia, Paolo
AU - Paganelli, Matteo
AU - Guerra, Francesco
AU - Velegrakis, Yannis
PY - 2019/1/1
Y1 - 2019/1/1
N2 - Wikipedia Infoboxes are semi-structured data structures organized in an attribute-value fashion. Policies establish for each type of entity represented in Wikipedia the attribute names that the Infobox should contain in the form of a template. However, these requirements change over time and often users choose not to strictly obey them. As a result, it is hard to treat the history of the Wikipedia pages in an integrated way, making it difficult to analyze the temporal evolution of Wikipedia entities through their Infobox and impossible to perform direct comparison of entities of the same type. To address this challenge, we propose an approach to deal with the misalignment of the attribute names and identify clusters of synonymous Infobox attributes. Elements in the same cluster are considered as a temporal evolution of the same attribute. To identify the clusters we use two different distance metrics. The first is the co-occurrence degree, which is treated as a negative distance, and the second is the co-occurrence of similar values in the attributes, which is treated as positive evidence of synonymy. We formalize the problem as a correlation clustering problem over a weighted graph constructed with attributes as nodes and positive and negative evidence as edges. We solve it with a linear programming model that shows a good approximation. Our experiments over a collection of Infoboxes of the last 13 years show the potential of our approach.
AB - Wikipedia Infoboxes are semi-structured data structures organized in an attribute-value fashion. Policies establish for each type of entity represented in Wikipedia the attribute names that the Infobox should contain in the form of a template. However, these requirements change over time and often users choose not to strictly obey them. As a result, it is hard to treat the history of the Wikipedia pages in an integrated way, making it difficult to analyze the temporal evolution of Wikipedia entities through their Infobox and impossible to perform direct comparison of entities of the same type. To address this challenge, we propose an approach to deal with the misalignment of the attribute names and identify clusters of synonymous Infobox attributes. Elements in the same cluster are considered as a temporal evolution of the same attribute. To identify the clusters we use two different distance metrics. The first is the co-occurrence degree, which is treated as a negative distance, and the second is the co-occurrence of similar values in the attributes, which is treated as positive evidence of synonymy. We formalize the problem as a correlation clustering problem over a weighted graph constructed with attributes as nodes and positive and negative evidence as edges. We solve it with a linear programming model that shows a good approximation. Our experiments over a collection of Infoboxes of the last 13 years show the potential of our approach.
KW - Evolving data
KW - Temporal schema matching
KW - Wikipedia
UR - http://www.scopus.com/inward/record.url?scp=85072840003&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-28730-6_11
DO - 10.1007/978-3-030-28730-6_11
M3 - Conference contribution
AN - SCOPUS:85072840003
SN - 9783030287290
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 169
EP - 185
BT - Advances in Databases and Information Systems
A2 - Welzer, Tatjana
A2 - Eder, Johann
A2 - Podgorelec, Vili
A2 - Kamišalić Latifić, Aida
PB - Springer
CY - Cham
T2 - 23rd European Conference on Advances in Databases and Information Systems, ADBIS 2019
Y2 - 8 September 2019 through 11 September 2019
ER -