Links in Large Integrated Knowledge Graphs: Analysis, Refinement, and Domain Applications

Research output: Thesis › Doctoral thesis 4 (Research NOT UU / Graduation NOT UU)

Abstract

This thesis focuses on a specific knowledge representation format known as knowledge graphs, where nodes represent entities and edges denote relations. Integrating knowledge graphs can result in richer resources, but it can also lead to undesirable structures and even logical inconsistencies. Refinement methods that detect and correct such issues are therefore essential. Scale matters: problems that are easy for small knowledge graphs can become significantly more challenging at scale. Addressing these challenges requires data analysis, algorithm development, and rigorous evaluation.

This thesis investigates key issues in large, integrated knowledge graphs, such as identity, error sources, and knowledge evolution. The tools used for analysis and refinement draw on graph theory and automated reasoning, among other techniques.

Transitive relations are ubiquitous in knowledge graphs; examples include class subsumption, part-whole hierarchies, and concept specification. However, as a result of integration, transitivity can propagate small errors far beyond their local contexts. We extend our investigation to relations that are intended to be both transitive and antisymmetric, even if they are not formally defined as such. We refer to these as pseudo-transitive relations. Chapter 2 introduces benchmarks comprising several graphs of transitive and pseudo-transitive relations, complete with hand-labeled gold standards and baseline methods. We propose new analytical measures and introduce an algorithm for refining knowledge graphs with such relations; the algorithm takes advantage of graph structures. Traditionally, repeated statements are treated as logically equivalent and are discarded during integration. However, it is possible to track how many source graphs assert each statement and to interpret these counts as weights.
Building on the intuition that statements supported by more sources are more likely to be correct, we extend our algorithm with a weighting scheme that heuristically identifies and removes edges to achieve acyclicity.

A special case of transitive relations is the identity relation, which asserts that two entities refer to the same concept. The subgraph of these assertions is known as the identity graph. Chapter 3 focuses on refining such graphs. Determining the correct representation of a concept can be challenging, especially when the concept is modeled as a cluster of interlinked entities. Errors here may result in falsely merged clusters of unrelated entities. Typically, we assume that each dataset represents each concept with a single entity; this is known as the Unique Name Assumption (UNA). In practice, however, this assumption often fails. Identity assertions frequently involve entities representing different versions, languages, or encodings. To account for this in large integrated knowledge graphs, we define a relaxed assumption called internal UNA (iUNA). Based on this notion, we develop a new algorithm for detecting and eliminating erroneous identity statements.

In Chapter 4, we study the evolution and dynamics of knowledge graphs by analyzing entity redirections and the chains they form. We classify different redirection scenarios and estimate the proportion of redirects that can be interpreted as identity links. Additionally, we analyze the statistical and graph-theoretic properties of redirection graphs.

Chapter 5 turns to a domain-specific application. We select and integrate multiple knowledge graphs from the domains of economics, finance, and banking. Through statistical and graph-theoretic analysis, we demonstrate how integration yields entities with richer, more complete information. The quality of the integrated graph is evaluated by analyzing subgraphs formed by identity and (pseudo-)transitive relations.
We also study the sources of errors and explore methods for their refinement, highlighting the benefits of our integration approach.

Chapter 6 explores another domain-specific application, focusing on LGBTQ+ entities and relations. We construct a knowledge graph about identity-related relations in this domain and analyze its properties. We show how challenges such as multilingualism, conceptual drift, and linguistic ambiguity significantly increase complexity, amplifying the issues observed earlier. We demonstrate how our knowledge graph can be used to address these problems.
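The idea of using source counts as weights to remove edges until a (pseudo-)transitive relation becomes acyclic can be illustrated with a small sketch. This is not the algorithm from the thesis, whose details the abstract does not give. It is a simple greedy heuristic written under stated assumptions: edges are `(source, target)` pairs, each weight is the number of source graphs asserting that edge, and the helper names `find_cycle` and `break_cycles` are hypothetical.

```python
def find_cycle(edges):
    """Return one directed cycle as a list of edges, or None if acyclic."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, []).append(v)

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    color = {}

    def dfs(node, path):
        # Recursive depth-first search; fine for a sketch, but a large
        # graph would need an iterative version to avoid deep recursion.
        color[node] = GREY
        path.append(node)
        for nxt in graph.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GREY:
                # Back edge: the cycle is the path from nxt back to nxt.
                nodes = path[path.index(nxt):] + [nxt]
                return list(zip(nodes, nodes[1:]))
            if state == WHITE:
                found = dfs(nxt, path)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in graph:
        if color.get(node, WHITE) == WHITE:
            found = dfs(node, [])
            if found:
                return found
    return None


def break_cycles(weights):
    """Greedily drop the least-supported edge of each cycle found.

    weights maps (source, target) to the number of source graphs
    asserting that edge (an assumption made for this illustration).
    Returns the kept edges with their weights and the removed edges.
    """
    kept = dict(weights)
    removed = []
    while True:
        cycle = find_cycle(kept)
        if cycle is None:
            return kept, removed
        # Ties are broken by edge name so the result is deterministic.
        weakest = min(cycle, key=lambda e: (kept[e], e))
        removed.append(weakest)
        del kept[weakest]


if __name__ == "__main__":
    subclass_of = {
        ("cat", "mammal"): 3,
        ("mammal", "animal"): 4,
        ("animal", "cat"): 1,   # asserted by a single source
        ("dog", "mammal"): 2,
    }
    kept, removed = break_cycles(subclass_of)
    print(removed)
```

In this toy input, the only cycle runs through `cat`, `mammal`, and `animal`, and the edge asserted by a single source is the one removed. A greedy choice like this does not guarantee that the total removed weight is minimal; it only shows how support counts can guide which edge to drop.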
Original language: English
Awarding Institution
  • Vrije Universiteit Amsterdam
Supervisors/Advisors
  • van Harmelen, Frank, Supervisor, External person
  • Bloem, Peter, Co-supervisor, External person
  • Raad, Joe, Co-supervisor, External person
Award date: 17 Dec 2025
Place of Publication: Amsterdam
Publisher
Print ISBNs: 9789464739848
DOIs
Publication status: Published - 17 Dec 2025
Externally published: Yes

Keywords

  • Knowledge graph
  • semantic web
  • integrated knowledge graphs
  • identity graphs
  • artificial intelligence
  • LGBTQ+
