Adaptive Graphical Approach to Entity Resolution.

Appeared in ACM IEEE Joint Conference on Digital Libraries 2007.

Zhaoqi Chen, Dmitri V. Kalashnikov, and Sharad Mehrotra

Computer Science Department
University of California, Irvine

Entity resolution is a very common Information Quality (IQ) problem with many different applications. In digital libraries, it is related to the problems of citation matching and author name disambiguation; in Natural Language Processing, to co-reference matching and object identity; in Web applications, to Web page disambiguation. The problem arises because objects/entities in real-world datasets are often referred to by descriptions that might not be unique identifiers of these entities, which leads to ambiguity. The goal is to group together all the entity descriptions that refer to the same real-world entity. In this paper we present a graphical approach to entity resolution. It complements the traditional methodology with an analysis of the entity-relationship graph constructed for the dataset being analyzed. The paper demonstrates that a technique that measures the degree of interconnectedness between pairs of nodes in the graph can significantly improve the quality of entity resolution. Furthermore, the paper presents an algorithm that makes this technique self-adaptive to the underlying data, thus minimizing the required participation of the domain analyst and potentially further improving the disambiguation quality.
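
To make the idea of interconnectedness concrete, the following is a minimal sketch and not the paper's actual measure. It assumes a toy entity-relationship graph of papers and authors, and scores each candidate entity by summing a decayed weight over the short simple paths that link it to the context of an ambiguous reference. The names build_graph, connection_strength, and resolve, the decay factor, and the path-length limit are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def build_graph(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    graph = defaultdict(set)
    for u, v in edges:
        graph[u].add(v)
        graph[v].add(u)
    return graph


def connection_strength(graph, src, dst, max_len=4, decay=0.5):
    """Sum decay**k over all simple paths of k <= max_len edges from src to dst.

    Illustrative stand-in for an interconnectedness measure (an assumption,
    not the measure defined in the paper).
    """
    total = 0.0
    stack = [(src, frozenset([src]), 0)]
    while stack:
        node, visited, length = stack.pop()
        if node == dst:
            total += decay ** length
            continue
        if length == max_len:
            continue
        for nxt in graph.get(node, ()):
            if nxt not in visited:
                stack.append((nxt, visited | {nxt}, length + 1))
    return total


def resolve(graph, context_node, candidates, **kwargs):
    """Pick the candidate most strongly connected to the context node."""
    scores = {
        c: connection_strength(graph, context_node, c, **kwargs)
        for c in candidates
    }
    return max(scores, key=scores.get), scores


# Toy data: paper P1 lists an ambiguous author "J. Smith" next to Alice.
# Alice also co-authored P2 with John Smith; Jane Smith has no link to P1.
edges = [
    ("paper:P1", "author:Alice"),
    ("paper:P2", "author:Alice"),
    ("paper:P2", "author:John Smith"),
    ("paper:P3", "author:Jane Smith"),
    ("paper:P3", "author:Bob"),
]
graph = build_graph(edges)
best, scores = resolve(
    graph, "paper:P1", ["author:John Smith", "author:Jane Smith"]
)
print(best, scores)
```

In this toy graph the only path from P1 to a candidate runs through the shared co-author Alice, so John Smith is preferred over Jane Smith. A self-adaptive variant, as the abstract describes, would fit parameters such as the per-path weights to the data rather than fixing them by hand.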

Categories and Subject Descriptors:

H.2.m [Database Management]: Miscellaneous - Data cleaning;
H.2.8 [Database Management]: Database Applications - Data mining;
H.2.5 [Information Systems]: Heterogeneous Databases;
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval


RelDC, relationship-based data cleaning, object consolidation, record linkage, data mining

