Information Systems Group (Lehrgebiet Informationssysteme), Department of Computer Science (FB Informatik)
Entity Identification in XML Documents

Leonardo Ribeiro
Kaiserslautern University of Technology
Dept. of Computer Science (AG DBIS)
P.O. Box 3049, 67653 Kaiserslautern, Germany
e-mail: aguiar@informatik.uni-kl.de

Theo Härder
Kaiserslautern University of Technology
Dept. of Computer Science (AG DBIS)
P.O. Box 3049, 67653 Kaiserslautern, Germany
e-mail: haerder@informatik.uni-kl.de

Full paper (PDF version)

Abstract

As a natural result of the dissemination of a large variety of XML databases, the well-known problem of data integration must be faced from the XML viewpoint. One of the basic functions of an integration system is record linkage: the task of comparing records to determine those that are represented differently but refer to the same entity. Because of its intrinsically high computational cost, the majority of approaches to record linkage are based on off-line procedures. Such approaches, however, only meet the requirements of data integration architectures that materialize the data, such as data warehouses. Recent approaches based on approximate joins aim to enable duplicate identification in on-line procedures with reasonable results. In this paper, we proceed along this research direction and outline our current ideas on how to account for the specific characteristics of XML documents.