Lehrgebiet Informationssysteme

Improving Content-Oriented XML Retrieval by Applying Structural Patterns

Philipp Dopichaj

Fachbereich Informatik
Technische Universität Kaiserslautern
Gottlieb-Daimler-Straße
D-67663 Kaiserslautern
dopichaj@informatik.uni-kl.de

Full paper

Abstract:

XML is well suited to storing (mostly) textual documents in a knowledge management system: its flexibility lets users store both highly structured data and free text in the same document. For knowledge management, it is important to be able to search the free-text parts effectively; users need to find the information that helps them solve their problem without wading through large amounts of irrelevant material. Content-oriented XML retrieval addresses this challenge. In contrast to traditional information retrieval, documents are not treated as atomic units; instead, individual elements such as sections or paragraphs can be returned. One implication is that results can overlap (for example, a paragraph and the section that contains it). Although overlapping results are undesirable in the final result presented to the user, they can help to improve its quality. We take advantage of overlaps by applying patterns to small subtrees of the retrieval result (result contexts); each matching pattern adjusts the retrieval status values of the nodes involved in order to promote the best results. We demonstrate on the INEX 2005 test collection that this postprocessing can lead to a significant improvement in retrieval quality.
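
The general idea of using overlap before removing it can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the representation of results as a mapping from element paths to retrieval status values, the single hypothetical pattern (promote an element when at least `min_hits` of its descendants were also retrieved), the `boost` factor, and the greedy overlap removal.

```python
# Illustrative sketch only: the pattern, the boost factor, and the greedy
# overlap removal are assumptions, not the patterns evaluated in the paper.

def is_ancestor(a: str, b: str) -> bool:
    """True if the element at path a is a proper ancestor of the one at path b."""
    return b.startswith(a + "/")


def apply_pattern(results: dict, min_hits: int = 2, boost: float = 1.5) -> dict:
    """Hypothetical pattern: promote an element if several descendants were retrieved too."""
    adjusted = dict(results)
    for path in results:
        hits = sum(1 for other in results if is_ancestor(path, other))
        if hits >= min_hits:
            adjusted[path] = results[path] * boost
    return adjusted


def remove_overlap(results: dict) -> list:
    """Greedily keep the highest-scoring elements that do not overlap a kept element."""
    kept = []
    ranked = sorted(results.items(), key=lambda kv: kv[1], reverse=True)
    for path, rsv in ranked:
        overlaps = any(is_ancestor(path, k) or is_ancestor(k, path) for k, _ in kept)
        if not overlaps:
            kept.append((path, rsv))
    return kept


# Made-up retrieval result: element path -> retrieval status value.
raw = {
    "/article[1]/sec[1]": 0.5,
    "/article[1]/sec[1]/p[1]": 0.6,
    "/article[1]/sec[1]/p[2]": 0.55,
    "/article[1]/sec[2]/p[1]": 0.4,
}

without_pattern = remove_overlap(raw)
with_pattern = remove_overlap(apply_pattern(raw))
```

In this made-up example, removing overlap directly returns the two paragraphs of the first section as separate results, whereas applying the hypothetical pattern first promotes the enclosing section, so the section is returned in their place.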

Proc. ICEIS 2007, Funchal, June 2007.