XML Retrieval

Introduction

XML is a popular format for storing all kinds of data, from database-like records to textual documents. Of particular interest to us are large collections of long texts, for example in digital libraries [Dop06]. When thousands of books are available in electronic form, good search support can provide significant advantages over traditional (paper-based) libraries. As web search engines demonstrate, it is feasible to search vast collections of text in reasonable time.

Users of digital libraries, however, can (and should) expect more than users of web search engines: a book is simply too long to be a suitable retrieval result. Even if users know that the information they are looking for is somewhere in a 300-page book, they still have to find the most relevant passage in it. This task should be delegated to the search engine as far as possible, and the semistructured XML format supports this well.

The aim of our project is to develop an XML retrieval engine that finds not only the most relevant documents but also the most relevant parts of these documents. If a single section satisfies the user's information need, that section should be returned rather than the complete book.

Contact

Philipp Dopichaj

INEX

The Initiative for the Evaluation of XML Retrieval (INEX) provides a testbed for evaluating the effectiveness of XML retrieval methods. We participated and submitted retrieval runs in 2005, 2006, and 2007. Furthermore, we provided relevance assessments in 2004.

Diploma Theses

Publications

2008

Dopichaj, P.:
Content-oriented retrieval on document-centric XML
PhD thesis, January 2008

2007

Dopichaj, P.:
The Simplest XML Retrieval Baseline That Could Possibly Work
to appear in: Focused Access to XML Documents (Proc. INEX 2007)

Dopichaj, P.:
Improving Content-Oriented XML Retrieval by Applying Structural Patterns
in: Proc. ICEIS, Funchal, June 2007.

Dopichaj, P.:
The University of Kaiserslautern at INEX 2006
in: Comparative Evaluation of XML Information Retrieval Systems, 5th International Workshop of the Initiative for the Evaluation of XML Retrieval, INEX 2006, pp. 223-232

Dopichaj, P.:
Improving Content-Oriented XML Retrieval by Exploiting Small Elements
in: Proc. Workshops 24th BNCOD (BNCODwebim), Glasgow, July 2007

Dopichaj, P.:
Space-efficient Indexing of XML Documents for Content-Only Retrieval
accepted for Datenbank-Spektrum 23 (November 2007)

2006

Dopichaj, P.:
Element Retrieval in Digital Libraries: Reality Check
in: Proc. SIGIR 2006 Workshop on XML Element Retrieval Methodology

Dopichaj, P.:
The University of Kaiserslautern at INEX 2005
in: Proc. INEX 2005, Dagstuhl, November 2005.

2005

Dopichaj, P.:
Element Relationship: Exploiting Inline Markup for Better XML Retrieval
in: Proc. BTW, Karlsruhe, March 2005.

2004

Dopichaj, P.:
Exploiting Background Knowledge for Better Similarity Calculation in XML Retrieval
in: Proc. of the 21st Annual British National Conference on Databases Volume 2 (Doctoral Consortium), Edinburgh, July 2004.

Dopichaj, P., Härder, T.:
Conflation Methods and Spelling Mistakes – A Sensitivity Analysis in Information Retrieval
in: Proc. 16. Workshop "Grundlagen von Datenbanken", Monheim, Germany, June 2004, pp. 48-52.

2002

Schmitt, S., Dopichaj, P., Domínguez-Marín, P.:
Entropy-Based vs. Similarity-Influenced: Attribute Selection Methods for Dialogs Tested on Different Electronic Commerce Domains
in: Advances in Case-Based Reasoning, 6th European Conference, ECCBR 2002 Aberdeen, Scotland, UK, September 4-7, 2002, Proceedings, pp. 380-394