IS Project

Description

In this new project, which was offered for the first time in the winter semester 2014/15, the task is to design and implement a full-fledged Web search engine.

The project will be offered again in the winter semester 2015/16. The website of Prof. Michel's group, which hosts this project, has moved to dbis.informatik.uni-kl.de.

Registration

  • This project is offered in the winter semester 2014/15.
  • The number of participants is limited.
  • The application deadline was September 30, 2014.

News

Date

Announcement

Jan 21, 2015

Note on using external libraries: For encoding/decoding the result in JSON, you can use a Java library such as JSON.simple.

Jan 19, 2015

Reminder: Please don't forget to send us the UDFs that you have created in PostgreSQL and the PDF document containing a table with the requested measures, in addition to the revision number of your code and the URLs of the JSON and HTML interfaces.

Jan 5, 2015

Note on deploying your web search engine: The virtual machines are accessible only through SSH (port 22). Access to all other ports is restricted, even from within the university network. You can use port forwarding to test whether you installed and configured your Apache Tomcat server correctly (e.g. ssh -L 8000:localhost:8080 project@IP, which forwards port 8000 on your machine to port 8080 on the virtual machine).

Nov 17, 2014

Note on using external libraries: You can use a library for JDBC connection pooling (such as Apache DBCP, C3P0, or BoneCP).

Nov 14, 2014

Warning on using url-normalization: Be careful when using getNormalizedUrl() from the sentric library. Among other things, it removes the "www" prefix from URLs, which can lead to problems when crawling. We suggest using the getRepairedUrl() method instead of getNormalizedUrl() for this purpose.
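To illustrate why dropping the "www" prefix is risky for a crawler, here is a minimal sketch of a conservative clean-up step that only applies changes that cannot alter the target host. It uses java.net.URI from the standard library only and does not reproduce the behaviour of the sentric library's getRepairedUrl(); the class name ConservativeUrlCleaner and the method clean are hypothetical names chosen for this example.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical helper for illustration; not part of the sentric library.
final class ConservativeUrlCleaner {

    /** Cleans a URL without touching the host name, so "www." is preserved. */
    static String clean(String rawUrl) {
        // normalize() resolves "." and ".." path segments.
        URI uri = URI.create(rawUrl.trim()).normalize();
        if (uri.getScheme() == null || uri.getHost() == null) {
            throw new IllegalArgumentException("Not an absolute URL: " + rawUrl);
        }
        // Scheme and host are case-insensitive, so lowercasing them is safe.
        String scheme = uri.getScheme().toLowerCase(Locale.ROOT);
        String host = uri.getHost().toLowerCase(Locale.ROOT);

        // Drop the port only if it is the default for the scheme.
        int port = uri.getPort();
        boolean defaultPort = ("http".equals(scheme) && port == 80)
                || ("https".equals(scheme) && port == 443);

        StringBuilder sb = new StringBuilder();
        sb.append(scheme).append("://").append(host);
        if (port != -1 && !defaultPort) {
            sb.append(':').append(port);
        }
        // An empty path and "/" refer to the same resource.
        String path = uri.getRawPath();
        sb.append(path == null || path.isEmpty() ? "/" : path);
        if (uri.getRawQuery() != null) {
            sb.append('?').append(uri.getRawQuery());
        }
        // The fragment is dropped on purpose: it is never sent to the server.
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints "http://www.example.com/b?q=1"
        System.out.println(clean("HTTP://WWW.Example.com:80/a/../b?q=1#top"));
    }
}
```

Because www.example.com and example.com may resolve to different servers, or one of them may not resolve at all, the sketch leaves the host name untouched apart from lowercasing it.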

Nov 12, 2014

Note on using external libraries: You can use the following library for URL normalization:

https://github.com/sentric/url-normalization

Nov 04, 2014

Note on sending emails to tutors: As mentioned in the kickoff meeting, we are happy to help clarify any questions that you might have. Please send your specific questions to your tutor and cc all other tutors (incl. Prof. Michel). This allows us to collect problematic issues in the assignment sheets that might be of interest to your fellow students. If an issue cannot be resolved by email, you are welcome to drop by after making an appointment.

Oct 29, 2014

The slides from the Kickoff Meeting are available. See here.

Oct 10, 2014

Place and time of Kickoff Event

 

Place: 36/336

First Meeting (Kickoff): 13:00, Oct. 29th, Wednesday

 

Content

In this project, a Web Search Engine is to be developed. The core tasks are roughly the following:

  • Implement an HTML Parser.
  • Design and Implement a Web Crawler.
  • Design the required database schema to store the contents of visited pages and the link structure.
  • Write an SQL-based query processor to execute Google-style keyword queries.
  • Devise/Create index structures to accelerate the querying performance.
  • Implement alternative query processors using threshold algorithms.
  • Realize alternative methods to compute the score of how well a document matches the query.
  • For this, implement Google's PageRank algorithm and integrate it into the scoring model.
  • Implement an HTML-based user interface and a Web service.
  • Use the Web services of your fellow students to realize a meta search engine.
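As a rough orientation for the PageRank task listed above, the following is a minimal sketch of the standard power-iteration formulation on an in-memory link graph. It is not the reference solution for the exercise sheets. The class name PageRankSketch, the adjacency-array input format, and the handling of pages without outgoing links (their rank is spread evenly over all pages) are assumptions made for this example; in the project, the link structure would come from the database.

```java
import java.util.Arrays;

// Hypothetical sketch for illustration; names and input format are assumptions.
final class PageRankSketch {

    /**
     * Computes PageRank by power iteration.
     *
     * @param links      links[i] holds the indices of the pages that page i links to
     * @param damping    damping factor, commonly set to 0.85
     * @param iterations number of power-iteration steps
     * @return one rank per page; the ranks sum to 1
     */
    static double[] pageRank(int[][] links, double damping, int iterations) {
        int n = links.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n); // start from the uniform distribution

        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            double danglingMass = 0.0;
            for (int i = 0; i < n; i++) {
                if (links[i].length == 0) {
                    // Page without out-links: collect its rank for even redistribution.
                    danglingMass += rank[i];
                } else {
                    // Each out-link passes on an equal share of the page's rank.
                    double share = rank[i] / links[i].length;
                    for (int target : links[i]) {
                        next[target] += share;
                    }
                }
            }
            // Random-jump term plus the evenly spread rank of dangling pages.
            double base = (1.0 - damping) / n + damping * danglingMass / n;
            for (int i = 0; i < n; i++) {
                next[i] = base + damping * next[i];
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // Pages 0 and 1 link to page 2, page 2 links back to page 0, page 3 has no links.
        int[][] links = {{2}, {2}, {0}, {}};
        System.out.println(Arrays.toString(pageRank(links, 0.85, 50)));
    }
}
```

A fixed number of iterations keeps the sketch short; a convergence check on the change between two successive rank vectors is a common alternative. The resulting rank can then be combined with a query-dependent score, for example as a weighted sum, when integrating it into the scoring model.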

Kickoff Meeting

  • A kickoff meeting will take place on Wednesday, October 29th, at 13:00 in room 36/336.
  • In this meeting, we will discuss organizational aspects, provide pointers to reference material/literature, and hand out and discuss the first exercise sheet.
  • Participation in this meeting is mandatory.

Literature

We will introduce the main concepts of the required techniques/tools when handing out the individual exercise sheets. In addition, the following are standard books for databases and information retrieval you might want to consult. We will also give specific pointers to Web sources during the semester.

 

  • Introduction to Information Retrieval, by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, 2008.
  • Information Retrieval: Implementing and Evaluating Search Engines, by Stefan Büttcher, Charles L. A. Clarke, and Gordon V. Cormack.
  • Datenbanksysteme: Eine Einführung (German), by Alfons Kemper and André Eickler.
  • Database Management Systems, by Raghu Ramakrishnan and Johannes Gehrke.

Contact