IS Project
Description
In this new project, which was offered for the first time in the winter semester 2014/15, the task is to design and implement a full-fledged Web search engine.
The project will be offered again in the winter semester 2015/16. The website of Prof. Michel's group, which hosts this project, has moved to dbis.informatik.uni-kl.de
Registration
- This project is offered in the winter semester 2014/15.
- The number of participants is limited.
- The application deadline was September 30, 2014.
News
Date | Announcement |
---|---|
Jan 21, 2015 | Note on using external libraries: For encoding/decoding the result in JSON, you can use a Java library such as JSON.simple. |
Jan 19, 2015 | Reminder: Please don't forget to send us the UDFs that you have created in PostgreSQL and the PDF document containing a table with the requested measures, in addition to the revision number of your code and the URLs of the JSON and HTML interfaces. |
Jan 5, 2015 | Note on deploying your web search engine: The virtual machines are accessible only through ssh (port 22). Access to all other ports is restricted, even from the university network. You can use port forwarding to test whether you installed and configured your Apache Tomcat server correctly (e.g. ssh -L 8000:localhost:8080 project@IP). |
Nov 17, 2014 | Note on using external libraries: You can use a library for JDBC connection pooling (such as Apache DBCP, C3P0, BoneCP, etc.). |
Nov 14, 2014 | Warning on using URL normalization: Be careful when using getNormalizedUrl() from the sentric library. Among other things, it removes the "www" prefix from URLs, which can lead to problems when crawling. We suggest using the getRepairedUrl() method instead of getNormalizedUrl() for this purpose. |
Nov 12, 2014 | Note on using external libraries: You can use the following library for URL normalization:
|
Nov 04, 2014 | Note on sending emails to tutors: As mentioned in the kickoff meeting, we are happy to help clarify any questions that you might have. Please send your specific questions to your tutor and cc all other tutors (incl. Prof. Michel). This allows us to collect problematic issues in the assignment sheets that might be of interest to your fellow students. If things cannot be resolved by email, you are welcome to drop by after making an appointment. |
Oct 29, 2014 | Slides from the Kickoff Meeting available. See here |
Oct 10, 2014 | Place and time of Kickoff Event. Place: 36/336. First Meeting (Kickoff): 13:00, Wednesday, Oct. 29th |
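The Nov 14 warning about URL normalization can be illustrated with a minimal sketch. It does not use or reproduce the sentric library; `UrlRepairSketch` and `repairUrl` are hypothetical names, and the code relies only on `java.net.URI` from the Java standard library. The assumption is that a crawler-safe clean-up lowercases the scheme and host, resolves "." and ".." path segments, and drops the fragment, but never rewrites the host name, so "www.example.com" and "example.com" stay distinct.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

public class UrlRepairSketch {

    /**
     * Hypothetical helper (not part of any library): conservative clean-up
     * of an absolute URL that keeps the host, including a "www." prefix.
     * User info (user:password@) is intentionally dropped in this sketch.
     */
    public static String repairUrl(String raw) throws URISyntaxException {
        // normalize() resolves "." and ".." segments in the path
        URI uri = new URI(raw.trim()).normalize();
        if (uri.getScheme() == null || uri.getHost() == null) {
            throw new URISyntaxException(raw, "expected an absolute URL with a host");
        }
        StringBuilder sb = new StringBuilder();
        sb.append(uri.getScheme().toLowerCase(Locale.ROOT)).append("://");
        sb.append(uri.getHost().toLowerCase(Locale.ROOT)); // host is kept as-is apart from case
        if (uri.getPort() != -1) {
            sb.append(':').append(uri.getPort());
        }
        String path = uri.getRawPath();
        sb.append(path == null || path.isEmpty() ? "/" : path);
        if (uri.getRawQuery() != null) {
            sb.append('?').append(uri.getRawQuery());
        }
        // The fragment (#...) is never sent to the server, so it is omitted
        return sb.toString();
    }

    public static void main(String[] args) throws URISyntaxException {
        System.out.println(repairUrl("HTTP://WWW.Example.com/a/./b/../c?q=1#top"));
    }
}
```

Keeping the host untouched matters because some sites serve content only under the "www" name; a crawler that stores the stripped form may then fetch a different page or fail to fetch anything.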
Content
In this project, a Web Search Engine is to be developed. The core tasks are roughly the following:
- Implement an HTML Parser.
- Design and implement a Web Crawler.
- Design the required database schema to store the contents of visited pages and the link structure.
- Write an SQL-based query processor to execute Google-style keyword queries.
- Devise and create index structures to improve query performance.
- Implement alternate query processors using threshold algorithms.
- Realize alternate methods to compute the score of how well a document matches the query.
- For this, implement Google's PageRank algorithm and integrate it into the scoring model.
- Implement an HTML-based user interface and a Web service.
- Use the Web services of your fellow students to realize a meta search engine.
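As a rough illustration of the PageRank task listed above, the following is a minimal in-memory sketch using power iteration. It is not the reference solution for the project: the class name `PageRankSketch`, the adjacency-array input format, the damping factor, and the choice to spread the rank of pages without outgoing links evenly over all pages are all assumptions made for this example. In the project, the link structure would come from the database rather than from an array.

```java
import java.util.Arrays;

public class PageRankSketch {

    /**
     * Computes PageRank by power iteration.
     * outLinks[p] lists the pages that page p links to (hypothetical input format).
     * damping is the probability of following a link (0.85 is a common choice).
     */
    public static double[] pageRank(int[][] outLinks, double damping, int iterations) {
        int n = outLinks.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n); // start from a uniform distribution

        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            double danglingMass = 0.0; // rank held by pages without outgoing links

            for (int page = 0; page < n; page++) {
                if (outLinks[page].length == 0) {
                    danglingMass += rank[page];
                } else {
                    // each page splits its rank evenly among its link targets
                    double share = rank[page] / outLinks[page].length;
                    for (int target : outLinks[page]) {
                        next[target] += share;
                    }
                }
            }
            for (int page = 0; page < n; page++) {
                // random jump plus link-following (dangling mass is spread uniformly)
                next[page] = (1.0 - damping) / n
                        + damping * (next[page] + danglingMass / n);
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // Tiny example graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
        int[][] links = { {1, 2}, {2}, {0} };
        System.out.println(Arrays.toString(pageRank(links, 0.85, 50)));
    }
}
```

With this formulation the ranks always sum to 1, which makes it straightforward to combine them with a normalized text-relevance score in a weighted scoring model.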
Kickoff Meeting
- On Wednesday, October 29th, at 13:00, room 36/336, a kickoff meeting will take place.
- In this meeting, we will discuss organizational aspects, provide pointers to reference material/literature, and hand out and discuss the first exercise sheet.
- Participation in this meeting is mandatory.
Literature
We will introduce the main concepts of the required techniques/tools when handing out the individual exercise sheets. In addition, the following are standard books for databases and information retrieval you might want to consult. We will also give specific pointers to Web sources during the semester.
- Introduction to Information Retrieval, by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, 2008.
- Information Retrieval: Implementing and Evaluating Search Engines, by Stefan Büttcher, Charles L. A. Clarke, and Gordon V. Cormack.
- Datenbanksysteme: Eine Einführung (German), by Alfons Kemper and André Eickler.
- Database Management Systems, by Raghu Ramakrishnan and Johannes Gehrke.
Contact