Fall 2008: Computer Science 221 : Information Retrieval : Assignment 05

Due 11/26/2008

Build a postings list
1. Create a postings list that will work for the cosine similarity algorithm.
2. The input is the output from your crawl of Wikipedia or, if you could never get your crawl to work, an equivalent crawl of the static Wikipedia snapshot.
3. The first output is a postings list with metadata:
  1. The head of every list should have the term for that list, plus the document frequency for that term.
  2. Every entry in the list should be a document that has that term, plus the term frequency of that term in that document.
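The structure described above can be sketched in a few lines of Python. This is only a minimal in-memory illustration, not a required design: the names `tokenize` and `build_postings`, the dictionary layout, the toy documents, and the tokenization rule (lowercase, split on non-alphanumeric characters) are all assumptions made for the example. At Wikipedia scale the index will likely not fit in memory this way.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split on non-alphanumeric characters.

    This tokenization rule is an assumption for the sketch.
    """
    return re.findall(r"[a-z0-9]+", text.lower())


def build_postings(docs):
    """Build a postings list from a dict of document id -> raw text.

    Returns a dict mapping each term to:
      "df":       the document frequency (number of documents with the term)
      "postings": a list of (doc_id, term_frequency) pairs, sorted by doc_id
    """
    index = defaultdict(list)
    for doc_id in sorted(docs):
        counts = Counter(tokenize(docs[doc_id]))
        for term, tf in counts.items():
            index[term].append((doc_id, tf))
    # The head of each list carries the term (the key) and its document frequency.
    return {
        term: {"df": len(plist), "postings": plist}
        for term, plist in index.items()
    }


# Hypothetical toy corpus standing in for the crawl output.
docs = {
    "doc1": "the duck saw the other duck",
    "doc2": "a duck and a goose",
    "doc3": "the goose",
}
index = build_postings(docs)
print(index["duck"])
```

Storing the term frequency in each entry and the document frequency at the head gives a cosine similarity computation the tf and idf inputs it needs without rescanning the documents.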
The challenge
1. This assignment is hard because of the scale. Use as much data as you can.
Evaluation (in person)
1. I will give you five terms. For each term, return a list of documents, which will be measured on:
  1. Precision:
    1. If I give you a word, does every document in the list you return contain that word?
    2. You should get 100% precision.
  2. Recall (based on Google's results):
    1. If I give you a word, does your document list include all the documents that have that word in it?
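The two measures above can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names, the `doc_terms` mapping (document id to the set of terms it contains), and the toy data are hypothetical, and the reference set stands in for whatever list of documents is taken from Google's results.

```python
def precision(returned, term, doc_terms):
    """Fraction of returned documents that actually contain the term.

    doc_terms maps a document id to the set of terms in that document
    (a hypothetical structure for this sketch).
    """
    returned = set(returned)
    if not returned:
        return 1.0  # assumption: an empty result list is treated as vacuously precise
    hits = sum(1 for doc_id in returned if term in doc_terms.get(doc_id, set()))
    return hits / len(returned)


def recall(returned, reference):
    """Fraction of the reference documents that appear in the returned list."""
    reference = set(reference)
    if not reference:
        return 1.0  # assumption: nothing to find counts as full recall
    return len(set(returned) & reference) / len(reference)


# Hypothetical toy data.
doc_terms = {
    "doc1": {"the", "duck"},
    "doc2": {"a", "duck", "goose"},
    "doc3": {"the", "goose"},
}
returned = ["doc1", "doc2"]
reference = ["doc1", "doc2", "doc4", "doc5"]

p = precision(returned, "duck", doc_terms)
r = recall(returned, reference)
print(p, r)
```

If the document list is read directly from the postings list for the term, precision should come out at 100% by construction; recall is the measure that depends on how much data was indexed.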
Practice:
1. ProudleDuck
  1. http://en.wikipedia.org/wiki/Seraphim_proudleduck#Seraphim_proudleduck
  2. http://en.wikipedia.org/wiki/SEO_contest
  3. http://en.wikipedia.org/wiki/Wikipedia:April_Fool%27s_Main_Page/Did_You_Know/Archive_2007
  4. http://en.wikipedia.org/wiki/Google_bomb
This may optionally be done as a group project, in groups of two only.