Computer Science 221: Information Retrieval

Assignment 05

Fall 2008

Department of Informatics

Donald Bren School of Information and Computer Sciences

University of California, Irvine

Due 11/26/2008

  1. Build a postings list
    1. Create a postings list that will work for the cosine similarity algorithm.
    2. The input is the output from your crawl of Wikipedia or, if you could never get your crawl to work, an equivalent crawl of the static Wikipedia snapshot.
    3. The first output is a postings list with metadata:
      1. The head of every list should have the term for that list, plus the document frequency for that term.
      2. Every entry in the list should be a document that has that term, plus the term frequency of that term in that document.
  2. The challenge
    1. This assignment is hard because of the scale. Use as much data as you can.
  3. Evaluation (in person)
    1. I am going to give you five terms. For each term, you return a list of documents, and I will measure:
      1. Precision:
        1. If I give you a word, does every document in the list that you give me have that word in it?
        2. You should get 100% precision.
      2. Recall (based on Google's results):
        1. If I give you a word, does your document list include all the documents that have that word in it?
  4. Practice:
    1. ProudleDuck
      1. http://en.wikipedia.org/wiki/Seraphim_proudleduck#Seraphim_proudleduck
      2. http://en.wikipedia.org/wiki/SEO_contest
      3. http://en.wikipedia.org/wiki/Wikipedia:April_Fool%27s_Main_Page/Did_You_Know/Archive_2007
      4. http://en.wikipedia.org/wiki/Google_bomb
  5. This is optionally a group project, for groups of two only.
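
The postings list layout and the two evaluation measures described above can be sketched as follows. This is only a minimal in-memory illustration, not a required design: the function names (`tokenize`, `build_postings`, `lookup`, `precision`, `recall`), the lowercase alphanumeric tokenizer, and the toy documents are all assumptions made for the example. A real crawl of Wikipedia will probably not fit in memory this way, which is the scale problem the assignment is about.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Assumed tokenizer: lowercase, keep runs of letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_postings(docs):
    """Build term -> (document frequency, [(doc_id, term frequency), ...]).

    `docs` is a hypothetical mapping of document id to page text.
    """
    postings = defaultdict(list)
    for doc_id in sorted(docs):
        # Counter gives the term frequency of each term in this document.
        for term, tf in Counter(tokenize(docs[doc_id])).items():
            postings[term].append((doc_id, tf))
    # The head of each list carries the document frequency,
    # which is the number of entries in that list.
    return {term: (len(entries), entries) for term, entries in postings.items()}


def lookup(index, term):
    """Return the ids of all documents whose postings list contains `term`."""
    head = index.get(term.lower())
    if head is None:
        return []
    _df, entries = head
    return [doc_id for doc_id, _tf in entries]


def precision(returned, relevant):
    """Fraction of returned documents that are in the reference set."""
    returned = set(returned)
    if not returned:
        return 0.0  # assumed convention for an empty result list
    return len(returned & set(relevant)) / len(returned)


def recall(returned, relevant):
    """Fraction of reference documents that were returned."""
    relevant = set(relevant)
    if not relevant:
        return 0.0  # assumed convention for an empty reference set
    return len(set(returned) & relevant) / len(relevant)


# Toy usage with made-up page text (not real crawl output).
docs = {
    "doc1": "Seraphim proudleduck is a proudleduck contest phrase",
    "doc2": "An SEO contest ranks pages",
    "doc3": "Google bomb",
}
index = build_postings(docs)
print(index["proudleduck"])
print(lookup(index, "contest"))
```

Because `lookup` only returns documents that appear in the term's postings list, every returned document contains the term under this tokenizer, which is why 100% precision is achievable. Recall depends on how much of Wikipedia the crawl covered, since documents that were never crawled cannot be returned.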