Create a postings list that will work for the cosine similarity algorithm.
The input is the output from your crawl of Wikipedia or,
if you could never get your crawl to work, an equivalent crawl of the static Wikipedia snapshot.
The first output is a postings list with metadata:
The head of every list should have the term for that list, plus the document frequency for that term.
Every entry in the list should be a document that has that term, plus the term frequency of that term in that document.
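
As a rough illustration of the structure described above, here is a minimal Python sketch. It is not a required design: the function name `build_postings`, the dict-of-dicts layout, the `docs` input shape (document ID mapped to text), and the whitespace tokenizer are all assumptions made for the example, and an in-memory dict like this will not hold a full Wikipedia crawl.

```python
from collections import Counter, defaultdict


def build_postings(docs):
    """Build a postings list from docs, a dict of doc_id -> text.

    Hypothetical helper: the input shape and tokenizer are assumptions.
    """
    postings = defaultdict(list)
    for doc_id in sorted(docs):
        # Naive lowercase/whitespace tokenizer, chosen only for brevity.
        counts = Counter(docs[doc_id].lower().split())
        for term, tf in counts.items():
            # Each entry: a document containing the term, plus its term frequency.
            postings[term].append((doc_id, tf))
    # Head of each list: the term (the key) and its document frequency.
    return {
        term: {"df": len(entries), "postings": entries}
        for term, entries in postings.items()
    }


docs = {
    "d1": "the cat sat on the mat",
    "d2": "the dog sat",
}
index = build_postings(docs)
```

Here `index["the"]` carries a document frequency of 2 and one entry per document, each with its term frequency. These are the two quantities a cosine similarity computation with tf-idf weights would read from the list.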
The challenge
This assignment is hard because of the scale. Use as much data as you can.
Evaluation
(in person)
I am going to give you five terms. For each term, I want you to return a list of documents and measure:
Precision:
If I give you a word, does the document list that you give me have that word in it?
You should get 100% precision.
Recall (based on Google's results):
If I give you a word, does your document list include all the documents that have that word in it?
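
The two measures can be sketched in Python as set arithmetic. This is only an illustration: `precision_recall` is a hypothetical helper, `returned` stands for the document list you give back for a term, and `relevant` stands for whatever reference set is used (for recall, presumably documents taken from Google's results). The values returned for empty inputs are a convention chosen for this sketch.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for one query term.

    returned: documents your postings list gives back (assumed input).
    relevant: reference documents that really contain the term (assumed input).
    """
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    # Precision: fraction of returned documents that are correct.
    # Convention for this sketch: an empty result list counts as 1.0.
    precision = len(hits) / len(returned) if returned else 1.0
    # Recall: fraction of the reference documents that were returned.
    recall = len(hits) / len(relevant) if relevant else 1.0
    return precision, recall


p, r = precision_recall(
    returned=["d1", "d2", "d3"],
    relevant=["d1", "d2", "d4", "d5"],
)
```

In this made-up example two of the three returned documents are in the reference set, and two of the four reference documents were returned.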