Fall 2008: Computer Science 221 : Information Retrieval : Assignment 02

Goals:
1. This assignment is designed:
  1. to teach you how difficult crawling the web is in terms of scale and temporal requirements. The software itself is not the hard part of this assignment; the hard part is managing all the data that comes back. And this is only one web site, after all.
  2. to teach you how difficult it is to process text that was not written for computers to consume. Again, this is a very structured domain as far as text goes.
  3. to make the point that web-scale, web-centric activities do not lend themselves to "completeness". In some sense you are never done, so thinking about web algorithms in terms of "finishing" doesn't make sense. You have to change your mindset to "best possible" given the available resources.
Java Program (100%)
1. Administration
  1. You may work in teams of 1 or 2.
2. Write a program to crawl the web.
  1. Inputs
    1. A start page URL (the seed set)
    2. A regular expression
      1. Pages are only crawled if the URL matches this regular expression.
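As an illustration of the regular-expression input, here is a minimal sketch of a URL filter. The class name UrlFilter and the method shouldCrawl are hypothetical, and the sketch assumes the expression must match the whole URL (the assignment does not say whether a partial match is enough).

```java
import java.util.regex.Pattern;

/** Hypothetical helper: decides whether a discovered URL should be crawled. */
final class UrlFilter {
    private final Pattern allowed;

    UrlFilter(String regex) {
        // Compile once; the same pattern is reused for every discovered link.
        this.allowed = Pattern.compile(regex);
    }

    /** Returns true only when the entire URL matches (an assumption, see above). */
    boolean shouldCrawl(String url) {
        return allowed.matcher(url).matches();
    }
}
```

For example, an expression such as `https?://en\.wikipedia\.org/wiki/[^:#]*` would accept English article pages and reject other languages and namespaced pages like Talk: pages. That expression is only an example, not a required input.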
  2. Output
    1. A graph of the crawled pages
    2. An index
      1. of (term, document) pairs
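One possible shape for the index of (term, document) pairs is sketched below. The class TermIndex, its method names, and the tokenization rule (lower-case, split on anything outside [a-z0-9]) are assumptions for illustration. The sketch keeps everything in memory, which will probably not hold up at Wikipedia scale, so treat it as a starting point only.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical in-memory index mapping each term to the documents containing it. */
final class TermIndex {
    private final Map<String, Set<String>> postings = new TreeMap<>();

    /** Records one (term, url) pair for every distinct term in the text. */
    void addDocument(String url, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(url);
            }
        }
    }

    /** Returns the documents containing the term, or an empty set if there are none. */
    Set<String> documentsFor(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
    }
}
```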
  3. Structure
    1. You can, but don't have to, use two libraries:
      1. WebSphinx
        
        This is a crawler library
        
        You are not to use the "crawler workbench"
        
        You should compose the necessary components (green on right) to build the architecture on the right.
        
        Some keys for reducing memory consumption:
        
        a
      2. WebGraph
        
        This will build a compressed adjacency graph for you
3. Using this architecture, search for the following features:
  1. Find the longest palindrome in Wikipedia that is not on a page about palindromes.
    1. What I want is a palindrome made from English words. Here are some algorithmic guidelines to help you find that:
      1. First split the page whenever you see a non-ASCII (>127) character, so that an Arabic character, for example, will never be embedded in a palindrome.
      2. Second strip all punctuation from the text so that only [A-Za-z0-9] remain.
      3. Convert all characters to upper or lower case.
      4. A palindrome consists of the longest common substring between a line of text and its reverse.
      5. Once you identify all palindromes on a page over X characters, make sure that each one meets both of these conditions:
        
        It is greater than 5 characters long.
        
        Less than 10% of the original text was punctuation.
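The palindrome guidelines above can be sketched roughly as follows. This is not a reference solution. The class PalindromeFinder is hypothetical, it uses expand-around-center rather than the longest-common-substring formulation, it only tests the longest palindrome at each center, and it assumes whitespace does not count as punctuation for the 10% rule.

```java
/** Hypothetical sketch of the palindrome search for one page of text. */
final class PalindromeFinder {
    static final int MIN_LENGTH = 6;              // "greater than 5 characters"
    static final double MAX_PUNCTUATION = 0.10;   // "less than 10% ... punctuation"

    /** Returns the longest qualifying palindrome as it appears in the page, or "" if none. */
    static String longest(String page) {
        String best = "";
        int bestLen = 0;
        // Step 1: split on runs of non-ASCII characters.
        for (String segment : page.split("[^\\x00-\\x7F]+")) {
            // Steps 2-3: keep only [A-Za-z0-9], lower-cased, remembering original positions.
            int[] pos = new int[segment.length()];
            StringBuilder norm = new StringBuilder();
            for (int i = 0; i < segment.length(); i++) {
                char c = segment.charAt(i);
                if (isAlnum(c)) {
                    pos[norm.length()] = i;
                    norm.append(Character.toLowerCase(c));
                }
            }
            int n = norm.length();
            // Step 4: expand around each of the 2n-1 centers (odd and even lengths).
            for (int center = 0; center < 2 * n - 1; center++) {
                int lo = center / 2;
                int hi = lo + center % 2;
                while (lo >= 0 && hi < n && norm.charAt(lo) == norm.charAt(hi)) {
                    lo--;
                    hi++;
                }
                int len = hi - lo - 1;   // palindrome occupies norm[lo+1 .. hi-1]
                if (len >= MIN_LENGTH && len > bestLen) {
                    // Step 5: check punctuation density on the original text span.
                    String original = segment.substring(pos[lo + 1], pos[hi - 1] + 1);
                    if (punctuationRatio(original) < MAX_PUNCTUATION) {
                        best = original;
                        bestLen = len;
                    }
                }
            }
        }
        return best;
    }

    private static boolean isAlnum(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
    }

    /** Fraction of characters that are neither alphanumeric nor whitespace (assumed definition). */
    private static double punctuationRatio(String s) {
        int punct = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (!isAlnum(c) && !Character.isWhitespace(c)) {
                punct++;
            }
        }
        return (double) punct / s.length();
    }
}
```

Expanding around centers is quadratic in the worst case, which is usually acceptable for a single page but worth measuring before running it over a whole crawl.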
  2. Find the longest lipogram (letter "E"/"e") in Wikipedia that is not on a page about lipograms.
    1. What I want is a lipogram made from English words. Here are some algorithmic guidelines to help you find that:
      1. First split the page whenever you see a non-ASCII (>127) character, so that an Arabic character, for example, will never be embedded in a lipogram.
      2. Second strip all punctuation from the text so that only [A-Za-z0-9] remain.
      3. A lipogram is the longest sequence of text which doesn't contain a particular letter.
      4. Once you identify all lipograms on a page over X characters, make sure that each one meets this condition:
        
        Less than 10% of the original text was punctuation.
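A rough sketch of the lipogram guidelines above follows. The class LipogramFinder is hypothetical. It measures length in [A-Za-z0-9] characters, assumes whitespace does not count as punctuation, and does not snap the result to word boundaries, so a run may begin or end in the middle of a word.

```java
/** Hypothetical sketch of the lipogram search for one page of text. */
final class LipogramFinder {
    /** Returns the longest run of text without the given letter (either case), or "" if none. */
    static String longest(String page, char letter) {
        char banned = Character.toLowerCase(letter);
        String best = "";
        int bestLen = 0;
        // Split on runs of non-ASCII characters first.
        for (String segment : page.split("[^\\x00-\\x7F]+")) {
            int start = 0;
            for (int i = 0; i <= segment.length(); i++) {
                boolean boundary = i == segment.length()
                        || Character.toLowerCase(segment.charAt(i)) == banned;
                if (!boundary) {
                    continue;
                }
                // A run ends at each occurrence of the banned letter and at the segment end.
                String run = segment.substring(start, i).trim();
                start = i + 1;
                int alnum = 0;
                int punct = 0;
                for (int j = 0; j < run.length(); j++) {
                    char c = run.charAt(j);
                    if (Character.isLetterOrDigit(c)) {
                        alnum++;
                    } else if (!Character.isWhitespace(c)) {
                        punct++;
                    }
                }
                // Keep the run only if it is longer and under the 10% punctuation limit.
                if (alnum > bestLen && punct < 0.10 * run.length()) {
                    best = run;
                    bestLen = alnum;
                }
            }
        }
        return best;
    }
}
```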
  3. Find the rhopalic with the greatest number of words in Wikipedia.
    1. What I want is a rhopalic made from English words. Here are some algorithmic guidelines to help you find that:
      1. First split the page whenever you see a non-ASCII (>127) character, so that an Arabic character, for example, will never be embedded in a rhopalic.
      2. For our purposes a rhopalic is a sequence of words in which each word is one character longer than the word before it.
        
        The first word has N characters. The second word has N+1 characters. Words are separated by at least 1 and no more than 3 spaces, white space, or punctuation.
        
        A valid rhopalic then looks like this regular expression:
        
        \b[A-Za-z]{N}\b[\s!@#$%^&*()\-_=+<>,.`~{}\[\]|\\/?]{1,3}\b[A-Za-z]{N+1}\b etc....
        
        Example: "I am the most happy person talking"
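Because N is not known in advance, a single fixed regular expression is awkward here. One alternative is to scan the words in order and check each gap between them. The sketch below takes that approach. The class RhopalicFinder is hypothetical, and its separator class copies the character set from the expression above with the hyphen escaped so that it is read as a literal character.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: longest run of words that each grow by one character. */
final class RhopalicFinder {
    private static final Pattern WORD = Pattern.compile("[A-Za-z]+");
    // One to three separator characters; non-ASCII characters never match.
    private static final Pattern SEPARATOR =
            Pattern.compile("[\\s!@#$%^&*()\\-_=+<>,.`~{}\\[\\]|\\\\/?]{1,3}");

    /** Returns the words of the longest rhopalic found in the text. */
    static List<String> longest(String text) {
        List<String> best = new ArrayList<>();
        List<String> current = new ArrayList<>();
        int prevEnd = -1;
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            String word = m.group();
            // The chain continues only if the word is one longer and the gap is valid.
            boolean continues = !current.isEmpty()
                    && word.length() == current.get(current.size() - 1).length() + 1
                    && SEPARATOR.matcher(text.substring(prevEnd, m.start())).matches();
            if (!continues) {
                current = new ArrayList<>();
            }
            current.add(word);
            if (current.size() > best.size()) {
                best = new ArrayList<>(current);
            }
            prevEnd = m.end();
        }
        return best;
    }
}
```

A single word is returned as a chain of length one, so callers will likely want to ignore results with fewer than two words.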
4. After crawling the web, use your web graph to calculate the shortest path
  1. from: http://en.wikipedia.org/wiki/Stonehenge
  2. to: http://en.wikipedia.org/wiki/Egyptian_pyramids
  3. that stays in the English content pages of Wikipedia.org
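Since every link counts as one hop, a breadth-first search is one straightforward way to find a shortest path. The sketch below assumes the graph has been exported to a plain map from page URL to outgoing link URLs. The class ShortestPath and that map representation are hypothetical and do not reflect the WebGraph API.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: breadth-first search over an adjacency map of URLs. */
final class ShortestPath {
    /** Returns the pages on a shortest path from 'from' to 'to', or an empty list if unreachable. */
    static List<String> find(Map<String, List<String>> links, String from, String to) {
        Map<String, String> parent = new HashMap<>();   // page -> page it was reached from
        Deque<String> queue = new ArrayDeque<>();
        parent.put(from, null);
        queue.add(from);
        while (!queue.isEmpty()) {
            String page = queue.remove();
            if (page.equals(to)) {
                // Walk the parent pointers back to the start to rebuild the path.
                LinkedList<String> path = new LinkedList<>();
                for (String p = to; p != null; p = parent.get(p)) {
                    path.addFirst(p);
                }
                return path;
            }
            for (String next : links.getOrDefault(page, Collections.emptyList())) {
                if (!parent.containsKey(next)) {
                    parent.put(next, page);
                    queue.add(next);
                }
            }
        }
        return Collections.emptyList();
    }
}
```

Restricting the map to URLs that passed the crawl's regular expression keeps the path inside the English content pages.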
5. (100%) Evaluation:
  1. (60%) Produce the palindrome, lipogram and rhopalic and source URLs from part 3.
    1. Grades will be assigned according to the length of the sequence.
      1. The longest sequence will receive 100%.
      2. The shortest non-trivial sequence will receive 80%.
  2. (30%) Produce your sequence of URLs from part 4.
    1. Show this sequence as a collection of screen shots indicating the path so that the instructors can verify the path manually.
    2. Show the anchor text and the URL so that it is easy to verify.
    3. Grades will be assigned according to the length of the sequence.
      1. The shortest sequence will receive 100%.
      2. The longest valid sequence (< 20 hops) will receive 80%.
  3. (10%) Email your results as a PDF document
    1. Make the file name
      1. <LastName1> - < LastName2 > - Assignment02.pdf or
      2. <LastName1> - Assignment02.pdf
6. Train your group
  1. Each member of your group must be able to run your architecture on their own for Assignment 03.