Topic
1/27/2009 Lecture Notes
Kick-off:
No one as Irish as Barack O'Bama
www.oneeyedparrot.org—obama.html
Learning Objective:
Explain the principles behind the design of the Mercator URL frontier architecture
Explain what we are interested in capturing during a crawl
Vector Space Model
Posting list
Connectivity Graph
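A minimal sketch of the three things listed above, built from a handful of crawled pages. The `pages` dictionary, its URLs, and its text are made-up illustration data, not output from crawler4j or from the assignment; the vector space model is reduced here to raw term counts per page.

```python
from collections import Counter, defaultdict

# Hypothetical crawl output: URL -> (page text, outlinks). Invented for illustration.
pages = {
    "a.example/1": ("web crawlers crawl the web", ["a.example/2", "b.example/1"]),
    "a.example/2": ("the frontier holds urls", ["a.example/1"]),
    "b.example/1": ("crawlers respect hosts", []),
}

term_vectors = {}              # vector space model: URL -> term counts
postings = defaultdict(list)   # posting lists: term -> URLs containing the term
out_links = {}                 # connectivity graph: URL -> pages it links to
in_links = defaultdict(list)   # reverse edges: URL -> pages linking to it

# Visit pages in sorted URL order so every posting list comes out sorted.
for url in sorted(pages):
    text, links = pages[url]
    counts = Counter(text.lower().split())
    term_vectors[url] = counts
    for term in counts:
        postings[term].append(url)
    out_links[url] = list(links)
    for target in links:
        in_links[target].append(url)
```

Looking up `postings["crawlers"]` gives every page containing that term, and `in_links` answers "who links to this page?", which is the kind of question the connectivity graph is kept for.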
Update on Assignment #3 and schedule
crawler4j is now on version 1.0.3
Group members: post your names
Less than one week to complete
How are we doing?
Cards
Review Cards
How fast is fast enough for palindromes?
Do crawlers continue to crawl forever?
Should our crawler crawl forever?
What are some cool things crawlers can do?
Send you alerts when they find something
When does a crawler use front queues vs. back queues?
What about for our assignment?
What do we do when the back queues fill up?
Where do the URLs come from that need to be crawled?
seed set then outlinks
What are duplication and shingling?
Why is crawling so complicated?
scale
social contracts
adversarial technology
need for robustness
Is Mercator the best architecture for crawling?
How is round robin biased toward highest priority?
Do crawlers crawl randomly?
What's the host splitter about?
How do the queues recognize a web page loop?
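Several of the cards above (front vs. back queues, the host splitter, the bias toward high priority, full back queues, page loops) can be seen together in a toy sketch of a Mercator-style frontier. This is a simplified, single-threaded illustration and not the actual Mercator or crawler4j code: the class name `Frontier`, its parameters, and the weighted random choice used for the bias are all assumptions made for the example.

```python
import heapq
import random
from collections import deque
from urllib.parse import urlsplit


class Frontier:
    """Toy Mercator-style URL frontier (illustrative only)."""

    def __init__(self, num_priorities=3, num_back=2, delay=2.0, seed=0):
        # Front queues: one FIFO per priority level; index 0 is the highest.
        self.front = [deque() for _ in range(num_priorities)]
        self.num_back = num_back   # at most this many hosts are active at once
        self.back = {}             # back queues: host -> FIFO of its URLs
        self.heap = []             # (earliest allowed fetch time, host)
        self.ready_at = {}         # host -> earliest next polite fetch time
        self.seen = set()          # URLs already added; this is what stops loops
        self.delay = delay         # politeness gap between hits on one host
        self.rng = random.Random(seed)

    def add(self, url, priority):
        """Queue a URL unless it was seen before. Returns True if queued."""
        if url in self.seen:
            return False
        self.seen.add(url)
        self.front[priority].append(url)
        return True

    def _pick_front(self):
        """Biased pick: higher-priority non-empty queues get larger weights."""
        candidates = [i for i, q in enumerate(self.front) if q]
        if not candidates:
            return None
        weights = [len(self.front) - i for i in candidates]
        i = self.rng.choices(candidates, weights=weights)[0]
        return self.front[i].popleft()

    def _refill(self):
        """Host splitter: move URLs from front queues into per-host back queues.

        Stops when every back-queue slot is taken; remaining URLs simply
        wait in the front queues.
        """
        while len(self.back) < self.num_back:
            url = self._pick_front()
            if url is None:
                return
            host = urlsplit(url).netloc
            if host in self.back:
                self.back[host].append(url)
            else:
                self.back[host] = deque([url])
                heapq.heappush(self.heap, (self.ready_at.get(host, 0.0), host))

    def next_url(self, now):
        """Return (url, fetch_time), or None when nothing is left to crawl."""
        self._refill()
        if not self.heap:
            return None
        ready, host = heapq.heappop(self.heap)
        url = self.back[host].popleft()
        fetch_time = max(now, ready)   # never earlier than the host allows
        self.ready_at[host] = fetch_time + self.delay
        if self.back[host]:
            heapq.heappush(self.heap, (self.ready_at[host], host))
        else:
            del self.back[host]        # frees a slot for another host
        return url, fetch_time
```

Front queues only decide which URL is considered next; back queues and the heap decide when a host may be hit again. Instead of sleeping, `next_url` returns the earliest polite fetch time, so two URLs on the same host come back at least `delay` apart even when both are high priority.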
Video break
nru
www.youtube.com—watch