THE WEB (August 28) Two newer, smarter tools for finding and indexing resources on the Web have been released this summer. Carnegie Mellon University's Center for Machine Translation announced the public availability of its Lycos (TM) WWW search engine on August 12th, and the Internet Research Task Force Research Group on Resource Discovery's Harvest system has been presented in several papers during the summer. Both systems are now in place for public use.
The Lycos and Harvest systems attack a problem that has plagued many information spaces before the Web--how can a user find resources related to a topic or locate a specific resource? In ftpspace, there's archie; in gopherspace, there's veronica. For the Web, there is a variety of Web robots, wanderers, and spiders that have been crawling through the Web and collecting information about what they find. Oliver McBryan's World-Wide Web Worm, released in March, was an early ancestor of the newer species of spiders on the Web today. The Worm collected a database of over 100,000 resources and still provides the user with a search interface to that database (current to March 7, 1994). Both Lycos and Harvest build on the Worm's techniques, offer more current databases, and collect their information more efficiently.
The secret of Lycos' search technique lies in random choices tempered by preferences. Lycos starts with a given URL and collects information from that resource, including such items as a document outline, a keyword list, and an excerpt.
While many early Web spiders infested a particular server with a large number of rapid, sequential accesses, Lycos behaves more considerately, using its random-choice strategy to spread its accesses across servers rather than descending on any single one with a burst of requests.
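To make the crawling idea concrete, here is a minimal Python sketch of a crawl driven by "random choices tempered by preferences." The fetch_page and score helpers, and the summary fields they return, are hypothetical illustrations for this sketch, not the actual Lycos code.

    import random

    def crawl(start_url, fetch_page, score, max_docs=100):
        """Sketch of a crawl driven by random choices tempered by preferences.

        fetch_page(url) -> (summary, links)  # hypothetical: summary dict plus outbound URLs
        score(url)      -> positive weight   # hypothetical: higher means more attractive
        """
        frontier = [start_url]
        seen = set()
        database = {}

        while frontier and len(database) < max_docs:
            # Pick the next URL at random, biased by its preference score,
            # instead of exhausting one server with rapid sequential requests.
            weights = [score(u) for u in frontier]
            url = random.choices(frontier, weights=weights, k=1)[0]
            frontier.remove(url)
            if url in seen:
                continue
            seen.add(url)

            # Record a compact summary (e.g. outline, keywords, excerpt)
            # rather than the full document.
            summary, links = fetch_page(url)
            database[url] = summary
            frontier.extend(l for l in links if l not in seen)

        return database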
With more than 634,000 references in its database as of the end of August, Lycos offers a huge database for locating documents that match a given query. The search interface provides a way for users to find documents that contain references to a keyword and to examine a document outline, keyword list, and excerpt. In this way, Lycos enables the user to determine whether a document might be valuable without having to retrieve it. According to Dr. Mauldin, plans are in the works for allowing users to register pages and for other kinds of searching schemes. Another related project underway, WebAnts, aims at creating cooperating explorers so that an individual spider doesn't have to do all the work of finding things on the Web or duplicate other spiders' efforts.
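A search interface of this kind can be sketched as a simple lookup over the collected summaries. The field names below (outline, keywords, excerpt) follow the description above, but the record structure is an assumption for illustration, not the real Lycos database.

    def search(database, keyword):
        """Return summaries of documents whose keywords mention `keyword`,
        so a user can judge relevance without retrieving each document."""
        keyword = keyword.lower()
        hits = []
        for url, summary in database.items():
            if any(keyword in k.lower() for k in summary.get("keywords", [])):
                hits.append({
                    "url": url,
                    "outline": summary.get("outline"),
                    "keywords": summary.get("keywords"),
                    "excerpt": summary.get("excerpt"),
                })
        return hits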
The philosophy behind the Harvest system is to gather information about Internet resources and to customize views into what is "harvested." According to developer Mike Schwartz, "Harvest is much more than just a 'spider.' It's intended to be a scalable form of infrastructure for building and distributing content, indexing information, as well as for accessing Web information." The complete capabilities of Harvest are beyond the scope of this news article; for further information, the reader is directed to The Harvest Information Discovery and Access System web page.
Harvest consists of several subsystems. A Gatherer collects indexing information, and a Broker provides a flexible interface to this information. A user can access a variety of collections of documents. The Harvest WWW Broker, for example, includes content summaries of more than 7,000 Web pages. This database has a very flexible interface, providing search queries based on author, keyword, title, or URL-reference. While the Harvest database (the WWW pages) isn't yet as extensive as other spiders', its potential for efficiently collecting a large amount of information is great.
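The Gatherer/Broker division of labor might be pictured with the following Python sketch. The record fields echo the query types mentioned above (author, keyword, title, URL-reference), while the classes and their methods are illustrative assumptions, not Harvest's actual interfaces.

    class Gatherer:
        """Collects indexing information (content summaries) from resources."""
        def __init__(self, summarize):
            self.summarize = summarize  # hypothetical function: url -> summary dict

        def gather(self, urls):
            # Each summary might carry fields such as author, title,
            # keywords, and url-reference.
            return [self.summarize(u) for u in urls]

    class Broker:
        """Provides a flexible query interface over gathered summaries."""
        def __init__(self, summaries):
            self.summaries = summaries

        def query(self, **fields):
            # Match summaries whose string fields contain all requested values,
            # e.g. broker.query(author="Schwartz", keyword="indexing").
            def matches(s):
                return all(str(v).lower() in str(s.get(k, "")).lower()
                           for k, v in fields.items())
            return [s for s in self.summaries if matches(s)]

A Broker built this way could serve summaries produced by several Gatherers, which is one way to read the division of labor described above.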
Other subsystems further refine Harvest's capabilities. An Indexing/Searching subsystem provides a way for a variety of search engines to be used; for example, Glimpse supports very rapid, space-efficient searches with interactive queries, while Nebula provides fast searches for more complex queries. Another Harvest subsystem, a Replicator, provides a way to mirror the information that Brokers hold, and an Object Cache meets the demands of managing networked information by providing the capability to locate the fastest-responding server for a query.
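The Object Cache's notion of locating the fastest-responding server can be illustrated with a simple probe-and-pick routine. The HTTP probe below is an assumption made for the sketch, not the actual Harvest caching protocol.

    import time
    import urllib.request

    def fastest_server(mirror_urls, timeout=5):
        """Probe each mirror and return the URL that responds most quickly."""
        best_url, best_time = None, float("inf")
        for url in mirror_urls:
            try:
                start = time.monotonic()
                with urllib.request.urlopen(url, timeout=timeout):
                    elapsed = time.monotonic() - start
            except OSError:
                continue  # unreachable or timed-out mirror: skip it
            if elapsed < best_time:
                best_url, best_time = url, elapsed
        return best_url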
While spiders like the Worm could successfully crawl through Webspace in the first part of 1994, the rapid increase in the amount of information on the Web since then makes the same crawl difficult for the older spiders. Harvest's systems and subsystems are extensive and provide for efficient, flexible operation, and its design addresses the very important issue of scalability. Similarly, the WebAnts project addresses this scalability issue through its vision of cooperating spiders crawling through the Web. The promise for the future is that systems like Harvest and Lycos will provide users with increasingly efficient ways to locate information on the Nets. ¤