THE WEB (August 28) Two newer, smarter tools for finding and indexing resources on the Web have been released this summer. Carnegie Mellon University's Center for Machine Translation announced the public availability of its Lycos (TM) WWW search engine on August 12th, and the Internet Research Task Force Research Group on Resource Discovery's Harvest system has been presented in several papers during the summer. Both systems are now in place for public use.
The Lycos and Harvest systems attack a problem that has plagued many information spaces before the Web--how can a user find resources related to a topic or locate a specific resource? In ftpspace, there's archie; in gopherspace, there's veronica. For the Web, there is a variety of Web robots, wanderers, and spiders that have been crawling through the Web and collecting information about what they find. Oliver McBryan's World-Wide Web Worm, released in March, was an early ancestor of the newer species of spiders on the Web today. The Worm collected a database of over 100,000 resources and still provides the user with a search interface to that database (current to March 7, 1994). Both Lycos and Harvest build on the Worm's techniques, offer more current databases, and collect their information more efficiently.
The secret of Lycos' search technique lies in random choices tempered by preferences. Lycos starts with a given URL and collects information from that resource, including such items as a document outline, a keyword list, and an excerpt.
While many early Web spiders infested a particular server with a large number of rapid, sequential accesses, Lycos behaves more considerately, using its random-choice strategy to spread its accesses across servers rather than descending on any single one with a burst of requests.
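To make the crawling idea concrete, here is a minimal Python sketch of a crawl driven by "random choices tempered by preferences." The fetch_page and score helpers, and the summary fields they return, are hypothetical illustrations for this sketch, not the actual Lycos code.

    import random

    def crawl(start_url, fetch_page, score, max_docs=100):
        """Sketch of a crawl driven by random choices tempered by preferences.

        fetch_page(url) -> (summary, links)  # hypothetical: summary dict plus outbound URLs
        score(url)      -> positive weight   # hypothetical: higher means more attractive
        """
        frontier = [start_url]
        seen = set()
        database = {}

        while frontier and len(database) < max_docs:
            # Pick the next URL at random, biased by its preference score,
            # instead of exhausting one server with rapid sequential requests.
            weights = [score(u) for u in frontier]
            url = random.choices(frontier, weights=weights, k=1)[0]
            frontier.remove(url)
            if url in seen:
                continue
            seen.add(url)

            # Record a compact summary (e.g. outline, keywords, excerpt)
            # rather than the full document.
            summary, links = fetch_page(url)
            database[url] = summary
            frontier.extend(l for l in links if l not in seen)

        return database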
With more than 634,000 references in its database as of the end of August, Lycos offers a huge database for locating documents that match a given query. The search interface provides a way for users to find documents that contain references to a keyword and to examine a document outline, keyword list, and excerpt. In this way, Lycos enables the user to determine whether a document might be valuable without having to retrieve it. According to Dr. Mauldin, plans are in the works for allowing users to register pages and for other kinds of searching schemes. Another related project underway, WebAnts, aims at creating cooperating explorers so that an individual spider doesn't have to do all the work of finding things on the Web or duplicate other spiders' efforts.
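A search interface of this kind can be sketched as a simple lookup over the collected summaries. The field names below (outline, keywords, excerpt) follow the description above, but the record structure is an assumption for illustration, not the real Lycos database.

    def search(database, keyword):
        """Return summaries of documents whose keywords mention `keyword`,
        so a user can judge relevance without retrieving each document."""
        keyword = keyword.lower()
        hits = []
        for url, summary in database.items():
            if any(keyword in k.lower() for k in summary.get("keywords", [])):
                hits.append({
                    "url": url,
                    "outline": summary.get("outline"),
                    "keywords": summary.get("keywords"),
                    "excerpt": summary.get("excerpt"),
                })
        return hits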
The philosophy behind the Harvest system is to gather information about Internet resources and to customize views into what is "harvested." According to developer Mike Schwartz, "Harvest is much more than just a 'spider.' It's intended to be a scalable form of infrastructure for building and distributing content, indexing information, as well as for accessing Web information." The complete capabilities of Harvest are beyond the scope of this news article; for further information, the reader is directed to The Harvest Information Discovery and Access System web page.
Harvest consists of several subsystems. A Gatherer collects indexing information, and a Broker provides a flexible interface to this information. A user can access a variety of collections of documents. The Harvest WWW Broker, for example, includes content summaries of more than 7,000 Web pages. This database has a very flexible interface, providing search queries based on author, keyword, title, or URL-reference. While the Harvest database (the WWW pages) isn't yet as extensive as other spiders', its potential for efficiently collecting a large amount of information is great.
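The Gatherer/Broker division of labor might be pictured with the following Python sketch. The record fields echo the query types mentioned above (author, keyword, title, URL-reference), while the classes and their methods are illustrative assumptions, not Harvest's actual interfaces.

    class Gatherer:
        """Collects indexing information (content summaries) from resources."""
        def __init__(self, summarize):
            self.summarize = summarize  # hypothetical function: url -> summary dict

        def gather(self, urls):
            # Each summary might carry fields such as author, title,
            # keywords, and url-reference.
            return [self.summarize(u) for u in urls]

    class Broker:
        """Provides a flexible query interface over gathered summaries."""
        def __init__(self, summaries):
            self.summaries = summaries

        def query(self, **fields):
            # Match summaries whose string fields contain all requested values,
            # e.g. broker.query(author="Schwartz", keyword="indexing").
            def matches(s):
                return all(str(v).lower() in str(s.get(k, "")).lower()
                           for k, v in fields.items())
            return [s for s in self.summaries if matches(s)]

A Broker built this way could serve summaries produced by several Gatherers, which is one way to read the division of labor described above.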
Other subsystems further refine Harvest's capabilities. An Indexing/Searching subsystem provides a way for a variety of search engines to be used; for example, Glimpse supports very rapid, space-efficient searches with interactive queries, while Nebula provides fast searches for more complex queries. Another Harvest subsystem, a Replicator, provides a way to mirror the information that Brokers hold, and an Object Cache meets the demands of managing networked information by providing the capability to locate the fastest-responding server for a query.
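The Object Cache's notion of locating the fastest-responding server can be illustrated with a simple probe-and-pick routine. The HTTP probe below is an assumption made for the sketch, not the actual Harvest caching protocol.

    import time
    import urllib.request

    def fastest_server(mirror_urls, timeout=5):
        """Probe each mirror and return the URL that responds most quickly."""
        best_url, best_time = None, float("inf")
        for url in mirror_urls:
            try:
                start = time.monotonic()
                with urllib.request.urlopen(url, timeout=timeout):
                    elapsed = time.monotonic() - start
            except OSError:
                continue  # unreachable or timed-out mirror: skip it
            if elapsed < best_time:
                best_url, best_time = url, elapsed
        return best_url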
While spiders like the Worm could successfully crawl through Webspace in the first part of 1994, the rapid increase in the amount of information on the Web since then makes the same crawl difficult for the older spiders. Harvest's systems and subsystems are extensive and provide for efficient, flexible operation, and its design addresses the very important issue of scalability. Similarly, the WebAnts project addresses this scalability issue through its vision of cooperating spiders crawling through the Web. The promise for the future is that systems like Harvest and Lycos will provide users with increasingly efficient ways to locate information on the Nets. ¤