Web Agent Related Research at the Center for Machine Translation

To be presented at the SIGNIDR meeting, August 4, 1994 in McLean, Virginia

Web Agent Related Research at the Center for Machine Translation

Michael L. Mauldin, John R. R. Leavitt

CENTER FOR MACHINE TRANSLATION
CARNEGIE MELLON UNIVERSITY
5000 FORBES AVENUE
PITTSBURGH, PA 15213

fuzzy@cmu.edu, jrrl+@cs.cmu.edu

August 4, 1994

ABSTRACT

This paper describes the Lycos and WebAnts project projects at the Center for Machine Translation (CMT) at Carnegie Mellon University. Lycos is an experiment in best-first-search within the space of documents formed by the Web. WebAnts is an experiment in distributing the information collection task to a number of cooperating processors. We also discuss directions for future agents that act as proxies for people who will soon be too busy to answer their email.

Introduction

The Center for Machine Translation at Carnegie Mellon University researches and develops software for automatic translation of text from one language into another. In order to translate text, we have a large technology base of natural language processing resources, algorithms, techniques and expertise.

We have also applied these natural language processing methods to the problems of information retrieval and organization. We are now focussing our attention on the Web as a textual resource of vast breadth and depth.

Lycos: Hunting WWW Information

The Lycos project at Carnegie Mellon is in the early stage, we have a Web explorer in operation, and our indexer will come on-line later this month. We will use the SCOUT indexer which has an HTTP gateway (a set Sample database of the Tipster corpus from Wall Street Journal is available intermittently from the Experimental SCOUT server).

Lycos is written in Perl, but uses a C program based on CERN's libwww to fetch URLs. It uses a random search, keeps its record of URLs visited in a Perl assoc list stored in DBM (thanks to Charlie Stross for the tip that Gnu DBM doesn't have arbitrary limits!). It searches HTTP, FTP, and GOPHER sites, ignoreing TELNET, MAILTO, and WAIS. Lycos uses a data reduction scheme to reduce the stored information about each document:

Title
Headings and Subheadings
100 most "weighty" words (using Tf*IDf weights [Salton & McGill 83])
First 20 lines
Size in bytes
Number of words

Lycos keeps a word frequency count as it runs...it has read over 25 million words. A list of the most frequent words found after searching 6.3 million words is available off the Lycos home page.

Lycos searches the entire Web, not just HTTP servers. So far, the breakdown of file types found is:

         142132 http
         102910 ftp
          84143 gopher
           4314 news
           1396 telnet
            379 mailto
            244 wais
             13 rlogin

Citation counting (number of "parents" by URL): this is the first 20 URLs sorted by number of documents that reference that URL. What I did not do was to count only references from different sites (I'm sure that 99% of the refs to http://gdbwww.gdb.orf/omim come from the Genome Database server itself).

1703 http://gdbwww.gdb.org/omim/
1578 http://cossack.cosmic.uga.edu/keywords.html
 692 ftp://ftp.network.com/IPSEC/rfcindex4.html
 421 ftp://ftp.network.com/IPSEC/rfcindex3.html
 322 ftp://ftp.network.com/IPSEC/rfcauthor.html
 319 ftp://ftp.network.com/IPSEC/rfcindex5.html
 234 ftp://ftp.network.com/IPSEC/rfcindex2.html
 202 ftp://ftp.network.com/IPSEC/rfcindex1.html
 177 http://info.cern.ch/hypertext/WWW/TheProject.html
 166 http://www.ncsa.uiuc.edu/SDG/Software/Mosaic/Docs/whats-new.html
 135 http://www.ncsa.uiuc.edu/SDG/Software/Mosaic/MetaIndex.html
 133 http://www.cs.columbia.edu/~radev/
 133 http://www.ncsa.uiuc.edu/SDG/Software/Mosaic/NCSAMosaicHome.html
 118 http://www.cs.colorado.edu/homes/mcbryan/public_html/bb/summary.html
 108 http://www.mcs.anl.gov/home/gropp/
 107 http://info.cern.ch/hypertext/DataSources/bySubject/Overview.html
 105 http://www.ncsa.uiuc.edu/SDG/Software/Mosaic/StartingPoints/NetworkStartingPoints.html
 101 http://www.ncsa.uiuc.edu/SDG/Software/Mosaic/Docs/help-about.html
 85 http://cui_www.unige.ch/w3catalog

The Lycos philosophy is to keep a finite model of the web that enables subsequent searches to proceed more rapidly. The idea is to prune the ``tree'' of documents and to represent the clipped ends with a summary of the documents found under that node. The 100 most important words lists from several documents can be combined to produce a list of the 100 most important words in the set of documents.

Alternative fixed representations of documents or document sets include the vector models such as Dumais at BellCore [Deerwester et al. 90, Dumais et al. 94] and Gallant & Caid at Hecht-Neilson Corporation [Gallant et al. 94]. The number 100 was chosen arbitarily, so we will need to investigate to find whether than number is too high or too low.

One perhaps unique aspect of Lycos is that it uses the text in the parent document to describe the unexplored links (currently, the highlighted text from each anchor is associated with the URL for that anchor). In this way, Lycos can retrieve documents that it has not even searched.

Once the exploration and retrieval functions are in place, we will turn to experiments in ``best-first-search,'' to be able to process on-demand queries by searching the webv in a knowledge-directed way.

WebAnts: Hunting in Packs

The WebAnts project, also in the early stage, is progressing with cooperation in mind. Like Lycos, this project makes a clear preference for explorer-based schemes over those that require cooperation from each information server [Flater & Yesha 93, Koster 94a, Koster 94b]. There are several motivating issues behind this project. First, information discovery on the web (including gopher-space and ftp-space) is now (and will remain) too large a task. Given that there are currently several thousand web and gopher servers and an order of magnitude more ftp servers, and that these numbers are growing, the scale is too great for the use of a single explorer agent to be effective.

Second, it is undesirable to rely on a single site for such services, as this would create a bottleneck. Creating a single explorer site does nothing to solve the fan-in, fan-out nature of the information discovery problem, but rather exacerbates it, making that one site a bottleneck for all users.

Finally, it is undesirable for multiple explorers to examine the same sites [Eichmann et al. 94, Koster 94c]. This situation would be worse than relying on a single information server, since it is certain that net traffic would be higher and it is uncertain that better results would be achieved. This is, unfortunately, almost what we have now. To search for information on a general topic, a user cannot rely on a single search engine, since none of them explore everything.

To address these issues, the WebAnts project is developing a cooperative explorer agent; that is, an explorer (or ant) which will share results with other ants and not duplicate their efforts. Another system that employs multiple agents is the WebCrawler [Pinkerton 94], although it is unclear as to how distributed these agents are. The WebAnts model could be used for either searching or indexing, although at the moment, we are concentrating on searching, until we have settled on a design for the index.

For searching, we expect this cooperation to be very effective, in that the different ants may be directed based on others' results. If one ant finds a document that satisfies the search criteria, it can share the references from that document with any other ants that are not currently exploring hits of their own. And as each ant explores a document, it lets other ants know, so that they do not have to examine the same document.

For indexing, this cooperation allows each indexer to conserve resources, and also to distribute the indexing load between different explorers. Each index server would provide all the information gathered by one of the ants during exploration. When querying, a user could restrict the query to the local ant or allow it to propagate to the entire colony. While this does not solve the fan-in, fan-out problem at the macro level, it does serve to reduce the bottleneck effect. This is a simpler version of the Übernet used by ALIBI [Flater & Yesha 93].

One issue that the WebAnts project does not yet entirely address, but which any indexer must eventually deal with, is that the size of an index will grow proportionally to the size of the web. The storage, retrieval, and distribution of information on the scale will no doubt prove a compelling challenge in coming years.

The Future of the Web

We categorize Web agents into two broad classes: those that ``pull'' information out, such as the WebCrawler and Lycos, and those that ``push'' information back out, of which the best example is fictional. In Cold as Ice, [Sheffield 92] describes ``faxes,'' which can be defined as expert systems that can oversee experiments, answer questions, and replace many of the functions of secretaries and assistants.

It is this second kind of agent that we hope to build in the future, building on our success with TinyMUD agents such as Julia [Mauldin 94]. Instead of looking for answers, a proxy agent might read newsgroups looking for questions relevant to its owner's areas of interest, might send email to people working on a problem that the its owner has several publications that might be relevant, and might sort, prioritize and even answer email.