Jan-Feb 1997, p. 8-11
Lycos: Design choices in an Internet search service
Michael L. Mauldin, Lycos, Inc.
One of the enabling technologies of the World Wide Web, along with browsers, domain name servers, and hypertext markup language, is the search engine. Although the Web contains over 100 million pages of information, those millions of pages are useless if you cannot find the pages you need.
All major Web search engines operate the same way: a gathering program explores the hyperlinked documents of the Web, foraging for Web pages to index. These pages are stockpiled by storing them in some kind of database or repository. Finally, a retrieval program takes a user query and creates a list of links to Web documents matching the words, phrases, or concepts in the query. Although the retrieval program itself is correctly called a search engine, by popular usage the term now means a database combined with a retrieval program. For example, the Lycos search engine comprises the Lycos Catalog of the Internet and the Pursuit retrieval program.
This essay describes the Lycos system for collecting, storing, and retrieving information about pages on the Web. After outlining the history and precursors of the Lycos system, I'll discuss some of the design choices we made in building this Web indexer and touch briefly on the economic issues involved in working with very large retrieval systems.
The Archie directory service was the first Internet search engine, according to the above definition (1). It combined a script-based data gatherer to fetch site listings of anonymous FTP files with a regular expression matcher for retrieving file names matching a user query. Begun in 1990, Archie was a well-established Internet tool by 1992.
Archie's success as an index for FTP files inspired others to create an index for Gopher menus, which are a means of sharing text files across the Internet. The University of Nevada System Computing Services group developed the Veronica system in 1993.
The first gathering program for the Web was the World Wide Web Wanderer (2). Initially, this program only counted Web servers on the Internet, but I later added a retrieval program called Wandex for searching the Internet's output. The Wanderer ran twice a year, from June 1993 through January 1996.
In October 1993, Aliweb (Archie-Like Indexing of the Web), an analog of the Archie system, was developed (3). Aliweb requires the Web server to prebuild an index of all Web pages on that server. The server must also register with Aliweb. Aliweb provides a simple Perl-based retrieval program to search the collected indices. As implemented, the site index in turn relies on author-supplied descriptions of the items being indexed.
In December 1993, three more Internet search engines became available: JumpStation (4), World Wide Web Worm (5), and RBSE Spider (Repository-Based Software Engineering) (6). JumpStation used a Web robot, or spider, to gather information about the title and headers from Web pages, and used a simple exhaustive match to retrieve pages. The WWW Worm indexed only the titles and URLs, using a regular expression search of the whole inverted index. These systems both produced a listing of matching documents in database order, not ranked by any kind of relevance to the user's query.
By contrast, the RBSE spider and the WebCrawler (which went on line April 20, 1994) were the first Internet search engines to implement ranked-relevance retrieval (7). By use of a vector-space retrieval model, WebCrawler listed documents most closely related to the user's query first, making the system much more usable and reducing the time spent analyzing the searcher's output (8). The RBSE spider also supported relevance feedback using WAIS (Wide-Area Indexing Service).
Work on the Lycos spider began in May 1994, using John Leavitt's LongLegs program as a starting point. (Lycos was named for the wolf spider, Lycosidae lycosa, which catches its prey by pursuit, rather than in a web.) In July 1994, I added the Pursuit retrieval engine to allow user searching of the Lycos catalog (although Pursuit was written from scratch for the Lycos project, it was based on experience gained from the ARPA Tipster Text Program in dealing with retrieval and text processing in very large text databases (9)). On July 20, 1994, Lycos went public with a catalog of 54,000 documents. In addition to providing ranked relevance retrieval, Lycos provided prefix matching and word proximity bonuses. But Lycos' main difference was the sheer size of its catalog: by August 1994, Lycos had identified 394,000 documents; by January 1995, the catalog had reached 1.5 million documents; and by November 1996, Lycos had indexed over 60 million documents--more than any other Web search engine. In October 1994, Lycos ranked first on Netscape's list of search engines by finding the most hits on the word "surf."
All Web spiders use essentially the same algorithm to locate documents on the Web:
1. Create a queue of pages to be explored, with at least one Web page in the queue.
2. Choose a page from the queue to explore.
3. Fetch the page chosen in Step 2 and extract all the links to other pages. Add any unexplored page links to the exploration queue.
4. Process the page fetched in Step 3 to extract information such as title, headers, key words, or other information. Store this information in a database.
5. Go to Step 2.
The various spiders differ mainly in how they choose which page to explore next (Step 2) and what information they keep about each page (Step 4). The choice of a starting point is largely irrelevant, because the Web is highly connected and good starting points such as the major Web directories (Yahoo, The Whole Internet Catalog, A2Z, and major universities) are found after only a few plies of search.
Most major search services let users register their own pages with that service. For Lycos, this means that user-submitted pages are added to the initial exploration queue in Step 1. For the Lycos service, user submissions represent a tiny percentage of the total number of pages indexed. By comparison, the early Yahoo system relied heavily on user submission (10). Yahoo as and is a tree-structured hierarchy of descriptions, rather than a search service as defined earlier; the tree was designed by the authors of Yahoo. Individual users were encouraged to find a node in the tree that best fits their Web page and to submit descriptions of their Web pages to be included at that point in the hierarchy.
In the original Lycos implementation (which was used with only minor revisions from May 1994 until February 1996), the spider was written in Perl and used an associative array to hold the queue. Perl allows such arrays to be kept on disk using the Unix DBM database manager, which affords very efficient access of very large queues. Lycos now uses a proprietary spider program written entirely in C.
Choosing where to explore next
One major difference between spiders is how they go about searching. Some early spiders explored the most recently found page first, resulting in a depth-first search of the Web. The load on the target servers thus was high, actually causing some servers to crash (11).
A better choice is to search the least recently found document, resulting in breadth-first search; WebCrawler and the WWW Worm used this technique (the RBSE spider supported both depth-first and breadth-first searches (6)). This reduces the problem of load concentration and increases the coverage of small databases, but it favors smaller Web servers over large megasites such as universities and Internet service providers.
Lycos uses a best-first search based on the popularity heuristic. We define popularity as the number of external Web servers with at least one link to a Web page. External means that only links from one server to a page on another count as evidence of popularity; by only counting one link per server, Lycos limits the ability of authors to manipulate popularity data. This approach tends to find home pages rather than subsidiary pages, so the Lycos catalog is biased toward more popular (and we therefore hope) more useful Web pages.
Choosing what information to keep
In 1994, Web search services were small projects and short of resources. Therefore, most kept only titles or titles and headers from each page (the WWW Worm's default search was only titles; JumpStation kept only titles and headers). Today, most competing services keep the whole text of pages (WebCrawler started out this way). But keeping the whole document increases disk storage costs and raises the issue of copyright violations.
Aliweb relied on author-generated descriptions of documents, as did the early Yahoo service. While these texts were clearly intended for indexing, and therefore hopefully free from copyright issues, manual abstracting is very time-consuming, and coverage of such systems was a problem.
Lycos pioneered the use of automated abstracts. By using standard information-retrieval statistical methods, Lycos identifies the 100 most "weighty" terms (the 100 keywords most related to the document being indexed). Combining these weighty terms with the titles, header text, and an excerpt of the first 20 lines, or 10% of the document, Lycos creates an abstract that is about one-fourth the size of the original document. Lycos can display these abstracts along with the list of links during the retrieval process, allowing users to quickly determine which of the matched documents they wish to examine. Because the Lycos Catalog comprises abstracts and not copies of documents, it can also be licensed or sold to third parties, generating additional funding sources for the service.
The information found by foraging must be stored on disk in some kind of database. Although the various services (including Lycos) closely guard the details of these databases, they share certain characteristics. They must
- allow efficient insertion of new documents (some are optimized for bulk insertion of many documents at once);
- allow efficient update of documents records, when spiders revisit pages already described in the database (again, this step might be optimized for bulk update);
- allow for random read access to any particular document record (required during the retrieval phase); and
- be efficient in terms of disk space, because they must store tens of millions of documents.
One design decision is whether to store the exploration queue in the same data store as the stockpile of exploration results. The early Lycos system used a DBM database to store the exploration queue and a separate sequential file format to store the Lycos Catalog. This system used a sort/merge technique to update the catalog weekly with new Web page information and to replace updated records during the merge.
Once the spiders have collected their information and it has been safely stored in a database, the final step is to process queries from individual users and to return lists of links to matching documents.
Three basic schemes have been used: Boolean keyword query, regular expression matching, and vector space (statistical) retrieval. Aliweb, JumpStation, and the WWW Worm used regular expression matches that scanned the entire database for each query. The results of this approach, however, were presented in database order, not relevance order; also, scanning the entire database is very expensive computationally, and therefore very slow.
The RBSE Spider and WebCrawler, along with Lycos and other commercial Internet search engines, use inverted file indexing. An inverted file (sometimes called a postings file) is a list of all occurrences of words or tokens in the text database. For each unique word in the database, the search engine keeps a list of documents containing that word, sometimes with a list of all the positions where the word occurs (8). For example, to search for information about NASA's Viking probes, the query "Viking Mars Lander" can be processed by intersecting the list of documents under "Viking," the list for "Mars," and the list for "Lander." Although thousands of entries might be under each word, this computation is much cheaper than scanning millions of documents, checking directly for each of these words. By keeping the positions of each occurrence along with the word, the search engine can efficiently check the proximity or adjacency of words during retrieval, searching for documents where "Viking," "Mars," and "Lander" appear close to each other.
The Lycos Pursuit retrieval program uses an inverted file containing for each word occurrence its document identifier, its position in that document, and a flag indicating whether the word occurred in the title, the body, or a header, or was one of the 100 most relevant words for that document. During query processing, the retrieval program ranks the matching documents according to a relevance score based on these features:
- How many of the query terms are contained in the document?
- How frequently are the query terms used in the document?
- How close are the query terms to each other in the document (proximity)?
- Where do the query terms occur in the document (position)?
- How closely do the query terms match the individual words?
For example, a document containing all three words, "Viking," "Mars," and "Lander," ranks above a document containing only "Mars" and "Lander." By using the HTML document structure, Pursuit can also rank documents with those terms in the title above documents that might only contain a passing reference to the Mars lander. Also, word-prefix matching is allowed, so "Viking" matches "Vikings." Similarly, "glow" matches "glows" better than "glowing."
One key to success for Internet search services is to keep the costs of serving queries below the revenue generated by serving advertisements. The Lycos service, like many other Internet search services, generates income mainly through advertising, both targeted and generic. For targeted advertising, the service checks the user's query terms against a list of keywords that have been sold at a premium to the advertisers. For example, if the user queries for "cars," an automobile advertisement can be shown. Depending on the level of targeting, every 1,000 advertisements shown bring in a gross revenue of $20 to $50. Therefore, the cost of performing a query cannot exceed 1 to 2 cents. Keeping this cost low is the secret to a successful service. It is much easier to build a spider to gather 60 million documents than it is to build a retrieval service that can query those documents millions of times a day for less a penny a query. The keys here are a fast database implementation and efficient algorithms for query processing.
Between the summer of 1993 and the end of 1994, an explosion of Internet search services changed the way people use the Internet. Before then, users could only browse pages, clicking on likely looking hypertext links: a process that David Eichmann likened to starting in one city and driving around until you found a road that leads to your destination. Since 1994, most people start their Internet surfing at either a search service or a directory service, and these have become the roadmaps to the Internet.
Changed, too, is the way we view information retrieval. In 1993, IR was a field for researchers, librarians, and information specialists. Today, IR is a consumer commodity, with millions of users every day. The key to that transformation was that services like Lycos were made freely available to the public, with simple interfaces and large databases that let their users rapidly find Web pages relevant to their areas of interest.
Thanks to Brian Pinkerton, Jonathan Fletcher, Martijn Koster, Oliver McBryan, and Jerry Yang for providing background material on their respective services, and to the Braustübl group for their invaluable contributions to the field of Web indexing.
1. A. Emtage and P. Deutsch, "Archie--An Electronic Directory Service for the Internet." Proc. Usenix Winter 1992 Tech. Conf., Usenix Assoc., Berkeley, Calif., 1992, pp. 93-110.
2. M. Gray, "World Wide Web Wanderer," http://www.mit.edu/people/mkgray/net/.
3. M. Koster, "Aliweb--Archie-Like Indexing in the Web," Proc. First Int'l World Wide Web Conf., Elsevier Science, Amsterdam, 1994, pp. 175-182.
4. J. Fletcher, "The Jumpstation," 1993, personal communication. For a good description of the Jumpstation by Steve Bennett of Information Innovation, see http://www.euro.net/innovation/Web_Word_Base/TWW1-html/WebRob1.htm.
5. O. McBryan, "GENVL and WWWW: Tools for Taming the Web," Proc. Second Int'l WWW Conf., Elsevier Science, 1994.
6. D. Eichmann, "The RBSE Spider--Balancing Effective Search against Web Load," Proc. First Int'l World Wide Web Conf., Elsevier Science, 1994.
7. B. Pinkerton, "Finding What People Want: Experiences with the WebCrawler," Proc. Second Int'l World Wide Web Conf., Elsevier Science, 1994.
8. G. Salton, The Transformation, Analysis, and Retrieval of Information by Computer, Addison-Wesley, Reading, Mass., 1989.
9. P. Jacobs et al., "Description of the Shogun System Used for MUC-5," Proc. Fifth Message Understanding Conference, Advanced Research Projects Administration, McLean, Va., 1993, pp. 109-120.
10. "History! Yahoo," http://www.yahoo.com/docs/pr/history.html.
11. M. Koster, "Guidelines for Robots Writers," http://info.Webcrawler.com/mak/projects/robots/guidelines.html.
Michael L. Mauldin is a research computer scientist at the Center for Machine Translation at Carnegie Mellon University and chief scientist at Lycos, Inc., a public company with offices in Boston, New York, and Pittsburgh that is commercializing his Lycos Catalog of the Internet. He is also a principal investigator on the Informedia Digital Video Library project, and was a coprincipal investigator on ARPA's Tipster data-extraction project. He received his PhD and MS in computer science from Carnegie Mellon, and his BA in mathematical science and computer science from Rice University. Contact him at Lycos, Inc., One Oxford Centre, 301 Grant St., 525, Pittsburgh, PA 15219-1408; firstname.lastname@example.org; http://fuzine.vperson.com/mlm/.
Send comments and questions about IEEE Expert Online to Anne Humphreys and Managing Editor Dick Price, email@example.com and firstname.lastname@example.org.
Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution must be obtained from the IEEE. By choosing to view this document, you agree to all provisions of the copyright laws protecting it.
Copyright (c) Institute of Electrical and Electronics Engineers, Inc. All rights reserved.