CMU Lycos (tm)
Hunting WWW Information

You can also read these pages as multiple short documents in the Lycos Home Page.


Search

News

December 29, 1994 New catalog:
There is now a single large catalog containing 1,493,787 URLs found between Nov. 21 and Dec. 27 (including 216,008 documents actually retrieved). This catalog also incorporates the new Lycos URL Deletion function (no more pointers to mtv.com).
	216,008 documents fetched totaling 1,306,495,064 bytes
	1,277,779 unexplored URLs with descriptions
	458,129,769 bytes of Lycos summaries
	206,431,394 bytes of inverted index

You'll also notice that the score now indicates how many terms matched (if there was more than 1 term in your query), plus there are bonuses for adjacency of search terms.
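
As a rough illustration of that kind of score, here is a minimal sketch. The function name `score_hit`, the 0.1 bonus, and the idea that the whole-number part counts matched terms are all assumptions made for this example, not the actual Lycos code.

```python
def score_hit(query_terms, doc_words, adjacency_bonus=0.1):
    """Toy score: count of distinct query terms found in the document,
    plus a bonus for each consecutive pair of query terms that also
    appears side by side in the document. The bonus value is hypothetical."""
    words = [w.lower() for w in doc_words]
    terms = [t.lower() for t in query_terms]
    present = set(words)
    matched = sum(1 for t in set(terms) if t in present)
    bonus = 0.0
    for a, b in zip(terms, terms[1:]):
        # Look for the pair (a, b) at adjacent positions in the document.
        if any(x == a and y == b for x, y in zip(words, words[1:])):
            bonus += adjacency_bonus
    return matched + bonus
```

With this sketch, a document containing "wolf spider" as a phrase scores slightly higher for the query "wolf spider" than one that merely contains both words.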

December 18, 1994 New hardware:
Since my Pentiums haven't arrived yet, I have donated my own workstation to the cause as the Lycos3 server. I have made arrangements to borrow another Sparc (Lycos4), but it probably needs a new kernel to increase the maximum number of processes before it can be a Lycos server (I tried Friday, and it died by Saturday).

Four additional machines are on order (two P90s and two Sparc 5 clones). I have also obtained a beta copy of Netsite from Netscape Communications to evaluate its speed in comparison to NCSA httpd. I won't be able to unpack and install Netsite until Monday or Tuesday.

Lastly, if anyone knows about server software to gracefully distribute HTTP load among several servers, please send email to fuzzy@cmu.edu.

December 18, 1994 Network interruptions:
Note that my network connection may be flaky on Dec. 26 and 27. Here's the announcement from facilities:

Planned Network Upgrades:
-------------------------
In conjunction with the scheduled Cyert Hall Fire System test and the
university holidays, the Data Communications department will be upgrading
various core network components on Monday, December 26th and Tuesday, December
27th.

During these two days, starting at the beginning of Monday December 26th
(midnight) various network components will be worked on, replaced or upgraded.
In some cases, the network service will be down but start working for short
time periods while tests are performed and in other cases long term outages
will exist.

December 13, 1994 New catalog:
The large catalog now contains 1,056,523 documents found by Lycos between Nov 21 and Dec 11th (including 148,667 documents actually retrieved). This also represents the first catalog incorporating the new Lycos URL Deletion function (no more pointers to mtv.com).

	148,667 documents fetched totaling 887,792,616 bytes
	907,856 unexplored URLs with descriptions
	319,514,940 bytes of Lycos summaries
	175,426,295 bytes of inverted index

December 7, 1994 New catalog:
The small catalog is now a subset of the big catalog, containing documents that were retrieved by Lycos between Nov 21 and Dec 5th. It contains

	113,794 documents fetched totaling 675,696,928 bytes
	46,621 unexplored URLs for images or postscript files
	160,415 documents all together

December 7, 1994 New load limits:
To cope with the load, we've been forced to limit access to the larger catalog when the load average exceeds 10.0, and to reject new queries entirely when the load exceeds 15.0. We are adding new hardware and new servers soon. The Pentiums are scheduled to arrive Tuesday -- don't worry, Lycos/Pursuit only does one floating point divide per hit. :-)
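
The two thresholds can be pictured as a small admission check. This is only a sketch: the catalog names `"big"` and `"small"`, the function names, and the choice to fall back to the small catalog between 10.0 and 15.0 are assumptions for illustration, not a description of the real server scripts.

```python
import os

LIMIT_LOAD = 10.0    # above this, access to the larger catalog is limited
REJECT_LOAD = 15.0   # above this, new queries are rejected entirely

def admit(catalog, load):
    """Return 'reject', or the name of the catalog the query may use.
    Falling back to the small catalog is an assumed policy."""
    if load > REJECT_LOAD:
        return "reject"
    if load > LIMIT_LOAD and catalog == "big":
        return "small"
    return catalog

def current_load():
    # 1-minute load average; os.getloadavg() is only available on Unix.
    return os.getloadavg()[0]
```

A front end could call `admit(requested_catalog, current_load())` before starting each search.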

December 6, 1994 New catalog:
The big catalog now contains only 840,327 documents, but they were all collected between Nov 21 and Dec 5, so you should see fewer bad links.

	113,794 documents fetched totaling 675,696,928 bytes
	714,764 unexplored URLs with one or more descriptions
	251,648,743 bytes of Lycos summaries
	184,618,784 bytes of inverted index

December 2, 1994 Now running NCSA HTTPD:
In an effort to reduce the system load and improve system response time, we are trying the NCSA HTTPD 1.3 server on the Fuzine server and the Lycos1 server. The Lycos2 server is still running CERN HTTPD.

November 28, 1994 reorganization:
Given that Lycos is handling up to 30,000 requests a day, I have made a smaller catalog (481k recent URLs) the default, and made the big catalog (1.3 million URLs) the test database. For most people, the smaller catalog may be better, since it contains only URLs found in the last two months, and has fewer bad links in it.

More hardware is on the way...stay tuned to this channel.

It really is a fast indexer...it only seems slow because you're sharing with 34,999 other people...

November 25, 1994 New catalog:
The main catalog (June-Nov) is up to 1,284,907 URLs. This 1.3meg catalog will be available tomorrow, and includes:

	175,887 documents fetched totaling 1,081,826,971 bytes
	1,109,020 unexplored URLs with one or more descriptions
	368,328,875 bytes of Lycos summaries
	264,708,701 bytes of inverted index

November 8, 1994 Update:
The main catalog (June-Nov) is up to 999,461 URLs. This 999k catalog includes

	131,173 documents fetched totaling 831,633,976 bytes
	868,288 unexplored URLs with one or more descriptions
	276,290,984 bytes of Lycos summaries
	200,665,350 bytes of inverted index
The other good news is that we now have a second big disk, so both Lycos and Lycos2 servers have their own copies of the catalog. So the searches should run faster (for a while).

November 2, 1994 Update:
Okay, I give up. You win. You can run more searches in a day than I can find extra computers to run them.

Lycos ran on one computer for 4 months, on two computers for 2 months, and now you've overloaded the third computer in less than a week.

Okay, seriously. We're getting an additional disk (to improve the inverted file access times), and we've moved data around to reduce NFS file accesses needed to run searches on the big DB.

October 30, 1994 Update:
Because of the heavy demand for Lycos, I am now using 3 computers to provide HTTP service (note that CGI scripts have been re-enabled on Fuzine):

Before the addition of Lycos2 on Friday, Lycos1 died once with a full proc table and once with a full file table.

Other changes to reduce the load include raising the default match threshold from 0.20 to 0.40, reducing the default number of hits from 50 to 20, and the commissioning from a graphic artist of an even scarier spider picture for the logo.

October 27, 1994 Update:
The main catalog (June-Oct) is up to 862,858 URLs. This 862k catalog includes

	109,462 documents fetched totaling 699,070,847 bytes
	753,396 unexplored URLs with one or more descriptions
	235,741,898 bytes of Lycos summaries
	171,769,116 bytes of inverted index

October 26, 1994 Update:
To see the Lycos usage, check out these documents:

October 10, 1994 Update:
There is now a Forms-based Lycos search page that allows you to set the min-score, max-hits, and terse mode.

October 9, 1994 Update:
You can now request that Lycos explore a specific URL by using the Lycos URL Registry.

October 5, 1994 Update:
The test catalog (June-Oct) is up to 701,466 URLs, including all URLs from the production catalog (June-Sep). This 701k catalog includes

	 84,239 documents fetched totaling 531,276,671 bytes
	617,227 unexplored URLs with one or more descriptions
	180,980,745 bytes of Lycos summaries
	110,741,009 bytes of inverted index

September 20, 1994 Update:
I've merged all the Lycos search results and removed duplicate URLs (by name, not content), so the main Lycos search now covers 547,675 unique URLs.
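
Removing duplicates by name means comparing URL strings only, so two different URLs that point at the same content both survive. A minimal sketch of that merge follows; the function name `merge_by_name` is made up for this example.

```python
def merge_by_name(*result_lists):
    """Merge several URL lists, keeping the first occurrence of each URL.
    Only the URL string is compared, never the document content."""
    seen = set()
    merged = []
    for results in result_lists:
        for url in results:
            if url not in seen:
                seen.add(url)
                merged.append(url)
    return merged
```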

September 4, 1994 Update:
Lycos/Pursuit is now available for courageous beta testers. The Lycos beta test source is a compressed tar file.

Documentation is included, but is still minimal. The faint of heart may wish to wait a few days for better documentation. Users desiring new features should check the Lycos To Do List to see if that feature is already on the list.

August 26, 1994 Update:
Carnegie Mellon has dedicated a Sparcstation, lycos.cs.cmu.edu, to running Lycos searches. This machine was used for the ARPA Tipster phase I program, and has now been reassigned. Please note the new Lycos search engine URL and update your hotlists and web pages accordingly.

August 15, 1994 Update:
I've merged the June and August catalogs...so there may well be some duplicates in the test version of Lycos. The test catalog is 634,066 documents, 152.9 megabytes. I will be modifying the PURSUIT engine to weed out duplicates by default.

August 14, 1994 Update:
Current catalog is up to 173,000 documents, and 49 megabytes.

August 11, 1994 Update:
Lycos is searching the web again, and its current catalog is available here. So far, starting from scratch on August 7, it has found 4,784 HTTP servers, 18,687 documents (totalling 56 megabytes of text), and the names of 115,000 more documents. The new catalog is up to 37 meg.

Frequently Asked Questions

Adding and Deleting URLs

Usage statistics

Purpose

Lycos is a research program in providing information retrieval and discovery in the WWW, using a finite memory model of the web to guide intelligent, directed searches for specific information needs. Lycos currently implements basic information retrieval on rich abstracts of Web documents.

The next experiment is to add the ability to do best-first search starting with the finite document set to find specific topics. Until then, you can cast around using the search function.

Specifics

Lycos is in alpha test. The search engine runs on fuzine.vperson.com, and the web crawler runs on dallas.mt.cs.cmu.edu. The crawler was first run May 1, 1994, and fetches about 5000 documents a day when running.

Lycos's web crawler is written in PERL, with a C program that uses CERN's libwww library to fetch documents. Lycos will not fetch TELNET, MAILTO, NEWS, FILE, or WAIS type files (that leaves mostly HTTP, GOPHER and FTP files). It also ignores files that start with "/dev/tty" or end with these extensions: AU, AVI, BIN, DAT, DVI, EXE, FLI, GIF, GZ, HDF, HQX, JPEG, LHA, MAC, MPEG, PS, TAR, TGA, TIFF, UU, UUE, WAV, Z or ZIP.
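
The fetch rules above can be sketched as a single filter. The real crawler is PERL and C; this Python version is only an illustration, and the name `should_fetch`, the case-insensitive matching, and the reading of "start with /dev/tty" as a test on the URL path are assumptions.

```python
from urllib.parse import urlsplit

SKIPPED_SCHEMES = {"telnet", "mailto", "news", "file", "wais"}
SKIPPED_EXTENSIONS = {
    "au", "avi", "bin", "dat", "dvi", "exe", "fli", "gif", "gz", "hdf",
    "hqx", "jpeg", "lha", "mac", "mpeg", "ps", "tar", "tga", "tiff",
    "uu", "uue", "wav", "z", "zip",
}

def should_fetch(url):
    """Return True if the URL passes the scheme, path, and extension rules."""
    parts = urlsplit(url)
    if parts.scheme.lower() in SKIPPED_SCHEMES:
        return False
    if parts.path.startswith("/dev/tty"):
        return False
    # Compare the extension of the last path component, ignoring case.
    last = parts.path.rsplit("/", 1)[-1]
    if "." in last and last.rsplit(".", 1)[-1].lower() in SKIPPED_EXTENSIONS:
        return False
    return True
```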

Lycos's search engine, PURSUIT, is a C program that uses a disk-based inverted file retrieval system and a simple sum of weights to score documents. One unique feature is that PURSUIT scores words by how far into the document they appear. Thus hits in the title or first paragraph are scored higher. As soon as the bugs are squashed, the search engine will be made available to all for non-commercial use (send mail to fuzzy@cmu.edu if you would like to be a beta tester).
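
To make the position idea concrete, here is a minimal sketch of a sum-of-weights score in which early words count for more. PURSUIT itself is a C program and its actual weighting is not given here; the linear falloff from 1.0 to 0.5 and the names `position_weight` and `score` are assumptions chosen for the example.

```python
def position_weight(index, total):
    """Hypothetical weight: 1.0 for the first word of the document,
    falling linearly to 0.5 for the last word."""
    if total <= 1:
        return 1.0
    return 1.0 - 0.5 * index / (total - 1)

def score(query_terms, doc_words):
    """Simple sum of weights: each query term found in the document
    contributes the weight of its earliest occurrence."""
    words = [w.lower() for w in doc_words]
    total = 0.0
    for term in {t.lower() for t in query_terms}:
        if term in words:
            total += position_weight(words.index(term), len(words))
    return total
```

Under this sketch a term in the title (the first few words) contributes close to 1.0, while the same term at the very end of the document contributes 0.5.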

We might upgrade the search engine's language at some future point to implement more standard boolean operators. We will definitely add the spelling correction and phonetic and semantic match capabilities from the SCOUT project.

For each document fetched, Lycos keeps the title, headings, subheadings, and links, plus the 100 highest weighted words (using TF*IDF weighting) plus the first 20 lines. Lycos uses a random search to prevent bunching up accesses to any one server.
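
A minimal sketch of the word-selection part of such a summary follows. It uses the textbook form of the weight, term frequency times log(N / document frequency), which may differ from the exact formula Lycos uses. The names `top_words` and `abstract`, the `doc_freq` table, and the whitespace tokenizing are assumptions, and the title, headings, and links are left out.

```python
import math
from collections import Counter

def top_words(doc_words, doc_freq, num_docs, keep=100):
    """Return the `keep` highest weighted words of one document.
    doc_freq maps a word to the number of documents containing it."""
    tf = Counter(w.lower() for w in doc_words)

    def weight(word):
        # Words unseen in the collection are treated as appearing in one document.
        df = max(doc_freq.get(word, 0), 1)
        return tf[word] * math.log(num_docs / df)

    # Sort by descending weight, breaking ties alphabetically.
    return sorted(tf, key=lambda w: (-weight(w), w))[:keep]

def abstract(text, doc_freq, num_docs):
    """Keep the first 20 lines plus the top weighted words."""
    return {
        "first_lines": text.splitlines()[:20],
        "top_words": top_words(text.split(), doc_freq, num_docs),
    }
```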

Lycos now complies with the standard for robot exclusion to keep unwanted robots off WWW servers, and sets the USER-AGENT field to "Lycos".
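
For readers who want to see what a robot exclusion check looks like, here is a minimal sketch using Python's standard `urllib.robotparser` module. It is not the Lycos code (which is PERL and C); the helper name `allowed` and the sample robots.txt text in the usage note are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "Lycos"

def allowed(robots_txt, url, agent=USER_AGENT):
    """Return True if the given robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)
```

For example, given a robots.txt containing "User-agent: Lycos" followed by "Disallow: /private/", the check refuses any URL under /private/ on that server and permits the rest.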

Code for Fetching URLs and implementing Robot Exclusion

Results

Archives

Posts to the Robots Mailing list are archived here:

Posts to NetNews are archived here:

SIGNIDR 94 materials

Etymology

Lycos comes from Lycosidae, a cosmopolitan family of relatively large active ground spiders that catch their prey by pursuit, rather than in a web. They are noted for their running speed, and are especially active at night.

Related Information

Martijn Koster maintains a list of WWW Robots, and he also has available code to implement the robot exclusion standard.

The Robots mailing list is for discussion of issues related to automated Web searching programs.

Acknowledgments

The Lycos web crawler was derived from the Longlegs program written by John Leavitt and Eric Nyberg.

Some of the hardware used by Lycos was originally purchased with funds provided by ARPA for the Tipster phase I program, and Michael Mauldin is partially supported by ARPA's CS-TR project.

back to the Lycos Home Page.


Last updated 07-Dec-94