Lycos: Specifics



Specifics

Lycos is in beta test. The search engine runs on two SPARCstations, lycos1.cs.cmu.edu and lycos2.cs.cmu.edu, and the catalog builder (sometimes called a Web Crawler) runs on dallas.mt.cs.cmu.edu and clipper.mt.cs.cmu.edu. The dallas machine fetches about 5,000 documents a day, and clipper (a SPARCstation 20) fetches about 20,000 documents a day.

Lycos's web explorer is written in Perl, with a C program that uses CERN's libwww library to fetch documents. Lycos will not fetch TELNET, MAILTO, NEWS, FILE, or WAIS URLs (that leaves mostly HTTP, GOPHER and FTP). It also ignores files whose names start with "/dev/tty" or end with one of these extensions: AU, AVI, BIN, DAT, DVI, EXE, FLI, GIF, GZ, HDF, HQX, JPEG, LHA, MAC, MPEG, PS, TAR, TGA, TIFF, UU, UUE, WAV, Z or ZIP.
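
The filtering rules above can be sketched in a few lines. This is only an illustration in Python, not the Perl and C code Lycos runs. The function name should_fetch is hypothetical, and the sketch assumes that matching is case-insensitive and that the "/dev/tty" rule applies to the path part of the URL; the text above does not say either way.

```python
from urllib.parse import urlparse

# Schemes the text says Lycos will not fetch.
EXCLUDED_SCHEMES = {"telnet", "mailto", "news", "file", "wais"}

# Extensions the text says Lycos ignores (lowercased for comparison).
EXCLUDED_EXTENSIONS = {
    "au", "avi", "bin", "dat", "dvi", "exe", "fli", "gif", "gz", "hdf",
    "hqx", "jpeg", "lha", "mac", "mpeg", "ps", "tar", "tga", "tiff",
    "uu", "uue", "wav", "z", "zip",
}


def should_fetch(url: str) -> bool:
    """Return True if the URL passes the scheme and file-name filters."""
    parts = urlparse(url)
    if parts.scheme.lower() in EXCLUDED_SCHEMES:
        return False
    # Assumption: the "/dev/tty" rule is applied to the URL path.
    if parts.path.startswith("/dev/tty"):
        return False
    # Only the final extension is checked, so "archive.tar.Z" is judged by "Z".
    last_segment = parts.path.rsplit("/", 1)[-1]
    if "." in last_segment:
        extension = last_segment.rsplit(".", 1)[-1].lower()
        if extension in EXCLUDED_EXTENSIONS:
            return False
    return True
```

With these rules, should_fetch("http://example.com/index.html") is accepted, while a MAILTO address or a URL ending in ".ps" is rejected.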

Lycos's search engine, PURSUIT, is a C program that uses a disk-based inverted file retrieval system and a simple sum of weights to score documents. One unique feature is that PURSUIT scores words by how far into the document they appear. Thus hits in the title or first paragraph are scored higher. Intrepid beta testers may see and try the code for themselves (Lycos beta test).
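
To make the scoring idea concrete, here is a small sketch of an inverted file with position-sensitive weights. It is not PURSUIT's code: the index lives in memory rather than on disk, the function names are hypothetical, and the decay formula in position_weight is an assumption chosen only to show that earlier hits count for more.

```python
from collections import defaultdict


def build_index(docs):
    """Map each word to a list of (doc_id, word_position) postings."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            index[word].append((doc_id, position))
    return index


def position_weight(position, scale=10.0):
    """Hypothetical decay: earlier words weigh more. Not PURSUIT's formula."""
    return 1.0 / (1.0 + position / scale)


def score(index, query):
    """Sum the position weights of every query-word hit, per document."""
    totals = defaultdict(float)
    for word in query.lower().split():
        for doc_id, position in index.get(word, []):
            totals[doc_id] += position_weight(position)
    # Highest total first.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

A document that mentions a query word in its first few words will outrank one that mentions it only near the end, which is the behavior described above for titles and first paragraphs.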

We plan to upgrade the search engine's language at some future point to implement more standard boolean operators. We will definitely add the spelling correction and phonetic and semantic match capabilities from the SCOUT project.

For each document fetched, Lycos keeps the title, headings, subheadings, and links, plus the 100 highest weighted words (using TF*IDF weighting) and the first 20 lines. Lycos explores the Web in random order to avoid bunching up accesses to any one server.
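
As an illustration of how the highest weighted words might be picked, the sketch below ranks a document's words by term frequency times inverse document frequency and keeps the top 100. The exact TF*IDF variant Lycos uses is not stated here, so the plain tf * log(N / df) form, the function name top_weighted_words, and its parameters are all assumptions.

```python
import math
from collections import Counter


def top_weighted_words(doc_words, doc_freq, num_docs, limit=100):
    """Return up to `limit` (word, weight) pairs, highest weight first.

    doc_words: list of words in one document
    doc_freq:  mapping from word to the number of documents containing it
    num_docs:  total number of documents in the collection
    """
    counts = Counter(doc_words)
    weights = {}
    for word, term_freq in counts.items():
        # Assumption: unseen words are treated as appearing in one document.
        df = doc_freq.get(word, 1)
        weights[word] = term_freq * math.log(num_docs / df)
    # Sort by descending weight, breaking ties alphabetically.
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]
```

A word that appears in every document gets a weight of zero under this formula, so common words such as "the" fall to the bottom of the ranking.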

Lycos now complies with the standard for robot exclusion, which lets administrators keep unwanted robots off their WWW servers, and sets the USER-AGENT field to "Lycos".
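
For readers unfamiliar with robot exclusion, the check amounts to reading a server's /robots.txt file and refusing any URL it disallows for the robot's name. The sketch below uses the parser from Python's standard library purely as an illustration; it is not the Lycos code, and the helper name allowed is hypothetical.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "Lycos"  # the value the text says Lycos sends


def allowed(robots_txt: str, url: str, agent: str = USER_AGENT) -> bool:
    """Check a URL against the already-downloaded text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)
```

Given a robots.txt containing "User-agent: Lycos" followed by "Disallow: /private/", a URL under /private/ is refused and other URLs on the same server are permitted.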

Code for Fetching URLs and implementing Robot Exclusion



Last updated 28-Nov-94