Lycos Project Description

Posted to robots@nexor.co.uk on June 6, 1994

The Lycos project at Carnegie Mellon is in its early stages. We have a Web explorer in operation, and our indexer will come on-line later this month. We will use the SCOUT indexer, which has an HTTP gateway (a sample database of the Tipster corpus from the Wall Street Journal is available intermittently from the Experimental SCOUT server).

Lycos is written in Perl, but uses a C program based on CERN's libwww to fetch URLs. It uses a random search and keeps its record of URLs visited in a Perl associative array stored in DBM (thanks to Charlie Stross for the tip that GNU DBM doesn't have arbitrary limits!). It searches HTTP, FTP, and GOPHER sites, ignoring TELNET, MAILTO, and WAIS. Lycos uses a data reduction scheme to reduce the stored information about each document.
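
As a rough illustration of the random search and visited-URL record described above, here is a minimal sketch in Python rather than Perl. The names is_searchable and pick_next are hypothetical, and a plain dict stands in for the DBM-backed table; none of this is taken from the Lycos code itself.

```python
import random
from urllib.parse import urlsplit

# Schemes the post says Lycos searches; TELNET, MAILTO, and WAIS are ignored.
SEARCHED_SCHEMES = {"http", "ftp", "gopher"}


def is_searchable(url):
    """Return True if the URL uses one of the searched schemes."""
    return urlsplit(url).scheme.lower() in SEARCHED_SCHEMES


def pick_next(frontier, visited, rng=random):
    """Remove and return a randomly chosen unvisited, searchable URL, or None.

    `frontier` is a list of candidate URLs. `visited` is any mapping keyed by
    URL (a dict here, standing in for the DBM-backed table; an assumption).
    """
    while frontier:
        i = rng.randrange(len(frontier))
        # Swap the chosen entry to the end so removing it is cheap.
        frontier[i], frontier[-1] = frontier[-1], frontier[i]
        url = frontier.pop()
        if is_searchable(url) and url not in visited:
            visited[url] = "1"  # record the visit so the URL is not fetched twice
            return url
    return None
```

A caller would fetch the returned URL, append any links found in it to the frontier, and repeat until pick_next returns None.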

Lycos keeps a word frequency count as it runs; so far it has read over 25 million words. A list of the most frequent words found after searching 6.3 million words is available off the Lycos home page.
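
A running word frequency count of this kind can be sketched as follows. The tokenizer (runs of ASCII letters, lowercased) and the name count_words are assumptions for illustration; the post does not say how Lycos splits text into words.

```python
import re
from collections import Counter

# Assumed tokenizer: a "word" is a run of ASCII letters, compared in lowercase.
WORD_RE = re.compile(r"[a-z]+")


def count_words(text, totals):
    """Add one document's words to the running totals; return how many were read."""
    words = WORD_RE.findall(text.lower())
    totals.update(words)
    return len(words)
```

Calling count_words on each fetched document with a shared Counter gives both the total number of words read and, through totals.most_common(k), the k most frequent words so far.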

So far, Lycos has run for less than a month.

Citation counting (number of "parents" by URL): these are the first 50 URLs sorted by the number of documents that reference each URL. What I did not do was count only references from different sites (I'm sure that 99% of the refs to http://gdbwww.gdb.org/omim come from the Genome Database server itself).


1703 http://gdbwww.gdb.org/omim/
1578 http://cossack.cosmic.uga.edu/keywords.html
 692 ftp://ftp.network.com/IPSEC/rfcindex4.html
 421 ftp://ftp.network.com/IPSEC/rfcindex3.html
 322 ftp://ftp.network.com/IPSEC/rfcauthor.html
 319 ftp://ftp.network.com/IPSEC/rfcindex5.html
 234 ftp://ftp.network.com/IPSEC/rfcindex2.html
 202 ftp://ftp.network.com/IPSEC/rfcindex1.html
 177 http://info.cern.ch/hypertext/WWW/TheProject.html
 166 http://www.ncsa.uiuc.edu/SDG/Software/Mosaic/Docs/whats-new.html
 135 http://www.ncsa.uiuc.edu/SDG/Software/Mosaic/MetaIndex.html
 133 http://www.cs.columbia.edu/~radev/
 133 http://www.ncsa.uiuc.edu/SDG/Software/Mosaic/NCSAMosaicHome.html
 118 http://www.cs.colorado.edu/homes/mcbryan/public_html/bb/summary.html
 108 http://www.mcs.anl.gov/home/gropp/
 107 http://info.cern.ch/hypertext/DataSources/bySubject/Overview.html
 105 http://www.ncsa.uiuc.edu/SDG/Software/Mosaic/StartingPoints/NetworkStartingPoints.html
 101 http://www.ncsa.uiuc.edu/SDG/Software/Mosaic/Docs/help-about.html
  85 http://cui_www.unige.ch/w3catalog
  84 http://wings.buffalo.edu/world
  82 http://sass577.endo.sandia.gov/SEACAS/CUBIT/Developers/
  80 http://cui_www.unige.ch/OSG/MultimediaInfo/mmsurvey/
  79 http://www.nta.no/telektronikk/4.93.dir/
  76 http://asp.esam.nwu.edu/chris/dce_prodlist.html
  76 http://hypatia.gsfc.nasa.gov/NASA_homepage.html
  76 http://info.cern.ch/hypertext/DataSources/WWW/Servers.html
  75 http://www.ncsa.uiuc.edu/demoweb/demo.html
  75 http://www.rtd.com/people/rawn/
  74 ftp://ftp.network.com/IPSEC/rfcindex0.html
  74 http://tns-www.lcs.mit.edu/cgi-bin/value-added/sports/register.sos.texas.gov/texreg/
  73 http://rs560.cl.msu.edu/weather/getmegif.html
  71 http://rs560.cl.msu.edu/weather/interactive.html
  70 http://rs560.cl.msu.edu/weather/textindex.html
  70 http://rs560.cl.msu.edu/~henrich/
  70 http://www.seas.upenn.edu/~mengwong/
  68 http://info.cern.ch/hypertext/DataSources/WWW/Geographical.html
  68 http://rs560.cl.msu.edu/weather/uscmp.gif
  66 http://rs560.cl.msu.edu/weather/uscmp.mpg
  66 http://www.cso.uiuc.edu/~kline/cvk.html
  65 ftp://cs.nott.ac.uk/pub/sat-images/
  65 http://rs560.cl.msu.edu/weather/goes7ir.mpg
  65 http://rs560.cl.msu.edu/weather/worldir.mpg
  65 http://www.hmc.edu/~irilyth/diplomacy/
  64 gopher://burrow.cl.msu.edu/00/news/weather/lan
  64 gopher://ssec.wisc.edu
  64 http://rs560.cl.msu.edu/weather/6panel.mpg
  64 http://rs560.cl.msu.edu/weather/d2.jpg
  64 http://rs560.cl.msu.edu/weather/gmsvis.mpg
  63 http://cui_www.unige.ch/meta-index.html
  63 http://rd13doc.cern.ch/public/doc/Rd13StatusReport.html
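
The difference between the counts listed above and counts restricted to different sites can be sketched like this. The function citation_counts and its input format (pairs of parent URL and referenced URL) are hypothetical, and treating the host name as the "site" is an assumption.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def citation_counts(links):
    """Count referencing documents per URL, in total and by distinct site.

    `links` is an iterable of (parent_url, child_url) pairs.
    Returns {child_url: (num_parent_documents, num_parent_sites)}.
    """
    parents = defaultdict(set)
    for parent, child in links:
        parents[child].add(parent)  # a set, so a repeated link counts once
    return {
        child: (len(docs), len({urlsplit(p).hostname for p in docs}))
        for child, docs in parents.items()
    }
```

Sorting on the first number of each pair gives a ranking like the list above; sorting on the second would discount a URL that is referenced many times from within its own server.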

The Lycos philosophy is to keep a finite model of the web that enables subsequent searches to proceed more rapidly. The idea is to prune the "tree" of documents and to represent the clipped ends with a summary of the documents found under that node. The lists of the 100 most important words from several documents can be combined to produce a list of the 100 most important words in the set of documents.
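
One simple way to combine such lists is sketched below. It assumes each document's list maps a word to a numeric importance weight and that weights are summed across documents; the post does not say how importance is measured or merged, so both choices, and the name combine_top_words, are assumptions.

```python
from collections import Counter


def combine_top_words(word_lists, n=100):
    """Merge per-document {word: weight} lists into one top-n list for the set."""
    combined = Counter()
    for words in word_lists:
        combined.update(words)  # adds the weights of words seen in several documents
    return dict(combined.most_common(n))
```

Because the result has the same shape as its inputs, the summary of a pruned node can itself be merged again further up the tree, which keeps the stored model at a fixed size per node.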

Alternative fixed representations of documents or document sets include vector models such as those of Dumais at BellCore and Gallant & Caid at Hecht-Nielsen Corp. The number 100 was chosen arbitrarily, so we will need to investigate whether that number is too high or too low.

I also subscribe to the dream of a single format and indexing scheme that each server runs on its own data, but given the current state of the community I believe it is premature to settle on a single format. Various information retrieval schemes depend on wildly different kinds of data. We should try out more ideas and evaluate them carefully and only then should we try to settle on a single format.

Resources

I have agreed to share my code with research and educational users. Should I make it a requirement that recipients of the code post to this mailing list so we can keep track of its proliferation? I have already promised code to two people.

I will make lists, statistics, reports, and the index server accessible off the Lycos home page as they become available.

--Michael L. Mauldin
  Carnegie Mellon University
  Center for Machine Translation
  5000 Forbes Avenue
  Pittsburgh, PA 15213-3890
  fuzzy@cmu.edu


Last updated 10-Jun-94