View Full Version : Lets Back to Google's Home


iapain
12-19-2004, 04:59 PM
Google grew out of the DLI1 project that was developed at Stanford University. Stanford is still working on its DLI2 project, which extends beyond crawling, i.e. to mining, clustering algorithms and search techniques.
The DLI2 project is named WebBase: http://www-diglib.stanford.edu/~testbed/doc2/WebBase/

You can also download the source code of the PITA crawler and search server; both are written in GNU C/C++ and have interfaces in PHP and Perl.
ftp://db.stanford.edu:/pub/digital_library/

Details of the experiments with WebBase are available at http://www-diglib.stanford.edu/~testbed/doc2/WebBase/webbase-pages.html

The architecture of WebBase is shown below (C) 2004 Stanford Database Group
http://www-diglib.stanford.edu/~testbed/doc2/WebBase/arch.gif

But the interesting thing is the research carried out during this project. The paper on parallelism and distributed crawling is one of my favorite research papers. http://www-diglib.stanford.edu/~testbed/doc2/WebBase/pubs.html
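
To give a feel for what distributed crawling involves, here is a minimal sketch of one common idea: split the URL space among crawler processes by hashing the host name, so every page of a site lands on the same crawler. This is only an illustration and not taken from the WebBase source; the function name assign_crawler and the choice of MD5 are my own assumptions.

```python
import hashlib
from urllib.parse import urlsplit


def assign_crawler(url: str, num_crawlers: int) -> int:
    """Return the index of the crawler process responsible for this URL.

    Hypothetical helper: hashes the host name so that all URLs of one
    site go to the same crawler (keeps per-site politeness in one place).
    """
    host = (urlsplit(url).hostname or "").lower()
    digest = hashlib.md5(host.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer bucket key.
    return int.from_bytes(digest[:8], "big") % num_crawlers
```

With this scheme a crawler that finds a link to a host it does not own has to either drop it or pass it to the owning crawler, which is the trade-off such designs have to deal with.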

Unfortunately, WebBase offers no help regarding ranking. Ranking is the most important part of any search engine; that's why Google is still on top.

Stanford WebBase is an ideal place to start if you are dreaming of building your own large-scale search engine.

-Deepak
--------------

orion
12-21-2004, 11:29 AM
Hi, Deepak

All true and well documented elsewhere.

Orion

iapain
12-21-2004, 12:12 PM
Hi,
Yes, WebBase is a great project in information retrieval and delivery. I hope it adds something practical beyond the basic analogy of a large-scale search engine.

Visualize GoogleBot vs. Inktomi Slurp vs. MSN Bot.
Stanford PITA (WebVac) is something like MSN Bot, designed just to crawl the data, while GoogleBot and Inktomi have some extended features like learnable and focused crawling. Inktomi uses focused crawling. GoogleBot is loaded with advanced learnable capabilities besides focused crawling.
See: http://citeseer.ist.psu.edu/cache/papers/cs/26913/http:zSzzSzpindex.ku.ac.thzSzfile_researchzSzLearnable_Spider_ISCIT2002.pdf/angkawattanawit02learnable.pdf

This would explain why Google and Inktomi made fewer hits than MSN.
Learnable and focused crawling is also a very important feature of a good crawler.
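
A rough sketch of the focused crawling idea, to show why such a crawler needs fewer hits: the frontier is a priority queue, and links found on on-topic pages are fetched first. This is not how GoogleBot or Inktomi actually work (that is not public); the relevance measure, the fetch callback and all names here are assumptions made up for illustration.

```python
import heapq


def relevance(text, topic_terms):
    """Fraction of words in the page text that are topic terms (toy measure)."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(1 for w in words if w in topic_terms) / len(words)


def focused_crawl(seed, fetch, topic_terms, max_pages=10):
    """Visit pages best-first; fetch(url) must return (text, list_of_links)."""
    frontier = [(-1.0, seed)]  # min-heap, so scores are stored negated
    seen = {seed}
    visited = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        text, links = fetch(url)
        score = relevance(text, topic_terms)
        visited.append((url, score))
        for link in links:
            if link not in seen:
                seen.add(link)
                # A link inherits the relevance of the page it was found on.
                heapq.heappush(frontier, (-score, link))
    return visited
```

A plain crawler would use a FIFO queue here instead of the heap and fetch everything in discovery order. A learnable crawler would go one step further and update the relevance model from earlier crawls.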

If you look at the source code of Stanford's PITA, it has some features such as avoiding mirrors of the same website.
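
I have not reproduced PITA's own mirror logic here. Just to illustrate the general idea, one simple way to spot mirrors is to fingerprint the page content and skip any URL whose content was already seen under another URL. The class and function names below are hypothetical, and real mirror detection works at site level and tolerates small differences, which this sketch does not.

```python
import hashlib


def content_fingerprint(html: str) -> str:
    """Hash of the page with case and whitespace differences normalized away."""
    normalized = " ".join(html.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


class MirrorFilter:
    """Remembers the first URL seen for each content fingerprint."""

    def __init__(self):
        self._first_url = {}

    def is_mirror(self, url: str, html: str) -> bool:
        fp = content_fingerprint(html)
        # setdefault stores this URL only if the fingerprint is new.
        original = self._first_url.setdefault(fp, url)
        return original != url
```

The crawler would call is_mirror() after each download and stop following links from pages that come back as mirrors.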

-Deepak
----------

orion
12-24-2004, 03:42 PM
Hi,
Yes, WebBase is a great project in information retrieval and delivery. I hope it adds something practical beyond the basic analogy of a large-scale search engine.

Well put, Deepak.

Here is a paper describing the WebBase architecture (http://newdbpubs.stanford.edu:8090/pub/showDoc.Fulltext?lang=en&doc=1999-26&format=pdf&compression=&name=1999-26.pdf) in detail.



Orion

iapain
12-24-2004, 04:33 PM
Hi...Great!!!!
I don't know how I skipped that paper... really an important one, and it is written by one of the great professors of SF, Mr. Hector.
I'll comment on it after studying it in depth.

Thanks! ORION!
Merry Christmas!
-Deepak
------------

orion
12-24-2004, 04:50 PM
It is also very briefly discussed/mentioned in these other threads:

Page ID section
Is A Trailing / On A Directory Seen As A Different File By Google? (http://forums.searchenginewatch.com/showthread.php?p=28626#post28626)

Storage Manager section
When Does Google Really Index a Page? (http://forums.searchenginewatch.com/showthread.php?p=28627#post28627)

Merry Christmas!

Orion