iapain
12-19-2004, 04:59 PM
Google grew out of the DLI1 project at Stanford University. Stanford is still working on its DLI2 project, which goes beyond crawling into mining, clustering algorithms, and search techniques.
The DLI2 project is named WebBase: http://www-diglib.stanford.edu/~testbed/doc2/WebBase/
You can also download the source code of the PITA crawler and search server. Both are written in GNU C/C++ and have interfaces in PHP and Perl.
ftp://db.stanford.edu:/pub/digital_library/
Details of the experiments with WebBase are available at http://www-diglib.stanford.edu/~testbed/doc2/WebBase/webbase-pages.html
The architecture of WebBase is shown below (C) 2004 Stanford Database Group
http://www-diglib.stanford.edu/~testbed/doc2/WebBase/arch.gif
But the interesting thing is the research carried out during this project. "Parallelism and Distributed Crawling" is one of my favorite research papers. http://www-diglib.stanford.edu/~testbed/doc2/WebBase/pubs.html
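To give a rough feel for what distributed crawling involves, here is a minimal sketch of one common way to split work between crawler processes: hash the host name of each URL so that every page of a site lands on the same crawler. This is my own illustration, not code from WebBase or the paper. The function names `assign_crawler` and `partition` and the choice of MD5 are assumptions made just for this example.

```python
import hashlib
from urllib.parse import urlsplit


def assign_crawler(url: str, num_crawlers: int) -> int:
    """Map a URL to a crawler index by hashing its host name.

    Hypothetical helper: all URLs from one host go to the same crawler,
    so per-site politeness can be enforced locally by that crawler.
    """
    host = (urlsplit(url).hostname or "").lower()
    digest = hashlib.md5(host.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer, then reduce it.
    return int.from_bytes(digest[:8], "big") % num_crawlers


def partition(urls, num_crawlers):
    """Split a list of URLs into one queue per crawler."""
    queues = [[] for _ in range(num_crawlers)]
    for url in urls:
        queues[assign_crawler(url, num_crawlers)].append(url)
    return queues


if __name__ == "__main__":
    seeds = [
        "http://example.com/a",
        "http://example.com/b",
        "http://example.org/index.html",
    ]
    for index, queue in enumerate(partition(seeds, 4)):
        print(index, queue)
```

With this scheme a crawler that discovers a link belonging to another crawler's partition has to either drop it or forward it, and how to handle that is one of the design questions a distributed crawler has to settle.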
Unfortunately, WebBase offers no help regarding ranking. Ranking is the most important part of any search engine; that's why Google is still on top.
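Since WebBase gives you nothing here, this is a minimal sketch of link-based ranking in the style of PageRank, the algorithm Google is known for. It is only an illustration on a tiny in-memory graph, not Google's production ranking and not part of WebBase. The function name `pagerank`, the damping factor of 0.85, and the sample graph are my own assumptions, and the sketch assumes every link target also appears as a key in the graph.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    links: dict mapping each page to the list of pages it links to.
    Assumes every link target is itself a key of the dict.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    for page, score in sorted(pagerank(graph).items()):
        print(page, round(score, 3))
```

On a real crawl the link graph will not fit in a Python dict, so a large-scale engine would need a disk-based or distributed version of the same iteration.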
Stanford WebBase is an ideal place to start if you are dreaming of building your own large-scale search engine.
-Deepak
--------------