Search Engine Watch
SEO News

 

  #1  
07-31-2004
flash-seo
Google Database Technology

I've enjoyed the discussions here very much. As I read them, I often come across an explanation that makes me wonder whether a large part of what happens is down to the database technology that Google uses. I rarely see much about it, so a few weeks ago I started to investigate what they use as a database. I had some hope it would be one of the big-name databases, modified for their purposes, but I also had a feeling it wouldn't be, because I read somewhere (I think in XML Journal) that Google used text files somehow and not a database per se. I'm still not sure exactly what the actual situation is, except that from what I can find out Google built their own database completely in house.

Furthermore, it seems to be a "flat file" database. I found a (relatively) large amount of information here ...

http://www.smartmoney.com/techmarket...story=20000706

... and I searched for a while through database forums and the like, but this was about the most information I could come up with. If anyone knows of any resources on Google's database technology, I'd sure like to get the URLs. It is also worth noting that the article is from the year 2000.

Regardless, it seems to me that certain principles apply to most databases and that Google must use them, flat file database or not. For example, they must cache the results of queries and that sort of thing. I guess I'll eventually start searching the patents that Google has applied for in relation to their database, but I'm sort of hoping someone else already knows and can tell me about them. At the end of the day we are really just making queries to a database, so it stands to reason that getting a better understanding of what is going on means learning about the actual database they use and what they do with it.

I could be wrong, but it seems obvious to me that Google caches results for specific queries. Maybe Google somehow finds it economical to search their entire database for each query, but if it were me and I was confident that there would be more than one search a day for "chicago real estate", then I'd just run the query once and have the result stay in the cache for some length of time.
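To make the caching idea concrete, here is a minimal sketch in Python. It only illustrates the general technique of keeping a query's results for a fixed time; nothing here is known to be how Google does it. The names QueryCache, ttl_seconds and run_query are made up for the example.

```python
import time


class QueryCache:
    """Keep query results for a fixed time-to-live (TTL), in seconds.

    Hypothetical illustration only; not based on any known Google design.
    """

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock        # injectable so the expiry logic is easy to test
        self.entries = {}         # query -> (expiry_time, results)
        self.backend_calls = 0    # how often the real search had to run

    def search(self, query, run_query):
        now = self.clock()
        hit = self.entries.get(query)
        if hit is not None and hit[0] > now:
            return hit[1]         # still fresh: serve from the cache
        # Missing or expired: run the expensive search and remember the result.
        results = run_query(query)
        self.backend_calls += 1
        self.entries[query] = (now + self.ttl, results)
        return results
```

With a TTL of one day, a second search for "chicago real estate" within that day would be answered from the cache, and the full search would only run again once the entry expires.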

I can think of other cases where a database-based answer might explain why Google behaves as it does, but my knowledge of flat file databases is really zero. I'm reasonably good with Oracle concepts and the features that relational databases have. I couldn't tell you whether there is such a thing as an "indexed" anything in a flat file database, and the whole concept of columns, rows and tables is probably invalid. I really don't have a clue. Why would Google go for a "flat file" database? What are the advantages, especially ones that might apply to how Google does what it does so well, if well is the right word?

I don't really want to speculate, since I know nothing about "flat file" databases. I'm interested in how caching works on them. Hopefully it works the same way as on a "normal" database, which can be set up so that the result of a query stays in the cache as long as there is a certain demand for it. If caching works the same as on the type of database I'm familiar with, then I'd speculate that updates are simply a refresh of the cache for a given query, and that the timing of that refresh may depend on the number of searches for that item. That is just a quick, unconsidered theory, but it was only meant as an example of how certain questions about the whys of Google's behavior might be answered by examining the technology end of things, specifically the database technology.

Greg.
  #2  
08-04-2004
runarb
Relational databases are for structured information; web pages are not structured.

With a SQL database you will probably end up doing a lot of actual searching, and you will need a more powerful setup than a search engine based on an inverted index, where you only read from the index.


For instance if you have these 4 documents:

1: "i love you"
2: "god is love"
3: "love is blind"
4: "blind justice"

SQL-based search engines typically keep the documents in a table and use regular expressions to search it. This is slow because every query has to go through all the documents.
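As a rough illustration of that full-scan approach, here is a short Python sketch over the four example documents. It uses a plain dictionary in place of a SQL table, which is a simplification, and the function name scan_search is made up for the example.

```python
import re

# The four example documents, keyed by document id.
documents = {
    1: "i love you",
    2: "god is love",
    3: "love is blind",
    4: "blind justice",
}


def scan_search(word, docs):
    """Return the ids of documents containing `word` as a whole word.

    Every document is tested, so the cost grows with the collection size.
    """
    pattern = re.compile(r"\b" + re.escape(word) + r"\b")
    return [doc_id for doc_id, text in docs.items() if pattern.search(text)]
```

Here scan_search("love", documents) returns [1, 2, 3], but only after the pattern has been run against all four documents.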

A more efficient method is to have an inverted index, which lists every word in the documents along with the documents it occurs in, like this:
Code:
blind          3,4
god            2
is             2,3
justice        4
love           1,2,3
i              1
you            1
To find which of those documents contain the word "love", you can now read the index and see that it is documents 1, 2 and 3.
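The same index can be built and queried in a few lines of Python. This is only a sketch of the idea, using an in-memory dictionary; the function names build_index, lookup and lookup_all are made up for the example, and a real search engine would keep the index on disk in a more compact form.

```python
from collections import defaultdict

# The four example documents, keyed by document id.
documents = {
    1: "i love you",
    2: "god is love",
    3: "love is blind",
    4: "blind justice",
}


def build_index(docs):
    """Map each word to the sorted list of document ids it occurs in."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        # set() so a word repeated within one document is listed only once
        for word in set(docs[doc_id].split()):
            index[word].append(doc_id)
    return dict(index)


def lookup(index, word):
    """Return the document ids for one word; no documents are scanned."""
    return index.get(word, [])


def lookup_all(index, words):
    """Return the ids of documents that contain every word in `words`."""
    postings = [set(index.get(word, [])) for word in words]
    if not postings:
        return []
    return sorted(set.intersection(*postings))
```

With index = build_index(documents), lookup(index, "love") returns [1, 2, 3], and lookup_all(index, ["love", "is"]) returns [2, 3]. Multi-word queries become intersections of the per-word lists.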

From the document IDs 1, 2 and 3 you can easily look up the info about those pages in a flat file database: multiply the document ID by the size of a document record (all records have the same fixed size) to get the record's position in the file. This is easy and fast.
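A small Python sketch of that lookup follows. The record layout (a 4-byte ID plus a 60-byte URL field) and the example.com URLs are assumptions made up for the example, and IDs start at 0 here so that the offset is exactly ID times record size. With IDs starting at 1 you would either leave slot 0 unused or subtract 1 first.

```python
import io
import struct

# Hypothetical fixed-length record: 4-byte unsigned id + 60-byte URL field.
# struct pads a short URL with zero bytes, so every record is 64 bytes.
RECORD = struct.Struct("<I60s")


def write_records(f, urls):
    """Write one fixed-size record per URL; the id is the position in the list."""
    for doc_id, url in enumerate(urls):
        f.write(RECORD.pack(doc_id, url.encode("utf-8")))


def read_record(f, doc_id):
    """Jump straight to a record: offset = doc_id * record size."""
    f.seek(doc_id * RECORD.size)
    raw = f.read(RECORD.size)
    if len(raw) < RECORD.size:
        raise IndexError("no record with id %d" % doc_id)
    stored_id, url = RECORD.unpack(raw)
    return stored_id, url.rstrip(b"\0").decode("utf-8")


# Usage with an in-memory file; a real flat file would be opened in binary mode.
flat_file = io.BytesIO()
write_records(flat_file, [
    "http://example.com/a",
    "http://example.com/b",
    "http://example.com/c",
])
```

Reading record 2 is a single seek to byte 128 and one 64-byte read, no matter how many records the file holds.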


The Google founders have written a paper about this: http://www-db.stanford.edu/~backrub/google.html