View Full Version : Inverted Word Barrels
Nacho
03-30-2005, 02:43 PM
This thread was split from the Keyword Density Analysis Nonsense (http://forums.searchenginewatch.com/showthread.php?p=42416#post42416) where Mel begins with this post...
I actually need to make a correction - I was completely wrong about KW Density not being used at all.
KW Density is used (as a colleague of mine pointed out) for retrieving on-topic documents. In other words, it's a great, simple way to see if a page is on the topic of the user's query - from there the search engine can rank it against all the other 'on-topic' documents...
But I believe that at least one engine does not use this method to select pages for a ranking pool of on-topic documents. Google have the pages presorted into inverted word barrels and only need to select as many from the appropriate barrels as they want.
lokust
04-09-2005, 07:00 AM
Hi Mel,
Could you elaborate on these "inverted word barrels" that Google uses.
Thanks.
You can read the original paper here (http://www-db.stanford.edu/~backrub/google.html) but in a nutshell Google puts information regarding one particular word into a seperate file (barrel) so if someone searches for fruit all it needs to do is go to the fruit barrel and use the first n results as its ranking pool.
An interesting thing about these barrels is that there are two sets of them, one which only contains words found in the page title and anchor text and another which contains all the hits on a word in a document. Google always goes to the page title barrel first and only goes to the second barrel for ranking candidates if there are not enough results in the page title barrels.
Now if we only knew the details of the ordering system they use for the barrels.
Now if we only knew the details of the ordering system they use for the barrels.
ranking data is kept elsewhere, the 'barrels' are organized in a way which allows efficient search, insertion and deletion.
-rsk
Sorry Risk I do not believe that is correct, but am always willing to learn.
I am basing my post on Googles Anatomy of a Large Scale Hypertextual Web Search Engine paper which states that the process for evaluation of a search query is:
# Parse the query.
# Convert words into wordIDs.
# Seek to the start of the doclist in the short barrel for every word.
# Scan through the doclists until there is a document that matches all the search terms.
# Compute the rank of that document for the query.
# If we are in the short barrels and at the end of any doclist, seek to the start of the doclist in the full barrel for every word and go to step 4.
# If we are not at the end of any doclist go to step 4.
Sort the documents that have matched by rank and return the top k.
While I agree that the algorithm to rank the results from the ranking pool is external, all the data needed for input into the algo (with the exception of PR) is available in the inverted barrels IMO.
Since Google does not evaluate all the results in a barrel (stopping when they have a reasonable number of results for the ranking pool) there is a different preference if you want to rank well :
listed at the top N of the order in the short barrels
listed at the top N of the order in the long barrels.
As an example if Google are still using 40,000 results as a sufficiently large pool and preferring the results from the short barrels if you are not listed above 40,000 in the short barrels you simply don't stand a chance of being considered for ranking.
PhilC
04-10-2005, 04:55 AM
I think that risk probably meant the ranking algorithm (program).
Do you think that the data in the barrels is sorted in some way other than wordID order, Mel?
It seems only logical that they would have to have some sort of ordering which at least puts those with the highest hits at the top of the list, otherwise if they are ordered only by docID then the new pages on popular terms might be beyond the threshold number of how many pages are selected for the ranking pool.
The Anatomy paper makes an oblique reference to this problem:
An important issue is in what order the docID's should appear in the doclist. One simple solution is to store them sorted by docID. This allows for quick merging of different doclists for multiple word queries. Another option is to store them sorted by a ranking of the occurrence of the word in each document. This makes answering one word queries trivial and makes it likely that the answers to multiple word queries are near the start. However, merging is much more difficult. Also, this makes development much more difficult in that a change to the ranking function requires a rebuild of the index. We chose a compromise between these options, keeping two sets of inverted barrels -- one set for hit lists which include title or anchor hits and another set for all hit lists. This way, we check the first set of barrels first and if there are not enough matches within those barrels we check the larger ones.
but the probem here is that they do not tell us which compromise they used. At first glance it appears that the use of two sets of inverted barrels is the compromise, but that cannot be right because it still leaves the ordering system as an open question.
jfbruns
04-11-2005, 11:47 AM
Hi,
I have also often asked myself if Google sorts the word lists by page rank or by docId. The latter is easier to intersect if you have an AND query with more than one word.
I found the following document (http://www.computer.org/micro/mi2003/m2022.pdf) where some Google engineers explain the architecture. On p. 2 right column under the picture it says:
"The index servers then determine
a set of relevant documents by intersecting the
hit lists of the individual query words, and
they compute a relevance score for each document.
This relevance score determines the
order of results on the output page."
My understanding is that they fetch a set of documents (must not be all matching ones) and then they calculate the relevance for each of it. At least the passage says that the documents are not presorted in ranking order.
I do not know if I am right and do not claim that, but that document could give a clue.
Another thing is that the Page / Brin document about Google may not be up to date anymore.
jfbruns
Its good to realise that in IR systems all data is passed through barrels to be sorted and filtered before going through to the next and so on until the system cut-off point is reached. Some call it drilling-down like ask.