In December, my In Search Of The Relevancy Figure article called for search engines to get beyond the hype of who is biggest or freshest and develop a commonly-accepted means of measuring actual relevancy. In it, I wrote of third-party tests that had been commissioned in the past to get at this. Now, the first such third-party test in ages has been done.
VeriTest was commissioned by Inktomi to conduct the test. It found that in raw scoring (where URL position wasn’t taken into account), Inktomi came out tops — but just barely. Inktomi earned 1630 points, with Google just behind at 1597. That’s so close that I’d essentially consider the services tied. Behind the leaders came, surprisingly to me, WiseNut at 1277, followed by Teoma at 1275, AltaVista at 1222 and AllTheWeb at 1173, another big surprise for coming in last.
Critics will immediately assume that since Inktomi commissioned the test, it would naturally be in Inktomi’s favor. Google itself suggested as much along with others in a recent WebmasterWorld thread.
“It helps a lot to pick the ground rules, what queries to throw out,” Google posted to the thread, pointing out that past tests conducted by VeriTest on behalf of AltaVista and Ask Jeeves found those services had ranked tops. Not mentioned by Google was that a past test also conducted by the firm on behalf of Google in September 2000 found — wait for it — Google to be the most relevant.
So is it really just whoever pays for the test gets the best ratings? Not exactly. About two years ago, I moderated a panel involving VeriTest (then known as eTesting Labs). It turned out that some search engines had funded tests where they were NOT found to be the best. In these cases, they didn’t allow the results to be publicly released.
Absolutely, one needs to be critical of any report funded by only one company. All the more reason why I’d hope the search engine industry as a whole would get behind a common set of tests. Let them all pick the “ground rules” and agree that results, favorable or not, will be published for everyone.
As for this particular test, either I or Chris Sherman will likely do a detailed review of it in the near future. But in the meantime, here are a few more details.
There were 100 queries randomly selected from a set of 1 million real ones drawn from Inktomi’s search logs. The top 10 editorial results from each of the search engines tested were reviewed; sponsored listings were not counted. Three judges then reviewed each of the URLs to make a yes-or-no determination of whether it was “acceptable” relevancy-wise in relation to the query terms. A raw score, as well as two weighted scores based on the position of URLs, was then calculated.
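To make the raw-versus-weighted distinction concrete, here is a minimal sketch of how such scoring might work. The actual VeriTest formulas weren’t disclosed in the detail I’ve seen, so the position weight used below (rank 1 worth 10 points, down to 1 point for rank 10) is purely an assumed scheme for illustration, not VeriTest’s method.

```python
def raw_score(judgments):
    """judgments: list of 10 booleans, one per result position,
    True if the URL was judged 'acceptable'. Raw scoring ignores
    where in the top 10 the acceptable URLs appeared."""
    return sum(1 for acceptable in judgments if acceptable)

def weighted_score(judgments):
    """Hypothetical position weighting: an acceptable URL at rank 1
    earns 10 points, rank 2 earns 9, ... rank 10 earns 1. This weight
    is an assumption for illustration only."""
    return sum(11 - rank
               for rank, acceptable in enumerate(judgments, start=1)
               if acceptable)

# Example: a query where the results at ranks 1, 2 and 5 were
# judged acceptable and the rest were not.
judgments = [True, True, False, False, True,
             False, False, False, False, False]
print(raw_score(judgments))       # 3
print(weighted_score(judgments))  # 10 + 9 + 6 = 25
```

The point of the two numbers: an engine that buries its three acceptable results at ranks 8–10 would tie on the raw score but fall well behind on any position-weighted score.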
From my preliminary review, the main criticism of this method is the “binary” choice of saying whether a document is relevant. Consider a search for “cars.” Any number of pages about cars in some way could be considered relevant. However, how do you know if these are the best documents of the entire set of those possibly relevant? A binary test doesn’t measure this.
To be fair, the test did try to address some nuances of quality. Judges were told to mark as “acceptable” only pages they considered, using their own judgment, to be “excellent” or “good,” while pages deemed merely “fair” or “poor” were to be rejected. Judges were also told to ask themselves questions such as, “If a friend of mine was interested in the subject of this query, would I email them this URL?,” among others.
Search Headlines
NOTE: Article links often change. In case of a bad link, use the publication’s search facility, which most have, and search for the headline.