
The Search Engine Relevancy Challenge


rustybrick
05-03-2005, 08:47 AM
The project is live at The Search Engine Relevancy Challenge (http://www.rustybrick.com/search-engine-challenge.php).

Please read the instructions, and please do not use this engine the way you would a rank checker. It is important to use it for an "information search" and not for "rank checking", as you might imagine.

I am off to Toronto SES in a few hours. I will check back here for comments and feedback.

Qal
05-03-2005, 09:37 AM
Tried 3+ keywords and I'm sure Yahoo would be the deserving winner! Expect a colossal uproar over this nice little tool. :)

St0n3y
05-03-2005, 12:33 PM
The fact that results are numbered and vertically aligned might tilt the results. There is a built-in assumption (intended or not) that the higher up a result is, the more relevant it is. I would recommend a grid layout, say three rows of three.

rustybrick
05-03-2005, 10:07 PM
Can you explain in greater detail? I am a bit tired from the drive to Toronto.

St0n3y
05-04-2005, 12:53 AM
I think people will have a natural tendency to rate the #1 listing as more relevant than the #7 listing. Perhaps unconsciously, because we are trained to believe that the more relevant results are at the top, even though in this case that may well not be true.

My thought is that if you display the results differently than we are used to, the mind won't make such assumptions. Display them in a table, three across and then stacked down, for a total of 9 or 12 on the page. No numbers, just bullets for each result. That should help eliminate any bias of "#1 is obviously more relevant because it's #1."
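Something along these lines is what I'm picturing (a quick Python sketch of the idea only, not code from the actual site; the pooled result list is a made-up placeholder):

import random

def build_rating_grid(results, columns=3):
    # Shuffle the pooled results so no slot implies a rank, then chunk
    # them into unnumbered rows of `columns` items for display.
    shuffled = results[:]
    random.shuffle(shuffled)
    return [shuffled[i:i + columns] for i in range(0, len(shuffled), columns)]

# Example: nine pooled results, shown three across with plain bullets.
pool = ["result-%d" % n for n in range(1, 10)]
for row in build_rating_grid(pool):
    print("  ".join("* " + item for item in row))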

Make sense?

Qal
05-04-2005, 12:58 AM
Yeah, that sounds right. For all three of my searches, the top result was the most relevant, just as usual.

rustybrick
05-04-2005, 07:50 AM
I thought people would be judging the 1st result more carefully. Meaning: should this result really be the #1 result? Is it the most relevant result possible out of all the web pages out there?

I guess that wasn't clear.

St0n3y
05-04-2005, 10:59 AM
But I assume that the search is pulling #1 listings randomly from the top 4 engines mentioned. IMO, by numbering them as you do, you are in fact suggesting that whichever listing is in the number one position SHOULD be more relevant than all the others. For the purpose of this test, I don't think such assumptions (intended or not) should be made. You want to make as level a playing field as possible.

In the real Coke v. Pepsi challenge, they would make sure that no additional preference was given to either in the testing phase. If they, say, slightly elevated one sample, or changed the container a sample was in, either of those things could and would skew the results, even if they were done randomly to both products.

I think that in order to get the best results, there needs to be a presentation that does not make any conscious or unconscious suggestions.

rustybrick
05-04-2005, 12:59 PM
OK, I'll try to have that changed soon. I am in Toronto now, so it might take some time.

rustybrick
05-06-2005, 01:23 PM
I have posted some early results at http://www.seroundtable.com/archives/001900.html

Yahoo! is currently in the lead!

Chris Boggs
05-06-2005, 02:37 PM
(copied from my post in the Roundtable Blog)

Very cool! I did about 5 searches, and found that "search engine optimization" has the most 5s, in my opinion. I guess that is good. I am surprised at the dramatically lower number of 2s, 3s, and 4s, since I used a fair number of them myself. I rated some sites as 1s or 2s because they were the second result from the same site, which has always bothered me. Any site that required a login for more information received a 1 from me. I found a lack of relevant results for the term "diamond pricing guide," with only about three fitting the bill and the rest simply some dealer's page. It is my opinion that terms including "guide," "review," or "comparisons" should lead to objective sites. Interesting that MSN led the pack in 1s and came in last in 5s... I guess they do have some work to do... thanks again Barry!

Mikkel deMib Svendsen
05-06-2005, 03:11 PM
Thanks Rusty! Someone had to do this in public sooner or later. It has been done for years in "secret" by a few of the major portals, and I've seen the trended results. Very interesting. Good that we finally get a public version.

As Danny has pointed out before: If the engines won't come up with a good metric for relevancy and start making it public someone else will - and you did, Rusty. Thanks for that. Maybe that will kick some sense into the engines.

rustybrick
05-06-2005, 04:42 PM
I will post a page with the live results shortly.

I just want to clean up the graphs a bit.

Any other feedback you want to see from this data?

I have query terms and IP addresses. I am thinking about adding unique ratings (based on IP) and unique queries. Plus some cool flash graphs.
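For the unique counts, I mean something like this (a minimal Python sketch, not my production code; the row field names are placeholders I made up):

def unique_ratings(rows):
    # Count only the first rating per (IP, query, engine), so one visitor
    # hammering the same search can't tilt an engine's average.
    first_scores = {}
    for row in rows:
        key = (row["ip"], row["query"], row["engine"])
        first_scores.setdefault(key, row["score"])
    return first_scores

def unique_queries(rows):
    # Distinct queries, normalized for case and stray whitespace.
    return set(row["query"].strip().lower() for row in rows)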

Chris Boggs
05-06-2005, 04:56 PM
I will post a page with the live results shortly.

I just want to clean up the graphs a bit.

Any other feedback you want to see from this data?

I have query terms and IP addresses. I am thinking about adding unique ratings (based on IP) and unique queries. Plus some cool flash graphs.

Maybe a ranking based on the number of words in the queries... such as "Google performs best for one-word queries," etc.? No bother if it's too much work... I know it's probably too small a sample to rate this accurately.
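Something like this would do the bucketing (a rough Python sketch on my part; I'm assuming each rating row carries 'query', 'engine' and 'score' fields):

from collections import defaultdict

def scores_by_query_length(rows):
    # Bucket ratings by (word count of the query, engine) and average them,
    # e.g. to see which engine wins on one-word vs. three-word queries.
    buckets = defaultdict(list)
    for row in rows:
        n_words = len(row["query"].split())
        buckets[(n_words, row["engine"])].append(row["score"])
    return dict((key, sum(scores) / float(len(scores)))
                for key, scores in buckets.items())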

rustybrick
05-06-2005, 05:48 PM
Keep the suggestions coming... when we get more data, I can plot it.

Qal
05-07-2005, 01:43 PM
The 'Early Results' of this test show Yahoo to be the most relevant, as I expected. I always knew Yahoo was better than Google, but if you included the descriptions in the test, Google would take the crown. Google's descriptions are just perfect.

I believe Yahoo will be the ultimate winner, followed by MSN, in the final results. Let's see... :)

rustybrick
05-08-2005, 09:42 AM
This test does not support any illegal betting. :D

Qal
05-08-2005, 10:18 AM
This test does not support any illegal betting. :D

LOL! That's my strong assumption, nothing more! :p

orion
05-11-2005, 03:42 PM
In my opinion, relevancy has a lot to do with perception.

1. Which content is relevant according to the user's perception?

2. Which content is relevant according to the scoring functions used by a machine (an IR system or search engine)?

3. Which documents, scored and already prequalified as relevant by a search engine's algorithm, are actually relevant according to the user's perception and to the query that was used?

I believe rustybrick's experiment has valid merits, but it is an attempt at answering Q#3 more than Q#1 or Q#2, since the starting material of the experiment has already been scored by the search engine. So any outcome of the experiment is influenced by the documents presented to the user.

These experiments have been discussed extensively in the IR literature (see Modern Information Retrieval, Chapter 3). This other thread expands on issues related to Q#3 >>> The relevance of "relevance" (http://forums.searchenginewatch.com/showthread.php?p=46058#post46058)


Orion

rustybrick
05-12-2005, 05:25 PM
Yahoo! is looking hotter each day... I guess Yahoo! has more employees than the other three (hmmm, Microsoft?).

New results at RustySearch Hits 5,000 Rated Searches (http://www.seroundtable.com/archives/001928.html).

Qal
05-12-2005, 10:27 PM
I always believed Yahoo was the best. However, as I mentioned earlier, if you count the descriptions (which isn't feasible in this test), Google would lead by a superior margin. It's a subjective as well as a relative issue. :)

Anyway, I never thought Ask would oust MSN. To be honest, I last used it two months back; I'll HAVE to check it out now. ;)

orion
05-13-2005, 12:53 PM
I'm not sure why some in this thread keep saying "MSN or Ask or Yahoo is better or looks hotter than this or that engine" when this test does not measure system relevancy, but rather user perception of which system appears to produce more relevant results. Two different things.

As mentioned before, these types of tests are pretty much standard, in terms of precision vs. recall curves and EM measures. These profile curves often describe an inverted logistic behavior and produce a more realistic picture than mere absolute numbers. My two cents. (A toy precision/recall example is sketched below.)

Orion
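For those unfamiliar, precision and recall at a cutoff k are simple to compute; here is a toy Python sketch (the document IDs are invented purely for illustration):

def precision_recall_at_k(relevant, retrieved, k):
    # Of the top-k retrieved documents, what share is relevant (precision),
    # and what share of all relevant documents did we find (recall)?
    top_k = retrieved[:k]
    hits = sum(1 for doc in top_k if doc in relevant)
    precision = hits / float(k)
    recall = hits / float(len(relevant)) if relevant else 0.0
    return precision, recall

# Toy numbers: 3 of 10 known-relevant docs show up in the top 5.
relevant = set(["d01", "d02", "d03", "d04", "d05",
                "d06", "d07", "d08", "d09", "d10"])
retrieved = ["d01", "x1", "d02", "x2", "d03"]
print(precision_recall_at_k(relevant, retrieved, 5))  # (0.6, 0.3)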

Mikkel deMib Svendsen
05-13-2005, 12:59 PM
Orion, I think it is because you have basically misunderstood the whole idea of this test. It has nothing to do with science, IR or semantics - it's all about the only thing that really counts to users: how well they like what they get. From that user perspective, the science behind it is completely irrelevant. As a user, I don't care if Google does this, if Yahoo has that patent, if MSN does so-and-so, or if there are good arguments for or against any way of building a good search engine. All I am interested in is what I get - the way I see it.

This is not about science - it is about marketing, usability and demographics. This is exactly what many of us have been asking for for many years, and I believe I am not the only one who welcomes it. Don't criticize it for what it is not :)

orion
05-13-2005, 01:10 PM
Orion, I think it is because you have basically misunderstood the whole idea of this test. It has nothing to do with science, IR or semantics - it's all about the only thing that really counts to users: how well they like what they get.

1. Simply not true, Mikkel. These types of tests are pretty much standard procedures.

2. Second, I don't criticize the test, and I pretty much welcome these types of tests. I do criticize, in good faith, the comments of some of you.

Cheers

Orion

rustybrick
05-13-2005, 01:16 PM
All we are testing is which search engine users feel is the most relevant. Nothing more, nothing less. :p

Mikkel deMib Svendsen
05-13-2005, 01:18 PM
Simply not true, Mikkel. These types of tests are pretty much standard procedures.

I don't know why you want to take that route. I guess it's my turn then to say, "No, it's not true, Orion" - but I just don't see how that can benefit anyone, so I don't :)

As far as I can see, you are still looking at this test through the wrong set of glasses. People's subjective "feelings" about what they like the most are exactly what we want to know. I am not interested in knowing whether that is indeed correct or not, but only in what users actually feel is most relevant to them.

Other kinds of tests are fine, and we need them. I am not saying the kind of tests you do are not valuable. They sure are. But they are not the same as this.

Mikkel deMib Svendsen
05-13-2005, 01:21 PM
All we are testing is which search engine users feel is the most relevant. Nothing more, nothing less.

Exactly! I don't know why I needed so many words to explain that :) I guess that's what makes you a writer and not me (at least in English, hehe)

rustybrick
05-13-2005, 01:25 PM
Exactly! I don't know why I needed so many words to explain that :) I guess that's what makes you a writer and not me (at least in English, hehe)

(1) I believe Orion's concern is that people think this test defines which engine is the "best engine." So he is right in pointing out that this test does not define the "best," or in all cases the most "relevant," search engine. I'll say it again: this test only shows which search engine (of the 4) is the most relevant, as judged by the individuals rating the search engines. So it's basically which search engine you and I feel is the most relevant.

(2) I like to classify my writing as "blogging". I am far from being a writer or editor, like Danny.

orion
05-13-2005, 01:30 PM
Rusty and Mikkel

This is not about "I'm right, you are wrong". Again, these types of tests (user perception of relevancy tests from pre-scored machine results) are well documented. I'll be happy to provide you with some reference material.



Orion

andrewgoodman
05-17-2005, 12:21 PM
I don't know why you want to take that route. I guess it's my turn then to say, "No, it's not true, Orion" - but I just don't see how that can benefit anyone, so I don't :)

As far as I can see, you are still looking at this test through the wrong set of glasses. People's subjective "feelings" about what they like the most are exactly what we want to know. I am not interested in knowing whether that is indeed correct or not, but only in what users actually feel is most relevant to them.

Other kinds of tests are fine, and we need them. I am not saying the kind of tests you do are not valuable. They sure are. But they are not the same as this.

Allow me to unpack that slightly and zero in on a single phrase in your post, Mikkel: "people's subjective feelings."

Which people?

And on which types of queries?

Last night I tried to find a search result for a small technology firm from Western Canada that did some laughable high-tech demo, which turned out to be an ordinary high-definition TV rather than the new streaming technology it claimed to be. It was some kind of securities fraud case. Anyway, I must have performed 40-50 searches and came up empty. So, on that search, I was dissatisfied with all the search engines.

On my search for the band "The The," I was satisfied with Yahoo and dissatisfied with Google, because Yahoo put the band in the #1 spot and Google didn't have it at all, due to the "the" problem.

But that's very personal and subjective. Is it *really* that interesting what I felt? At the very least you would need a very large sample size to get to the point where all of those highly subjective, highly granular experiences started to make sense.

As we know, probably all the main SE's are running these search quality tests as well. They'd be crazy not to.

Running a large-scale test with a bunch of ordinary users is probably somewhat interesting. What excites us all so much about search, though, is its granularity. Those in the industry, probably for good reasons, are less interested in Zeitgeist-type common-denominator numbers than in ways of improving how well the major SE's help us find hard-to-find info. And also in whole new ways of structuring information. Users probably appreciate being prompted with news results or shopping results (invisible tabs), and I think it would be interesting to ask them whether those specific *aspects* of the search were helpful to them, as opposed to just asking them if they "liked the result."

I agree that these kinds of tests may be interesting pilot studies, but there are now, and have always been, drawbacks with attitude testing and market research. It can tell us, for example, to design things for the unsophisticated users of five years ago. I'm sure AOL, for example, got a huge thumbs-up from most of its users in 1995, but someone thought it sucked and kept pushing on to compete with them, to provide something different for a more forward-thinking audience.

Like/don't like doesn't tell us enough. Not to take anything away from the effort, which is laudable.

Can we go back to square one and ask what the goal of the experiment is? To determine which SE is the most relevant? Maybe in the current context the top four engines are so close in quality that it has become a less interesting and nuanced question to ask than it was six or seven years ago. There were periods when Excite and AltaVista results might have been spam-ridden and hopeless, and along came Inktomi to save the day, and then Google, etc. Teoma, Yahoo, and MSN are simply trying to out-Google Google -- a much less interesting competition. They are all doing a pretty good job of keeping close, IMHO, but not being different or better enough to matter. And as soon as one engine gains ascendancy, spammers target it even more, and the terrain shifts again. It strikes me that organic results have not just been in a struggle for improved relevancy over the past 6-7 years, but also in a struggle to stay relatively spam-resistant.

Mikkel deMib Svendsen
05-17-2005, 01:18 PM
You are certainly right that this is not a perfect test, but if it gains momentum and enough different people contribute, then the results might be interesting. Also, this is the first public test of its kind, and as such I will welcome it with whatever limitations it might have.

The privately held tests I've seen in the past have all been with smaller, but controlled, groups of people, which makes them much more valuable if you know what your target demographic is.

rustybrick
05-17-2005, 01:24 PM
At this point we have over 7,660 rated search results. I really want 20,000.

Future plans include:

- Rating the smaller search engines alongside the big ones (should be more interesting)
- Rating news search engines
- Rating shopping search engines
- Rating image search engines
and so on.

This can be a lot of fun, and I know the search engines appreciate the feedback. People like you and me simply find the data interesting (to say the least).

Qal
05-18-2005, 09:35 AM
Barry,

Why do you always miss the 'metasearch engines'? ;)

Comparing metasearch engine SERPs with traditional search engine SERPs would be a good idea too. :)

I, Brian
05-18-2005, 09:58 AM
these types of tests (user perception of relevancy tests from pre-scored machine results) are well documented. I'll be happy to provide you with some reference material.

Some reference material on how well users rate the relevancy of Google, Yahoo and MSN *at the moment* would be great. Isn't that what Barry is doing?

rustybrick
06-17-2005, 12:16 PM
Alright, I need about 150 more rated searches to hit the 10,000 mark.

Then I will publish the results, excluding IP address information.

Chris Boggs
06-17-2005, 01:35 PM
OK, I just rated 30 more (3 more phrases)... I'll leave #10k for someone else. I do like this... can't wait for the results!

rustybrick
06-20-2005, 03:33 PM
The final results, with all the raw data, can be found at RustySearch Results 10,000 Mark (http://www.seroundtable.com/archives/002101.html); the Search Engine Relevancy Challenge (http://www.rustybrick.com/search-engine-challenge.php) page also has all the updates at the bottom.

Chris Boggs
06-20-2005, 04:16 PM
Very cool. I was a little disappointed to find that the results shown were all from one SE, as opposed to being mixed up in the rankings, though. Perhaps next time you can make it randomly use different SE results within one page?

Great test, though. Looks like Yahoo shines brightly today, and MSN has some work to do.

Chris Boggs
06-20-2005, 04:34 PM
I had to laugh when I saw some of the search terms:

"tall gay men" followed by "tall women"

"cocktail waitress photos"

"liberal lunacy" (must have been a neo-con :P)

"surviving a cult"

"yahoo booters bling" ????

"air force ones" (I gets to stompin in mine! all white hightops of course)

perhaps the best: "how many heroin addicts are there in dublin?"

"kill fish mercifully/quick"

"is poison ivy contageous"

martinuboo
06-20-2005, 04:41 PM
Very well done Barry! :)

I agree with Chris, I would like to see the SERPs made up of random results from multiple engines.

I'm not surprised by the ranking of the SEs, but I am surprised they are as close to each other as they are; then again, the 1 - 5 scale makes them seem closer.

Thanks for a great test!

martin

rustybrick
06-20-2005, 04:52 PM
The reason the SERPs were NOT "made up of random results from multiple engines" is that I would have needed to make too many API calls (queries) to do that. This was the quick way to do it, efficiently in terms of query usage.
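To illustrate the tradeoff (a simplified Python sketch, not the real RustySearch code; fetch_results just stands in for one metered API call):

import random

ENGINES = ["google", "yahoo", "msn", "ask"]

def fetch_results(engine, query):
    # Stand-in for one metered API call to a single engine.
    return ["%s result %d for %s" % (engine, n, query) for n in range(1, 6)]

def build_page_single_engine(query):
    # What the challenge did: pick one engine at random per rated page.
    # Cost: one API call per page.
    return fetch_results(random.choice(ENGINES), query)

def build_page_blended(query):
    # Mixing engines on one page would cost four API calls per page,
    # one per engine -- roughly 4x the query quota.
    return dict((engine, fetch_results(engine, query)) for engine in ENGINES)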