Stop words effect on results


Jonas
10-05-2004, 04:07 PM
Google clearly states that stop words are not included in searches. Still, including or leaving out a stop word delivers different results. For example, good at tennis (http://www.google.com/search?hl=en&lr=&ie=UTF-8&q=good+at+tennis) and good tennis (http://www.google.com/search?hl=en&lr=&ie=UTF-8&q=good+tennis) obviously give different results. It seems to be the same with most searches. On the other hand, some searches, like the famous to be or not to be (http://www.google.com/search?hl=en&lr=&ie=UTF-8&q=to+be+or+not+to+be), are the same without the stops.

So why and how does the stop word affect the result? How does a stop word differ from a regular word in terms of how results are generated?

Marcia
10-05-2004, 04:39 PM
The proximity isn't the same between the two. Even though the stop word itself is ignored, in one query the two words are right next to each other, and in the other, with the stop word in between, they aren't.

Like pound cake is not the same as pound of cake.
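
A minimal sketch of the proximity point, assuming the engine drops stop words but remembers where each remaining word sat in the original query. The stop list and the function names here are made up for illustration; nothing in this thread confirms how Google actually does it.

```python
# Hypothetical stop list; a real engine would use its own.
STOP_WORDS = {"a", "at", "in", "of", "the", "to"}

def positional_terms(query):
    """Return (position, word) pairs for non-stop words, keeping original positions."""
    words = query.lower().split()
    return [(i, w) for i, w in enumerate(words) if w not in STOP_WORDS]

def gaps(query):
    """Distance between each pair of consecutive kept terms."""
    terms = positional_terms(query)
    return [b[0] - a[0] for a, b in zip(terms, terms[1:])]

print(gaps("pound cake"))     # [1]
print(gaps("pound of cake"))  # [2]
```

Under that assumption the stop word is never looked up, but it still pushes the neighbouring words one position apart, which is enough for a proximity-sensitive ranking to treat the two queries differently.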

Jonas
10-05-2004, 04:56 PM
That's a bit weird but it explains it, so thanks for the insight.

I still want to interpret the whole idea of "stop word" though, as if the word is excluded from the search. But this means that it's only excluded in terms of checking the word against the index, but it's still physically there, taking up space in the query? Odd. But hey, it's search engines.

hinote
10-06-2004, 04:23 AM
Hi, what is the definition of a stop word?
Why is it called 'stop'?

Jonas
10-06-2004, 04:30 AM
Basically they are very common words like "the" and "in". Since they carry no significant meaning but would take up an enormous amount of space in indexes and query traffic, they are simply excluded from the search. I don't see an obvious reason for the naming of the term though. Maybe it's simply because the words get stopped? Or perhaps it's a database term borrowed from wider (and older) contexts than just web search.

ppg
10-06-2004, 04:54 AM
I don't know how recent a change this is, or even if it's a change at all, but G is differentiating between stopwords to some extent now.

I personally don't remember this happening before, but a search on say flights to spain returns slightly different results than flights at spain, which returns the same results as flights it spain.

Also, the 'incorrect' searches will sometimes bring the 'did you mean flights to spain?' message.

So although G still says the stop word hasn't been used in your search, it seems that isn't entirely true.

Lance Housley
10-06-2004, 05:22 AM
>On the other hand some searches like the famous to be or not to be (http://www.google.com/search?hl=en&lr=&ie=UTF-8&q=to+be+or+not+to+be) is the same without the stops.
Marcia's point about proximity is spot-on, and also explains why to be or not to be works the same with or without the stopwords. ALL the words in that query (with the exception of NOT) are stopwords, so whether you type the stopwords or not, Google is conducting a single-word search and proximity is not relevant.

It's worth noting a couple of points, though:

Stopwords are not ignored in phrases with "quotation marks" round them. So a search for "to be or not to be" (http://www.google.co.uk/search?&q=%22to+be+or+not+to+be%22) really does search for all the words in the phrase.
If you place a plus sign directly in front of a stopword, Google will search for it anyway. So good +at tennis (http://www.google.co.uk/search?&ie=UTF-8&q=good+%2Bat+tennis) does give different results from good at tennis (http://www.google.co.uk/search?&ie=UTF-8&q=good+at+tennis) and good tennis (http://www.google.co.uk/search?&ie=UTF-8&q=good+tennis). And since Google brings to the top those results that contain your search terms as a phrase, even if you don't use quotation marks, you'll find that good at tennis does figure in your top results.

You can also use the plus sign to tell Google to accept your input terms just as they are. What do I mean? Well, for the last few months Google has been using implicit word stemming, so if you search for long distance running (http://www.google.co.uk/search?q=long+distance+running) Google actually looks for runner and runners (and maybe some other words) too. If you really want to search only for the version of the word you type, you can specify that by preceding the word with a plus sign - and that is a reason you might want to use a + even inside a phrase.
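
The quotation-mark and plus-sign rules described in this post can be sketched as a small query parser. This is only an illustration under assumed rules: the stop list, the `parse_query` name and the returned dictionary layout are all hypothetical, and stemming is left out.

```python
import re

# Hypothetical stop list; "not" is deliberately left out, as in the example above.
STOP_WORDS = {"a", "at", "be", "in", "of", "or", "the", "to"}

def parse_query(query):
    """Split a query into quoted phrases, +forced terms, ordinary terms and ignored stop words."""
    # Quoted phrases are kept whole, stop words included.
    phrases = [p.lower() for p in re.findall(r'"([^"]*)"', query)]
    rest = re.sub(r'"[^"]*"', " ", query)
    forced, ordinary, ignored = [], [], []
    for token in rest.lower().split():
        if token.startswith("+") and len(token) > 1:
            forced.append(token[1:])   # a leading + forces the word to be searched
        elif token in STOP_WORDS:
            ignored.append(token)      # a bare stop word is dropped
        else:
            ordinary.append(token)
    return {"phrases": phrases, "forced": forced,
            "ordinary": ordinary, "ignored": ignored}

print(parse_query("good +at tennis"))
print(parse_query("to be or not to be"))
print(parse_query('"to be or not to be"'))
```

With this toy parser the unquoted to be or not to be collapses to the single ordinary term not, while the quoted version survives as one complete phrase.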

Jonas
10-06-2004, 07:05 AM
Thanks all! Now I get it. Except, though, for the new issue raised by ppg: "a search on say flights to spain returns slightly different results than flights at spain". Counting words and characters, the proximity is identical. So what's the difference?

Lance Housley
10-06-2004, 07:52 AM
The fact that Google may disregard words (stopwords) when it is compiling its results does not mean that it will necessarily disregard them when it is ranking them in order to display on the results pages. You may notice that a search for flights at spain triggers a typical Google response:
Did you mean: flights to spain
as well as the standard stopword response
"at" is a very common word and was not included in your search
and if you accept the suggestion of flights to spain Google still tells you it is ignoring the to.

There is also the possibility that you get different results because you happen to have queried different Google databases. You will know that Google has datacentres around the world, and you could get connected to any of them, depending on load at the time you click the SEARCH button. Although the indexes at each datacentre are largely the same, since updating goes on continuously there can be some slight discrepancies between centres, though nothing like as much as Altavista used to give.

ppg
10-06-2004, 08:02 AM
>So whats the difference?

The stopword, that's my point.

Try
hotels in venice
hotels at venice
hotels it venice
hotels the venice

all produce slightly different results.
I always thought the stopword didn't make a difference, and G just used:
keyword stopword keyword
Must be some semantics jiggery pokery going on.

Mikkel deMib Svendsen
10-06-2004, 08:08 AM
>Must be some semantics jiggery pokery going on.

From a semantic point of view, an exact match will always score higher than a close match, or a match filtered for stop words. I think this may also play a role.
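
A toy illustration of that idea, combined with the earlier point that stop words may be dropped when matching but still count when ranking. The scoring rule, the stop list and the function names are assumptions invented for this sketch, not anything Google has documented.

```python
# Hypothetical stop list for the hotel examples above.
STOP_WORDS = {"at", "in", "it", "the", "to"}

def content_terms(text):
    """Words of the text with stop words filtered out."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def score(query, document):
    """0 if a content term is missing, 1 for a stop-word-filtered match, 2 for an exact phrase match."""
    doc_words = document.lower().split()
    # Matching stage: only the non-stop words have to be present.
    if not all(term in doc_words for term in content_terms(query)):
        return 0
    # Ranking stage: the full query, stop word included, earns a bonus.
    phrase = " ".join(query.lower().split())
    exact = f" {phrase} " in f" {' '.join(doc_words)} "
    return 2 if exact else 1

doc = "cheap hotels in venice italy"
print(score("hotels in venice", doc))  # 2
print(score("hotels at venice", doc))  # 1
print(score("hotels in rome", doc))    # 0
```

Both hotel queries match the same page, but only the one whose stop word agrees with the page text gets the exact-match bonus, so the two result lists could come out in a slightly different order.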