Search Engine Watch
Old 05-08-2008   #1
MissOpie
Member
 
Join Date: May 2008
Posts: 6
Anaphora, complex stemming, etc...

Hi,

I was reading this thread and wondered what you guys thought about it.

It's relatively basic but interesting, covering the potential use of anaphora resolution in Google and the "stemming of complex plurals".

Just wondered what your thoughts were on that.

MissOpie x
Old 05-08-2008   #2
jimbeetle
 
Join Date: Mar 2006
Location: New York City
Posts: 997
Re: Anaphora, complex stemming, etc...

A lot of good thoughts in that thread and much to think about going forward. But it did get quite off track from the initial stemming of complex plurals topic (which many such threads are wont to do).

Anaphora resolution is basically associating a pronoun with its antecedent. The wiki article gives a good example of its purpose and also an example of the inherent complexity:

Quote:
We gave the bananas to the monkeys because they were hungry.
We gave the bananas to the monkeys because they were ripe.
I can't say that this won't happen at some point, though my understanding is that machine language analysis isn't quite there yet.
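
To see why the two banana sentences are hard, here is a minimal sketch of a deliberately naive resolver. It assumes a hand-made list of plural nouns (`PLURAL_NOUNS`) and a single rule, "pick the closest plural noun before the pronoun"; both are hypothetical stand-ins for illustration, not how any real system is known to work.

```python
# Hypothetical mini-lexicon of plural nouns; a real system would use a parser.
PLURAL_NOUNS = {"bananas", "monkeys"}

def resolve_they(sentence):
    """Naive rule: 'they' refers to the closest plural noun before it."""
    words = sentence.lower().rstrip(".").split()
    if "they" not in words:
        return None
    idx = words.index("they")
    # Walk backwards from the pronoun and stop at the first known plural noun.
    for w in reversed(words[:idx]):
        if w in PLURAL_NOUNS:
            return w
    return None

hungry = "We gave the bananas to the monkeys because they were hungry."
ripe = "We gave the bananas to the monkeys because they were ripe."
print(resolve_they(hungry))  # monkeys (correct)
print(resolve_they(ripe))    # monkeys (wrong: it should be bananas)
```

The rule gets the first sentence right only by luck of word order. Getting the second one right needs the knowledge that bananas ripen and monkeys do not, which is exactly the part that is hard to automate.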

A bit more interesting to me at the moment from that thread is Robbo's example passage:

Quote:
France is a beautiful country. That country is in western Europe and it is controlled by farmers. The republic has huge wine lakes and butter mountains but recently it has decided to free up some of the bureaucracy of that nation.
Putting pronouns aside for now, we can come up with:

Quote:
France related to Country related to Republic related to Nation
And, if the entry were extended a paragraph or two, we might see something like:

Quote:
France related to French
French related to Language related to Culture related to Beret
Quite a few folks think that to a certain extent this is already implemented, myself among them. My mind is a bit muddled at the moment, but I'm sure somebody here can cite papers or patents relating to it.
Old 05-08-2008   #3
mcanerin
 
Join Date: Jun 2004
Location: Calgary, Alberta, Canada
Posts: 1,564
Re: Anaphora, complex stemming, etc...

I'm not so sure that is what is happening. I can't say it's not, of course, but the problem with linguistic methods is.. language. Since there is obviously more than one language on the web, linguistic analysis is more limited than you may think.

What I do think might be very possible, even probable, is a lookup table based on related words.

Think about it - stemming only gets you so far (fishes to fish). But if you had a lookup table for certain words, you could very easily combine a great many words into a single "concept".
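
A minimal sketch of that limit, assuming a toy suffix-stripping stemmer: the function name `naive_stem` and its three rules are invented for illustration and are far cruder than a real stemmer. It handles "fishes" but has no way to connect an acronym to its expansion.

```python
def naive_stem(word):
    """Toy stemmer: strips a few common English plural suffixes (illustrative only)."""
    w = word.lower()
    # Try the longest suffixes first so "es" is tested before "s".
    for suffix, replacement in (("ies", "y"), ("es", ""), ("s", "")):
        # Keep at least three characters of stem to avoid mangling short words.
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: len(w) - len(suffix)] + replacement
    return w

print(naive_stem("fishes"))   # fish
print(naive_stem("monkeys"))  # monkey
# Stemming works on word shape only, so these two never meet:
print(naive_stem("SEO") == naive_stem("search engine optimization"))  # False
```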

An example used in the thread that was referenced earlier was "Nitrous Oxide" and NOS. How about search engine optimization and SEO as another example?

It would be simple to find related terms - Google has access to AdWords data, term vectors, a huge database of sites, a dictionary (the define: command) and no doubt many other sources of information.

It would not be difficult to say that if someone searched for "search engine optimization", the engine should also check "SEO" and "search engine optimisation", too. It would probably be more efficient than trying to analyse pages and queries linguistically. Just do a lookup. Google is good at lookups.

Think about Google Translate. Take a very close look at how it works. It starts by doing a standard translation. But then it adds phrase lookups based on both user input and on documents that it knows are identical but translated. The result is far better than a dictionary- or linguistics-based translation wherever a human translation is available.

The wonderful thing about lookup tables is that they are multi-lingual - it doesn't matter what language is being used - it's just a lookup.

If I were them, I would then assign a weighting to the lookups, where an exact match was given more weight than an inexact match, and a popular inexact match given more weight than an unpopular one.
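
A rough sketch of such a weighted lookup follows. Everything in it is an assumption made for illustration: the `CONCEPTS` table, the `POPULARITY` figures, and the rule that inexact matches are capped at half the weight of an exact match. It shows the shape of the idea, not anything Google is known to run.

```python
# Hypothetical concept table: each canonical term lists its known variants.
CONCEPTS = {
    "search engine optimization": ["seo", "search engine optimisation"],
    "las vegas": ["vegas", "lv", "los vegas"],
}
# Invented popularity scores in [0, 1], used only for inexact matches.
POPULARITY = {
    "search engine optimization": 1.0, "seo": 0.9, "search engine optimisation": 0.4,
    "las vegas": 1.0, "vegas": 0.8, "lv": 0.2, "los vegas": 0.1,
}

# Reverse index: any variant (or the canonical term itself) -> canonical term.
VARIANT_TO_CONCEPT = {concept: concept for concept in CONCEPTS}
for concept, variants in CONCEPTS.items():
    for variant in variants:
        VARIANT_TO_CONCEPT[variant] = concept

def expand_query(query):
    """Return (term, weight) pairs: the exact match gets 1.0, the rest less."""
    q = query.lower().strip()
    concept = VARIANT_TO_CONCEPT.get(q)
    if concept is None:
        return [(q, 1.0)]  # unknown term: nothing to expand
    weighted = []
    for term in [concept] + CONCEPTS[concept]:
        if term == q:
            weighted.append((term, 1.0))
        else:
            # Inexact matches are capped at 0.5 and scaled by popularity.
            weighted.append((term, 0.5 * POPULARITY.get(term, 0.5)))
    return sorted(weighted, key=lambda pair: -pair[1])

for term, weight in expand_query("search engine optimization"):
    print(term, weight)
```

Because the table is keyed on raw strings, nothing in it depends on the language of the query, and a misspelling such as "Los Vegas" is just one more key pointing at the same concept.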

Some concepts that would work great with lookups but not with stemming or linguistics:

Las Vegas, Vegas, LV, Los Vegas
SEO, Search engine optimisation, Search engine optimization
Woman, female, babe, lady
Man, male, guy, dude, gent

And so on.

I would suggest that Occam's Razor would support a simple lookup based on Google's vast database of keywords and searches, rather than language-specific, on-demand linguistic analysis of every search and page.

It would also deal with common spelling mistakes fairly easily - let linguistic analysis try that!

Ian
Old 05-08-2008   #4
jimbeetle
 
Join Date: Mar 2006
Location: New York City
Posts: 997
Re: Anaphora, complex stemming, etc...

Quote:
I'm not so sure that is what is happening. I can't say it's not, of course, but the problem with linguistic methods is.. language. Since there is obviously more than one language on the web, linguistic analysis is more limited than you may think.
Of course, that's why I think the first instance, though it might be what they're striving for, is still a way off. It might come in time.

Quote:
What I do think might be very possible, even probable, is a lookup table based on related words.
Yes, again of course, but I think your examples might be a bit limited. As in my second example, these relationships can be extended, say, from "France" to "French" through to "Beret." It isn't just synonyms but the co-occurrence of terms. Analysis might show that "Las Vegas, Vegas, LV, Los Vegas" are all somewhat related to "casino, gambling, blackjack, roulette," or "convention, trade show, exposition."
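
As a hedged illustration of co-occurrence, the sketch below counts how often pairs of terms share a page. The five "pages" are invented, and treating a page as a bag of whitespace-separated words is a simplifying assumption; real analysis would run over a vastly larger corpus with proper tokenising.

```python
from collections import Counter
from itertools import combinations

# Tiny invented corpus; each string stands in for one page.
PAGES = [
    "las vegas casino blackjack roulette",
    "vegas casino gambling convention",
    "las vegas trade show convention",
    "france french beret wine",
    "french language culture beret",
]

def cooccurrence_counts(pages):
    """Count how many pages each unordered pair of distinct terms shares."""
    counts = Counter()
    for page in pages:
        terms = sorted(set(page.split()))  # sorted so each pair has one key
        for a, b in combinations(terms, 2):
            counts[(a, b)] += 1
    return counts

def related_terms(term, counts, top_n=3):
    """Terms that most often appear on the same page as `term`."""
    scores = Counter()
    for (a, b), n in counts.items():
        if a == term:
            scores[b] += n
        elif b == term:
            scores[a] += n
    return scores.most_common(top_n)

counts = cooccurrence_counts(PAGES)
print(related_terms("vegas", counts))
print(related_terms("beret", counts))
```

None of the related terms here are synonyms of the query term; they simply tend to turn up alongside it, which is the distinction being drawn above.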

I do think that this is where things are now. We might not see "Beret" bolded in the snippets when we search for [France], but I'd (almost) bet that if both terms occurred on the same page a bit more weight would be given to the occurrence of "France."
Old 05-08-2008   #5
mcanerin
 
Join Date: Jun 2004
Location: Calgary, Alberta, Canada
Posts: 1,564
Re: Anaphora, complex stemming, etc...

Quote:
I do think that this is where things are now. We might not see "Beret" bolded in the snippets when we search for [France], but I'd (almost) bet that if both terms occurred on the same page a bit more weight would be given to the occurrence of "France."
I *would* bet that - it's a separate issue called term vector analysis, and I've been using it very successfully for a couple of years now.

The problem is that the only way to talk about it easily with non-statisticians (which includes me) is with pictures and graphs (or a drawing on the back of a napkin, which is how Dr. Garcia first explained it to me and one other guy you probably know a couple of years ago).

This means I don't discuss it in forums - the last time I tried, it was a disaster. Half the people thought I was crazy, and the other half thought they'd discovered the holy grail of spam. It's neither, and as Dr. Garcia will quickly point out, it is very difficult to do if you are not a search engine. My version is simply "good enough" and better than what my competition uses, but I don't pretend it's perfect.

As for details, it is safe to say that if you type a search term into Google, they can quickly bring up ALL the major terms in every document in their inventory and compare them very quickly. Terms that constantly show up in "natural" texts are then looked for as indicators of a particular document being both on-topic and non-spammy.
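
The thread does not spell out the math, so the following is only a sketch under an assumption: that "comparing term vectors" means something like the textbook vector-space model, with raw term counts per document and cosine similarity between them. The sample texts are made up.

```python
import math
from collections import Counter

def term_vector(text):
    """Raw term-frequency vector for a document (lowercased, whitespace-split)."""
    return Counter(text.lower().split())

def cosine_similarity(v1, v2):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(v1[t] * v2[t] for t in v1.keys() & v2.keys())
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

# Invented reference profile of terms that show up in "natural" Vegas texts.
reference = term_vector("las vegas casino poker hotel strip poker")
natural = term_vector("our vegas hotel has a casino with poker tables")
stuffed = term_vector("vegas vegas vegas vegas cheap cheap cheap")

# The natural page shares more of the expected vocabulary than the stuffed one.
print(cosine_similarity(reference, natural) > cosine_similarity(reference, stuffed))  # True
```

In this toy case the page that repeats one keyword scores lower than the page that uses the surrounding vocabulary, which is the on-topic and non-spammy signal described above.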

It's not just things like "poker" being related to "Las Vegas", it includes the number "800" being a term vector for an online shopping portal. Any guesses why?

Although TVA is a different issue, I do suspect that TVA is a good source for some of those lookup terms. I know I'd look there, if I was a search engine programmer.

Ian
Old 05-09-2008   #6
MissOpie
Member
 
Join Date: May 2008
Posts: 6
Re: Anaphora, complex stemming, etc...

I think that statistical machine translation (translation using parallel texts) has produced less than adequate translations. In French I recently had "server" translated as "waiter".

I think a lot more work is now being done on natural language techniques. My view here is that language does not abide by a set number of rules, it's more fluid than that.

Lookup techniques use up a lot of space and aren't flexible enough, in my opinion, to deal with things like natural language queries, which are seeing a lot of research activity at the moment (I'm not saying this is going to be used, just that there is activity).

I do know that there is interest in natural language generation also, a hugely difficult problem, that one.

Personally of course I'm not sure what's going on, but it's always interesting to chat about when something worth some banter pops up.

Old 05-09-2008   #7
jimbeetle
 
Join Date: Mar 2006
Location: New York City
Posts: 997
Re: Anaphora, complex stemming, etc...

Ian, yes, much too complicated for folks not Dr. Garcia to try to talk about in forums. But it does have to be discussed to a certain extent just so folks know there are things going on other than the obvious. The things I mentioned are more along the lines of what I try to keep back of mind as I work, or as I try to understand why I can't understand certain SERPs.