Search Engine Watch
Old 02-06-2005   #121
xan
Member
 
Join Date: Feb 2005
Posts: 238
Quote:
Originally Posted by Nacho
Thank you Jazar!

This is how I believe the picture would paint:

I don't see why you should worry about LSI at all. What I'm getting from this thread is confirmation that it is too computationally expensive for any search engine to apply such a technique to 8 billion documents.
Nicely done
Old 02-06-2005   #122
Robert_Charlton
Member
 
Join Date: Jun 2004
Location: Oakland, CA
Posts: 743
Quote:
Originally Posted by jazar
Orion has kindly left a summary of the scientific methodology on another thread. It starts like this:

1. Gather observations of a phenomenon.
2. Based on the observations, formulate a hypothesis to consistently explain the phenomenon.

You are moving to step 2 Robert, before validating step 1 ...
jazar - I have a college background in theoretical math and physics... long left behind... but it's given me some clues about the scientific method. I don't think that the language I was using ("I'd bet..." "I'd doubt") would be confused with the language of a scientific hypothesis.

Rigorously testing a hypothesis often is not possible in SEO anyway, and the scientific method often runs into some limitations here. It is important, though, to separate opinion from fact.

Let's say that I was just tossing out an observation and some thoughts that might be useful for others sitting around the table to use in building their own theories.

Last edited by Robert_Charlton : 02-06-2005 at 05:38 PM.
Old 02-06-2005   #123
jazar
Member
 
Join Date: Sep 2004
Location: London
Posts: 39
ok, sorry robert, didn't want to sound patronising in any way.

Quote:
Phrases are synonyms too. You have to be careful as you narrow down the terms: stripping out one term takes away the phrase too.
Same for ~mortgage: removing cards is the same as removing credit. The co-occurrence factor is high within the ~mortgage "circle". Is it worthless, then, to calculate the co-occurrence factor outside of the circle?
Old 02-06-2005   #124
Everyman
Member
 
Join Date: Jun 2004
Posts: 133
I think Google's algorithm works like this:

Old 02-06-2005   #125
Mike Grehan
Member
 
Join Date: Jun 2004
Posts: 116
Latent semantic indexing.

Guys,

I don't have time to read the whole thread here. But if you want to know about latent semantic indexing, I wrote about it (for those who have a copy of my second edition eBook) in the how search engines work chapter.

That was three years ago. And I was inspired, more recently, to go look at my research again after MSN launched.

Susan Dumais may be one of the most important researchers in this field.

And keywords that we live on right now... may never be the same keywords again, to everyone!

http://lsi.research.telcordia.com/ls...s/execsum.html

But, excuse me if this was covered earlier in the thread. I'm on holiday in Venice, Italy, for the carnival and don't have time to read it all just now.

Back with more later.

Cheers.

Mike.
Old 02-06-2005   #126
xan
Member
 
Join Date: Feb 2005
Posts: 238
She is a great lady and her stuff is always good. A Microsoft chick! I will see her again in April at SEM, and perhaps at SIGIR. I encourage you to look at her papers, very very good.

Of course in this area there are also Chen, Horvitz, Jurafsky, Hearst, the wonderful sir Brill, van Rijsbergen, ...

If computer science was a football team, they'd all be in mine

Have a great holiday.

Last edited by xan : 02-06-2005 at 08:03 PM.
Old 02-06-2005   #127
Mike Grehan
Member
 
Join Date: Jun 2004
Posts: 116
Quote:
Originally Posted by xan
She is a great lady and her stuff is always good. A Microsoft chick! I will see her again in April at SEM, and perhaps at SIGIR. I encourage you to look at her papers, very very good.

Of course in this area there are also Chen, Horvitz, Jurafsky, Hearst, the wonderful sir Brill, van Rijsbergen, ...

If computer science was a football team, they'd all be in mine

Have a great holiday.
I AM having a great holiday.

But I think your references may be taking this thread off topic again.

However, if it's pure information retrieval we're talking about... Then one of the masters (along with Salton) lives and teaches only a 90-minute drive from where I live:

http://www.dcs.gla.ac.uk/Keith/Preface.html

Now you're talking information retrieval ;-)
Old 02-06-2005   #128
xan
Member
 
Join Date: Feb 2005
Posts: 238
Hehehe.. this man (C. J. van RIJSBERGEN) is one of my favorite scientists. Baeza-Yates remains in pole position.

Yes, this is off topic.

Back to LSI and semantics.
Old 02-06-2005   #129
Nacho
 
Join Date: Jun 2004
Location: La Jolla, CA
Posts: 1,382
Quote:
Originally Posted by xan
Baeza-Yates remains in pole position.
Ricardo Baeza-Yates, another great Hispanic superstar in the world of IR and yes, search too!

In his book Modern Information Retrieval, co-written with Berthier Ribeiro-Neto, he clearly states how "Latent Semantic Indexing is an approach introduced in 1988". This section is well worth a look for those who have not read it yet.
Old 02-07-2005   #130
jorock
Member
 
Join Date: Feb 2005
Posts: 59
Quote:
Originally Posted by jazar
ok, sorry robert, didn't want to sound patronising in any way.



Same for ~mortgage: removing cards is the same as removing credit. The co-occurrence factor is high within the ~mortgage "circle". Is it worthless, then, to calculate the co-occurrence factor outside of the circle?
Definitely not worthless to know them. In many cases, the phrases are very strongly related.

example
~antivirus = virus scan

Just saying for the "simpler" screenscraper spec, make sure you can distinguish phrases.
Old 02-07-2005   #131
jorock
Member
 
Join Date: Feb 2005
Posts: 59
Quote:
Originally Posted by Robert_Charlton
One search I've been watching since Florida is ~mattresses. Note that Lava Beds National Monument comes up something like #3.

Obviously, on regular searches, Google is not saying that "mattresses" and "beds" are synonyms. But I'd bet that, given two mattress pages, otherwise identical, except that one contained the word "bed" and the other didn't, the one with "bed" would rank higher.

I'd doubt that just "bed" would outrank just "mattress" if the tilde were not used.
The "Lava Beds National Monument" ranks well for beds, but doesn't rank for any of the synonyms.

This is good info. It probably shows it's just a tie breaker, at least for competitive phrases.

The question is: if the word "mattress" were on the page, would it rank higher for beds?

~beds = mattress
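
To make "tie breaker" concrete, here is a toy Python sketch. Everything in it is an assumption for illustration: the base scores, the page texts and the synonym set are invented, and nothing here is known to reflect how Google actually ranks. It only shows the difference between a factor that reorders pages with equal scores and one that could lift a page past a stronger one.

```python
def rank(pages, synonyms):
    """Sort pages by base score; synonym hits only decide between equal scores."""
    def key(page):
        words = set(page["text"].lower().split())
        synonym_hits = len(words & synonyms)
        # Tuples compare left to right, so synonym_hits matters only on a tie.
        return (page["base_score"], synonym_hits)
    return sorted(pages, key=key, reverse=True)


# Hypothetical pages: A and B are identical in score, C is simply stronger.
pages = [
    {"name": "A", "base_score": 10, "text": "mattress sale"},
    {"name": "B", "base_score": 10, "text": "mattress and bed sale"},
    {"name": "C", "base_score": 12, "text": "mattress reviews"},
]

order = [p["name"] for p in rank(pages, synonyms={"bed", "beds"})]
print(order)  # ['C', 'B', 'A']
```

B beats A because of the extra word, but neither gets past C. If the synonym effect were a weighted factor instead of a tie breaker, B could in principle overtake C, which is the distinction the tilde observations are trying to pin down.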
Old 02-07-2005   #132
search10
Member
 
Join Date: Jan 2005
Posts: 9
~games includes "chess" but no other specific game. The top 100 results for "games" yield no sites that feature chess, while card and puzzle games are represented.

How it was decided that only chess merits inclusion in ~games may be a good parlor game, but while ~ has been around quite some time and should not be ignored, there isn't even slight evidence it is particularly in play these past few days.
Old 02-07-2005   #133
Robert_Charlton
Member
 
Join Date: Jun 2004
Location: Oakland, CA
Posts: 743
Quote:
Originally Posted by jorock
The "Lava Beds National Monument" ranks well for beds, but doesn't rank for any of the synonyms.

This is good info. It probably shows it's just a tie breaker, at least for competitive phrases.

The question is: if the word "mattress" were on the page, would it rank higher for beds?

~beds = mattress
Yes, it's interesting that "Lava Beds National Monument" doesn't come up higher for ~beds... and it might indeed rank higher for that search if it contained "mattress" or "mattresses" on the page. That's of course when searching using the tilde... not necessarily so for default searches, though that's why many of us have been thinking about the tilde since Florida.

I remember in my tilde explorations noticing that "~kitchen" brought up "food." I've optimized for several "kitchen" related topics where "food" wouldn't really be contextually appropriate, but I remember asking myself if working "food" into the page would help.

My guess is that the AdWords Keyword tool Broadmatch suggestions might in fact be a better source for other terms to include.
Old 02-07-2005   #134
Robert_Charlton
Member
 
Join Date: Jun 2004
Location: Oakland, CA
Posts: 743
A follow up thought... as I read the LSI papers, I get the sense that, if something like LSI is used as a weighting factor, what would end up being rewarded by this factor would be a proximity to the norm.

Do those more versed than I in this area feel this is so? If not, what's a more helpful way of looking at it?
Old 02-07-2005   #135
jazar
Member
 
Join Date: Sep 2004
Location: London
Posts: 39
Quote:
Just saying for the "simpler" screenscraper spec, make sure you can distinguish phrases.
good point!

Quote:
~games includes "chess" but no other specific game.
The student who is in charge of the word "games" is fond of chess, and despises all the other games.

Quote:
My guess is that the AdWords Keyword tool Broadmatch suggestions
Have you noticed any similarities between Broadmatch and what ~ returns?
Old 02-07-2005   #136
xan
Member
 
Join Date: Feb 2005
Posts: 238
As I suggested before, it is very, very unlikely that LSI/LSA is being used to weight any of this, as it is well known in the research community that LSI alone is flawed:

The concept space is not understandable by humans.

The information is all numbers without semantic meaning.

Performance: the SVD algorithm is O(N^2 k^3), where N is the number of terms plus documents, and k is the number of dimensions in the concept space.
k will be small, from 50 to 350, but N grows rapidly as the number of terms and the number of documents increase. This makes the SVD algorithm infeasible for a large, dynamic collection (like the one a search engine deals with).

There is no general consensus on the optimal number of dimensions in a concept space. See Dumais (TREC), Deerwester, etc. All the findings are different.

Performing an SVD is simply too time consuming to do on a regular basis, and much too expensive because of this.

We don't know how many updates we can perform before precision and recall degrade (which would be unacceptable to Google).

Deerwester S., Dumais S., Furnas G., Landauer T. and Harshman R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, vol. (41/6): 391-407.
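
To put the O(N^2 k^3) figure in perspective, here is a rough back-of-the-envelope sketch in Python. The collection sizes and the choice of k = 100 are made-up assumptions for illustration (only the 8 billion document count comes from this thread), and the bound ignores constants, so treat the outputs as orders of magnitude, not real running times.

```python
def svd_cost(n_terms: int, n_docs: int, k: int) -> int:
    """Rough operation count for the O(N^2 k^3) bound quoted above."""
    n = n_terms + n_docs  # N is the number of terms plus documents
    return n ** 2 * k ** 3


# Hypothetical sizes: a small test collection versus a web-scale index.
small = svd_cost(n_terms=50_000, n_docs=100_000, k=100)
web = svd_cost(n_terms=10_000_000, n_docs=8_000_000_000, k=100)

print(f"small collection: {small:.2e} operations")
print(f"web-scale index:  {web:.2e} operations")
print(f"ratio:            {web / small:.2e}")
```

Because N is squared, growing the collection dominates the cost long before the choice of k does, which is the point being made about large, dynamic collections.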

I am not saying it isn't used. I'm saying it definitely isn't used alone. Seeing it as the sole way of determining semantic relatedness between words is unrealistic.

WordNet is often used as part of a method, as are other machine-readable dictionaries. Latent would refer to the shortest measure of weights between 2 words.

The Lesk algorithm uses surrounding words to define the semantic class of a word in that context.
The Resnik measure is based on a concept hierarchy.
The Jiang-Conrath measure is based on the shortest path between concepts.
The Hirst-St-Onge measure gives the similarity between words in WordNet and is not restricted to nouns.
Banerjee-Pedersen uses the words to the left and right of the target that are known to WordNet.
Pedersen opts for supervised learning methods.
Quillian uses shared words in dictionary definitions.
Niwa and Nitta use content vectors based on co-occurrence found in a large corpus.
Agirre and Rigau use a similarity measure based on conceptual density to work out semantic relatedness in nouns.
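
As a rough illustration of just one entry in that list, the co-occurrence vector idea, here is a minimal Python sketch. It is a toy under stated assumptions: the three-sentence corpus, the window size and the function names are all invented for the example, and it is not a reconstruction of Niwa and Nitta's actual method or of anything a search engine is known to run.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical mini-corpus, purely for illustration.
corpus = [
    "compare mortgage rates and loan terms",
    "compare loan rates and mortgage terms",
    "chess is a board game",
]
vecs = cooccurrence_vectors(corpus)
print(cosine(vecs["mortgage"], vecs["loan"]))   # shared contexts, high score
print(cosine(vecs["mortgage"], vecs["chess"]))  # no shared contexts, 0.0
```

Words that keep the same company end up with similar vectors, with no dictionary involved. A real system would need a far larger corpus plus weighting and stop-word handling, which this sketch leaves out.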

... The point of that list is to show that there are many ways to measure semantic relatedness, and there are many that I haven't even listed. LSI is the most basic form, although it was revolutionary when introduced by the wonderful Susan Dumais.

I just think that thinking of LSI as Google's way of measuring term relatedness is short-sighted, with all due respect.

The best thing that you can do is describe your business and the reason for having it there (e.g. N provides quality biscuits to its customers. They are baked in our ovens ...).

Using links to related sites and so on is perfectly correct, and a clean, well presented and well coded site is beautiful for an index.



Last edited by xan : 02-07-2005 at 09:30 AM.
Old 02-07-2005   #137
hard target
Member
 
Join Date: Feb 2005
Posts: 14
Quote:
Originally Posted by xan
...The best thing that you can do is describe your business and the reason for having it there (e.g. N provides quality biscuits to its customers. They are baked in our ovens ...).


Well, that just sounds as if it's coming from Google's adv. department: just write relevant content, don't do any SEO, and you will be at the top... yeah, right.
The fallacy of this argument is in assuming that, just because Google's algorithm is (or might be) based on so much theoretical knowledge, it works perfectly (or even just acceptably).
But it doesn't, does it? So the purpose of SEO is to find out what the algorithm really does (as precisely as possible), not what its claimed noble purpose is (placing the most relevant content at the top of the SERPs), and then to manipulate the content so that it seems most relevant to Google (while still being acceptable to humans).

So, how does one do that?
1. By thoroughly understanding the algorithms you mentioned (as I am sure you do) and then running a number of simulations giving different weights to different algos, something so brain- and time-consuming that it is unrealistic to expect from an SEO business.
2. By superficially understanding the main concepts of IR and then reverse engineering, starting from the output, noticing patterns etc...
But to say "the best you can do is describe your business...." - I just don't buy that. What then is the reason for having SEO in the first place?
Old 02-07-2005   #138
xan
Member
 
Join Date: Feb 2005
Posts: 238
Quote:
Originally Posted by hard target
Well, that just sounds as if it's coming from Google's adv. department: just write relevant content, don't do any SEO, and you will be at the top... yeah, right.
The fallacy of this argument is in assuming that, just because Google's algorithm is (or might be) based on so much theoretical knowledge, it works perfectly (or even just acceptably).
But it doesn't, does it? So the purpose of SEO is to find out what the algorithm really does (as precisely as possible), not what its claimed noble purpose is (placing the most relevant content at the top of the SERPs), and then to manipulate the content so that it seems most relevant to Google (while still being acceptable to humans).

So, how does one do that?
1. By thoroughly understanding the algorithms you mentioned (as I am sure you do) and then running a number of simulations giving different weights to different algos, something so brain- and time-consuming that it is unrealistic to expect from an SEO business.
2. By superficially understanding the main concepts of IR and then reverse engineering, starting from the output, noticing patterns etc...
But to say "the best you can do is describe your business...." - I just don't buy that. What then is the reason for having SEO in the first place?

I'm not an SEO. I just mean that if your site is clean that's a big plus. The only thing I am doing here is trying to point you in a more realistic direction, that's all. Notice the semantically related words in my example. As for what is SEO for? To get a favorable ranking using legitimate methods? I'm not from Google advertising. I have nothing to do with advertising. I'm a scientist, a researcher. Do what you like. I'm just sharing some knowledge.

My purpose for being here? My colleagues don't believe our business and SEO can ever exist harmoniously side by side. I think SEOs and webmasters can perhaps help because sites being presented a certain way helps us. It's a bit like I have set out to prove it. A bet if you like. We'll see if I can show that it can be possible.

Last edited by xan : 02-07-2005 at 11:19 AM.
Old 02-07-2005   #139
hard target
Member
 
Join Date: Feb 2005
Posts: 14
Quote:
Originally Posted by xan
... I'm not from Google advertising. I have nothing to do with advertising...
Sorry if it seems that I implied that you were - absolutely not my intention. And your comments and references are valuable.
It is just that I didn't agree with one particular statement. It seems that following only your recommendation would result in "weak" SEO. Yes, I noticed your semantically connected example. But this is basically the same answer as the one to the ever-repeating novice question: "I have a site selling widgets; I need more content - how do I create it?" And the answer is almost invariably "create a page about the history of widgets, other uses of widgets ....", which really means "create content semantically connected to widgets". There is obviously nothing wrong with that. But SEOs need to go a couple of steps further: optimize the clean site so that it is still clean but has the flavor of semantic connectedness that Google and/or other SEs prefer. Now, how best to do that is obviously subject to (this lively) debate.
I don't believe in "Create it [clean site] and they [SE] will come".
Old 02-07-2005   #140
jorock
Member
 
Join Date: Feb 2005
Posts: 59
Quote:
Originally Posted by microsoft
~games includes "chess" but no other specific game. The top 100 results for "games" yield no sites that feature chess, while card and puzzle games are represented.

How it was decided that only chess merits inclusion in ~games may be a good parlor game, but while ~ has been around quite some time and should not be ignored, there isn't even slight evidence it is particularly in play these past few days.
Sites featuring chess make it in the top 200.

Sites featuring many of the other synonyms are all over the top 100.

~games -chess -gaming -cheats -games -gamer -software -activities -demos
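
For anyone repeating this by hand, the query-building step can be sketched in a few lines of Python. The function name and word list are hypothetical; the only thing taken from this thread is the pattern of a tilde term followed by minus-prefixed exclusions. Multi-word phrases are quoted so that a phrase is excluded as a unit.

```python
def build_tilde_query(term, excluded=()):
    """Return a query such as '~games -chess -gaming' from a term and exclusions."""
    parts = ["~" + term]
    for word in excluded:
        if " " in word:
            # Quote multi-word phrases so the whole phrase is excluded together.
            parts.append('-"' + word + '"')
        else:
            parts.append("-" + word)
    return " ".join(parts)


# Synonyms spotted so far; exclude them and search again to surface the next ones.
found = ["chess", "gaming", "cheats"]
print(build_tilde_query("games", found))  # ~games -chess -gaming -cheats
```

Each round you add whatever new related terms show up to the exclusion list and rerun the search, until nothing new appears.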

The working theory is it's a tie breaker, and not the sole reason for rankings.

Searching for games -chess significantly reduces the number of results returned too.
Any insight here?