Major Google Changes: Latent Semantic Analysis?


rustybrick
02-03-2005, 12:15 PM
MODERATOR NOTE: This thread is for discussion of recent changes at Google that may be due to LSI/LSA factors. For discussions of the update outside of this issue, see the What's Going On With Google: Feb. 2005 Update (http://forums.searchenginewatch.com/showthread.php?t=4153) thread for a guide to other topics.

I wrote about this at my blog under the title of Is the Google Sandbox Over? (http://www.seroundtable.com/archives/001477.html), but I thought I'd bring it to this forum.

Basically, I reported on some new changes at Google (http://www.seroundtable.com/archives/001476.html) - some major SERP changes. But then I did more digging, and found that "about a handful of sites I launched within the year, all sandboxed for [company name] are now all ranking #1 for [company name]."

I would like to hear from you all. A good tool to check all datacenters is at http://www.mcdar.net/dance/index.php

Check and let us know if you see the same.

Mod Note: Changed the name of the thread to better reflect what is currently being discussed.

bakedjake
02-03-2005, 01:18 PM
What you're seeing is an old-style dance.

No major changes in sites coming out of the sandbox. But there are MAJOR goings-on in the index.

bakedjake
02-03-2005, 01:19 PM
As a matter of fact, I'm seeing old sites now go INTO the sandbox.

A lot of SEOs won't like this when it rolls out. Google's devalued links, and looks like they're increasing reliance on LSI... maybe IDF too, but it's too early to tell.

rustybrick
02-03-2005, 01:23 PM
I have seen:

- new sites out of the sandbox
- new pages showing as older pages in google
- major flux in serps

But the main point is that many are reporting that they are now coming out of the sandbox.

It might just be a major flux and that is it.

bakedjake
02-03-2005, 01:26 PM
This is 100% an LSI update.

Whether or not the sandbox is tied to that, neither I nor anyone else really knows. But this is 100% related to LSI.

seomike
02-03-2005, 01:42 PM
Hey Jake, can you spell out the acronyms for us slow of thought :D

bakedjake
02-03-2005, 01:43 PM
You're an SEO mate. Use a search engine.

http://www.webmasterworld.com/forum34/665.htm?highlight=lsi

randfish
02-03-2005, 01:51 PM
I think he means Latent Semantic Indexing;

It's basically a method whereby search engines learn through indexing to associate certain terms with concepts - tiger woods & golf - saddam hussein & iraq, etc.

bakedjake - What evidence or examples do you have to support your theory? Why do you believe that LSI (or an upgraded version of it) is responsible and how do you see it affecting results? Are you suggesting that Google is better at theming pages, links, both?

I'd love to hear your ideas.

bakedjake
02-03-2005, 01:54 PM
Theming != LSI. Not in the way they're doing it. Theming is a nasty rumor started by someone who didn't understand the algo.

I'm not about to go post my research and examples on a public forum. But, I'll warn you now - if you're not varying your anchor text, and you're not writing pages synonymous with your term that don't contain the term you're targeting, you're going to be in a world of hurt within the next 90 days.

We've been tracking this update for the last 6 months. I was surprised to see it happen now - I honestly didn't expect it until next month or March, but it's here.

Oh, and figure out how to use the tilde query. If you haven't been using it to SEO your pages, you're about 4 months behind. Start now.

siteseo
02-03-2005, 03:09 PM
...if you're not varying your anchor text, and you're not writing pages synonymous with your term that don't contain the term you're targeting, you're going to be in a world of hurt within the next 90 days.

I find that hard to swallow. LSI is an "additional" way to associate keywords with websites - not a replacement. If I have a website about ducks, but have no articles that mention the word "quack" my site isn't going to dive into the abyss simply based on that.

If anything, a move by Google to expand the incorporation of LSI into its algo would expand the impact of links from sites/pages that may currently be on the fringe of association. Course, it can also mean a further devaluing of unrelated links, but that's always a good thing.

bakedjake
02-03-2005, 03:21 PM
siteseo, you're 50% there. Keep asking questions.

it can also mean a further devaluing of unrelated links

Bingo! But who says unrelated has to equal off-topic...

Contrary to what your local SEO tells you, there's no such thing as a natural link campaign....

Combined with LSI/IDF, it looks like this is what they are doing.

Nacho
02-03-2005, 03:25 PM
LSI???

Theming = LSI? IMO, LSI is a subdivision of LSA (latent semantic analysis). According to Orion, both have been used since the late 90's.

From what I know, it's too computationally expensive to be implemented by a commercial search engine. They may be using elements of latent semantics.

How do you know they are using this for indexing and ranking? Can you provide me with an example please?

bakedjake
02-03-2005, 03:32 PM
No Nacho, I said Theming != LSI... as in "does not equal".

Nacho
02-03-2005, 03:38 PM
No Nacho, I said Theming != LSI... as in "does not equal".
Sorry about that Jake.

I would still like to hear your thoughts on the rest of my post, please.

bakedjake
02-03-2005, 03:46 PM
From what I know, it's too computationally expensive to be implemented by a commercial search engine. They may be using elements of latent semantics.

I don't think it's too computationally expensive. Just IMHO.

Play around with the tilde queries. "baby clothes", "infant clothes", "infant apparel" leads to some interesting results.

Added: mp3 player vs. ipod, too.

rustybrick
02-03-2005, 03:59 PM
Ok, let's step back a bit and quote some people. Daron Babin (aka SEGuru), at the Super Session: History of SEO/SEM Theory and Testing - WMW Conf 7 (http://www.seroundtable.com/archives/001153.html), said something to the effect of:

He recommends writing a page of content and pulling out the keywords, then giving it to someone and asking them to figure out what the keyword is. He said it's about the other words on the page; it's that important. If the keyword is "apple", is the page about computers or fruit? :)

More to come, meeting...

bakedjake
02-03-2005, 04:03 PM
That's right, rb. That's sort of what I'm getting at, but combine it with advanced anchortext comparisons against those keywords on those pages that you pulled out.

To take your quote example and expand on it: I have a page about "baby clothes". I link to my site 100 times with the anchor text "baby clothes"

I now pull out the words "baby clothes" and all the links pointing to my site with the words "baby clothes"

Do I still have footing to rank for that term "baby clothes" after you've run some sort of semantic analysis on it?

That's my simplistic explanation. I think they're doing something very similar, but taking links into account like that and maybe even devaluing some links on the "main" term...

Which leads to the perception of some sites coming out of the sandbox (you don't do a huge link campaign on your company name), and some sites going into the sandbox (if i was previously optimizing for baby clothes with a 99% kwd). I'm still looking around here too...

glen, bear with me. I'm not trying to be a "pretentious twit", I'm trying to fuel conversation while I'm working on this to give me ideas. You could contribute, you know.

I, Brian
02-03-2005, 04:16 PM
Hm, I reported on LSI almost a year ago and the project was old even then - I was actually under the impression that Microsoft had been powering on this in partnership with University of Tennessee - when did Google pick it up?

LSI is certainly interesting - - - but I can't help but wonder about the impact on Google's relevancy, especially considering latest SERPs. I keep wondering if perhaps Google's search for greater relevancy isn't producing less relevant results. Another topic perhaps...

Anyway, Barry, the terms I'm watching aren't out, so maybe you just hit the end of sandboxing naturally?

I, Brian
02-03-2005, 04:18 PM
That's right, rb. That's sort of what I'm getting at, but combine it with advanced anchortext comparisons against those keywords on those pages that you pulled out.

Funnily enough, they've been doing that with AdSense for a while - I've been wondering when they'd apply that proper to normal search.

The WMW thread looks interesting, btw - thanks for pointing that out.

randfish
02-03-2005, 04:18 PM
There may be some members or guests who are very confused after reading this thread, but it is an especially important one, and worth taking the time to understand.

LSA - Latent Semantic Analysis
The idea behind this is that by taking a huge composite (index) of millions of web pages, the search engines can "learn" which words are related and which noun concepts relate to one another.

For example, using LSA, a search engine would recognize that trips to the zoo often include viewing wildlife and animals, possibly as part of a tour.

Now, conduct a search at Google for ~zoo ~trips (http://www.google.com/search?hl=en&lr=&safe=off&rls=GGLD%2CGGLD%3A2004-12%2CGGLD%3Aen&q=%7Ezoo+%7Etrips&btnG=Search). Note the bolded words match the terms I italicized in the paragraph above. Google is bolding 'related' terms and recognizing which terms frequently occur concurrently (together / on the same page / in close proximity) in their index.

Some forms of LSA are too computationally expensive. For example, Google isn't smart enough to 'learn' the way some of the newer learning computers do at MIT (see some news reports (http://news.google.com/news?hl=en&lr=&rls=GGLD%2CGGLD%3A2004-12%2CGGLD%3Aen&tab=wn&ie=UTF-8&q=learning+computer+MIT+language) on this). They cannot, for example, learn through their index that Zebras and Tigers are both examples of striped animals, although they may realize that stripes and zebra are more semantically connected than ducks and stripes.

Theming
Theming is more of an SEO concocted subject that is floated around often - choosing a 'themed' page for a link rather than a non-themed page. Basically, theming is what Google bought the company Kaltix (http://www.google.com/press/pressrel/kaltix.html) for. They created the site-themed (http://labs.google.com/personalized/siteflavored) (flavored) search for Google, which is able to categorize many websites, based on their content/links/etc. into varying themes through a categorization structure.

Hopefully that provides some clarity for those individuals who may be puzzled. I'm certainly still puzzled as to how bakedjake came to this conclusion (although I really appreciate your contribution BJ).

glengara
02-03-2005, 04:26 PM
*You could contribute, you know.*

Point taken ;-)

I sort of took objection to the "We've been tracking this update for the last 6 months" rather than something like " we've been waiting a while to see this being implemented".

It's just down to semantics, I suppose ;-)

I, Brian
02-03-2005, 04:35 PM
Ah! Latent Semantics!

glengara
02-03-2005, 04:45 PM
It hasn't gone away, you know ;-)

I always assumed it would be used to make G less dependent on KWs to determine the page topic....

St0n3y
02-03-2005, 04:53 PM
This is a great thread, and something that we have been following quite closely. I am interested to know how bakedjake is 100% sure the recent changes are LSI related. I don't think I'm 100% sure about anything in regards to SEO.

jorock
02-03-2005, 05:35 PM
LSI is probably a standalone build. It's a 2-step process, at least. The allintitles don't match when it's building. That's why some datacenters are off, numbers are down, etc.

LSI has been important since Florida and is gaining in importance, but I don't think the sky is falling.

It's just not done yet. These are "big" builds, they take time.

*Don't panic or celebrate until the allintitles match at all datacenters*

randfish
02-03-2005, 06:08 PM
jorock - What are you basing this information on? Why is the matching of allintitle results across datacenters important?

I, Brian
02-03-2005, 06:31 PM
Why is the matching of allintitle results across datacenters important?
That's one indicator that any big changes to the index have settled and the changes migrated across all DCs, I should expect.

LSI was one of those subjects that got at least a little attention after Florida, but it was Hilltop - and to some extent LocalRank - that really stole the show after that.

Nacho
02-03-2005, 06:51 PM
I still think that LSI is too expensive to implement. Let's assume LSI has been used: can someone show me the calculations or a numerical example of how to compute LSI measures in connection with search engine indexing?

oilman
02-03-2005, 07:34 PM
>>when did Google get in to this?

I'm guessing about the time they bought Applied Semantics. Now why would a search engine buy a company like Applied Semantics?

jorock
02-03-2005, 07:46 PM
jorock - What are you basing this information on?

Observations over time.

Why is the matching of allintitle results across datacenters important?

It shows the synonyms are matching up.

It also explains why sometimes filenames are case sensitive, it's during the build.

It's only one observation, not sure when they match is exactly when it's done, but it's a good sign there's a 2 step process to the build.
"When it looks finished"
1. allintitles match
2. filenames aren't case sensitive.
3. result numbers go up
4. more synonyms based sites are included in the results, or rank better.

jorock
02-03-2005, 07:49 PM
I still think that LSI is too expensive to implement. Let's assume LSI has been used: can someone show me the calculations or a numerical example of how to compute LSI measures in connection with search engine indexing?

How do you explain the ~ command then?

graywolf
02-03-2005, 07:56 PM
Anyone have a link to inverse document frequency (IDF) for dummies?

jorock
02-03-2005, 08:44 PM
So you think IDF is enough to explain the heavy reliance on synonyms in the default search?

I'll concede if that explains the inclusion of synonyms in this build and in the results.

Only Google is 100% certain what they use for synonyms, but the evidence suggests that's what's being recalculated now.
That's the point I was trying to make.
It's a separate build.


Anyone have a link to inverse document frequency (IDF) for dummies?

orion
02-03-2005, 09:00 PM
Some quick observations.

LSA is based on the well known Singular Value Decomposition Theorem from Matrix Algebra but applied to text.

LSI uses this for indexing purposes, whether or not the terms are synonyms is entirely irrelevant. The key is the dimensionality of the sample under the "microscope".

Some IR systems use the ~ for finding similar or related terms. Others use it for proximity. It just depends on the particular implementation. One can even design an IR system to recognize ~ as a wildcard or as a find-only-synonyms operator.

Orion

PS. As Nacho mentioned, I would like to know of someone that has valid evidence of step-by-step calculations on how current commercial search engines implement LSI.

Nacho
02-03-2005, 09:07 PM
How do you explain the ~ command then?
Thank you Orion! :)

orion
02-03-2005, 09:13 PM
So you think IDF is enough to explain the heavy reliance on synonyms in the default search?

I'll concede if that explains the inclusion of synonyms in this build and in the results.

Only Google is 100% certain what they use for synonyms, but the evidence suggests that's what's being recalculated now.
That's the point I was trying to make.
It's a separate build.

About IDF or Inverse Document Frequency.

IDF = log(D/d) where D = collection size and d = number of documents containing a given term.

weight of a term, w=tf*IDF

Different tf and IDF measures have been published over the years, tailored to the IR investigator's needs.
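
As a rough illustration of the two formulas above, here is a minimal Python sketch. The base-10 logarithm, the function names, and the collection numbers are assumptions made for the example; the post does not specify a log base or any real figures.

```python
import math


def idf(collection_size, docs_with_term):
    """IDF = log(D/d). Base-10 log is an assumption; the post gives no base."""
    return math.log10(collection_size / docs_with_term)


def term_weight(tf, collection_size, docs_with_term):
    """Weight of a term in one document: w = tf * IDF."""
    return tf * idf(collection_size, docs_with_term)


# Hypothetical collection of one million documents.
D = 1_000_000

# Same term frequency (3) in the document, very different document counts.
rare = term_weight(tf=3, collection_size=D, docs_with_term=100)
common = term_weight(tf=3, collection_size=D, docs_with_term=500_000)

print(f"rare term weight:   {rare:.3f}")
print(f"common term weight: {common:.3f}")
```

The point of the example is only that a term found in few documents gets a much larger weight than one found in half the collection, even when both appear equally often on the page.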

This is well explained in the Term Vector Theory (http://forums.searchenginewatch.com/showthread.php?t=489) SEWF thread.

IDF has nothing to do with LSI. In fact, IDF predates LSI/LSA by several decades.

C'mon, guys. Let's not start misinforming/confusing posters.


Orion

jorock
02-03-2005, 09:39 PM
Good info, I use the term synonym for simplicity. Thanks for pointing this out.

I meant synonyms as in "they are synonyms that the system determines based on what's under the microscope".

How they use synonyms in the regular search seems directly related to terms produced with the ~ command. They don't seem to be standard language synonyms.

I'd like to see more direct evidence that it's LSI, or how it's done internally too. ;)

Some quick observations.

LSA is based on the well known Singular Value Decomposition Theorem from Matrix Algebra but applied to text.

LSI uses this for indexing purposes, whether or not the terms are synonyms is entirely irrelevant. The key is the dimensionality of the sample under the "microscope".

Some IR systems use the ~ for finding similar or related terms. Others use it for proximity. It just depends on the particular implementation. One can even design an IR system to recognize ~ as a wildcard or as a find-only-synonyms operator.

Orion

PS. As Nacho mentioned, I would like to know of someone that has valid evidence of step-by-step calculations on how current commercial search engines implement LSI.

orion
02-03-2005, 10:12 PM
I meant synonyms as in "they are synonyms that the system determines based on what's under the microscope".

I feel (but I could be wrong) you are describing whether terms in a document are on-topic. This is semantic association, not synonymity. I apologize in advance if this is not what you mean.

Cheers

Orion

jorock
02-03-2005, 10:45 PM
Yes, that's what I mean.
I'll use that phrase from now on.
( I used synonyms because that's what Google calls the ~ )

The point I was trying to make was the semantic association seems to be a standalone build. And it takes a few days to complete.

I see a lot of people panic when looking at the datacenters where the results are off, I was hoping to provide some insight with my observations.

I'd appreciate any insight to my findings above during this build, all these things happen at the same time. For the past 4 months.



I feel (but I could be wrong) you are describing whether terms in a document are on-topic. This is semantic association, not synonymity. I apologize in advance if this is not what you mean.

Cheers

Orion

graywolf
02-03-2005, 11:05 PM
So in a nutshell it's looking for the most unique pages with the most links from pages with similar topics?

seobook
02-03-2005, 11:40 PM
So in a nutshell it's looking for the most unique pages
not necessarily most unique page...but placing extra weight on a page which is well defined as relevant to the search query by its contents outside of the specific matching word.

with the most links from pages with similar topics?
the related page stuff is more Hilltop (http://www.cs.toronto.edu/~georgem/hilltop/) related.


some of the semantic analysis that is done on the page content may also be done on the linkage data.

if most of your links are exact matching keyword rich links and few links with variations or synonyms then that linkage profile may not rank as well as a site that has an equal quality and quantity of links with more naturally mixed anchor text.

if your site name also contains your primary keywords and most of your links contain your site name or that specific keyword from it then you could end up not only ranking bad for your primary keywords, but also ranking poorly for your site name.

for good karma mix your anchor text
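
A minimal sketch of one way to check how mixed a link profile is. The anchor list and site name are hypothetical, and no one outside Google knows what share of exact-match anchors, if any, causes a problem; the function only reports the numbers.

```python
from collections import Counter


def anchor_profile(anchors, target_phrase):
    """Return (exact_match_share, distinct_anchor_count) for a list of anchor texts."""
    normalized = [a.strip().lower() for a in anchors]
    counts = Counter(normalized)
    exact = counts[target_phrase.strip().lower()]
    share = exact / len(normalized) if normalized else 0.0
    return share, len(counts)


# Hypothetical inbound link anchors for a site targeting "baby clothes".
anchors = ["baby clothes"] * 8 + ["infant apparel", "Acme Kids Store"]

share, distinct = anchor_profile(anchors, "baby clothes")
print(f"{share:.0%} exact match, {distinct} distinct anchors")
```

A profile where nearly every anchor is the exact target phrase is the pattern described above as risky; what counts as "naturally mixed" is a judgment call.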

graywolf
02-03-2005, 11:56 PM
but placing extra weight on a page which is well defined as relevant to the search query by its contents outside of the specific matching word.

So if a page was about apples, it would also expect to find words like trees, pies, and/or fruit?

Revisiting the unique issue, if you were to take a competitor's page and add enough extra stop words and different versions of his words that will stem back to the same thing, would that make his page "look worse" from the algo's point of view?

seobook
02-04-2005, 12:22 AM
So if a page was about apples, it would also expect to find words like trees, pies, and/or fruit?
those could potentially fit well. keep in mind that there is also another version of apple too with computers mac os X imac etc
Revisiting the unique issue, if you were to take a competitor's page and add enough extra stop words and different versions of his words that will stem back to the same thing, would that make his page "look worse" from the algo's point of view?
I doubt it. it's across many many pages and in well developed communities a single page may not have much effect on the other pages unless it helps them trip a duplicate content filter, but to do that you might need to have more PageRank than the other page you are trying to delist.

it is not about most unique page. it is about page which is best matching.

Adam C
02-04-2005, 05:21 AM
As a matter of fact, I'm seeing old sites now go INTO the sandbox.

Have seen similar movements.


I remember when the ~ operator was first released. In the same month a search for "search engine optimiSation" started returning results with "search engine optimiZation" spellings. I took it at the time that there was some kind of mild implementation of the ~ in the main index. I could of course be completely wrong, as is often the case.

glengara
02-04-2005, 07:16 AM
I've noticed some pages that target "SEO" terms are now appearing for "search engine optimization" ones.

AFAIK there are no links/anchor text using the full phrase, so if down to some sort of LSI, it seems to trump even anchor text.

Adam C
02-04-2005, 07:56 AM
Just to clarify, what I was talking about was when the word optimization was first bolded in searches for optimisation

hard target
02-04-2005, 12:14 PM
... and you're not writing pages synonymous with your term that don't contain the term you're targeting, you're going to be in a world of hurt within the next 90 days.


Wouldn't this imply that when comparing the following two searches:
1. ~keyword -keyword
2. keyword

the high ranking sites in #2 should also rank high in #1? This doesn't seem to be the case on a small sample I looked at. SERPs are still dominated with "keyword" containing pages.
Did I misunderstand your post or does it have to do with timing --- "next 90 days"?

seobook
02-04-2005, 01:01 PM
Wouldn't this imply that when comparing the following two searches:
1. ~keyword -keyword
2. keyword

the high ranking sites in #2 should also rank high in #1? This doesn't seem to be the case on a small sample I looked at. SERPs are still dominated with "keyword" containing pages.
Did I misunderstand your post or does it have to do with timing --- "next 90 days"?
many of the most relevant documents will also happen to naturally have occurrences of the keyword so subtracting all of them out of the first subset of search results (like in #1) may end up making that set significantly different than #2

hard target
02-04-2005, 01:15 PM
many of the most relevant documents will also happen to naturally have occurrences of the keyword so subtracting all of them out of the first subset of search results (like in #1) may end up making that set significantly different than #2
Agreed - so bakedjake's original statement seems a bit radical, doesn't it? Or did I misunderstand it?

BTW, I have a great respect for bakedjake's posts in this and other forums; I just want to make sure that I understand correctly what the meaning of the post was.

xan
02-04-2005, 01:24 PM
Hi I’m a computer scientist, PhD. I thought I would stick my nose in and clear some things up:

LSA is not new at all; it has been used, as someone has previously stated, since the 90’s. It’s expensive but it has been proven to work well. LSI is LSA, it’s just another name for it.

LSI is used to address 3 problems: synonymy and polysemy (ambiguity). It arranges words into a concept space. Given all of the concepts retrieved, a set of documents can be retrieved. It also overcomes noise (punctuation, odds and ends that make processing a pain).

New methods since have definitely been introduced such as using an associative network from a corpus (Guy Denhiere). It is incremental, which is more plausible, and takes into account higher order co-occurrences in the construction of word similarity. It can use different units of context whereas LSA uses the paragraph.

LSA represents the meaning of words as a vector, thus calculating word similarity. It’s not exactly rocket science but it has been efficient and is still used. The text here is considered linear. However any new semantic representation means running the whole thing again.

Other methods do exist for calculating word similarity which I will not go into detail about but will briefly explain:

SRCR (sparse random context representation) - 2002
Each word is assigned a random vector which is then updated with the vectors of co-occurring words.

WAS (word association space)
Words in similar contexts are placed in the same space.

LSA/LSI has major drawbacks:

The information is all numbers without semantic meaning – it’s hard to debug.

It uses the SVD algorithm. The SVD algorithm is O(N^2 k^3), where N is the number of terms plus documents, and k is the number of dimensions in the concept space. If the corpus is unstable and grows rapidly, it's infeasible. The SVD algorithm is unusable for a large, dynamic collection.

It’s hard to find the number of dimensions in the concept space. Nobody knows the optimal number to use.

Precision-recall improves and then decreases after hitting an optimal state. So, if you like, it's unstable.

Using SVD on a large collection which is dynamic is horrendously expensive.

LSI is slow due to using a matrix method, Singular Value Decomposition, to create the concept space.

Popular methods include graph-based clustering and classification, statistics-based multivariate analyses (as well as latent semantic indexing: multi-dimensional scaling, regressions), artificial neural network-based computing (backpropagation networks, Kohonen self-organizing maps), and evolution-based programming (genetic algorithms).

As you can see, Google has quite a choice of methods, and I doubt that LSI would be the best one considering the task at hand. The Google algorithm is complex and uses many methods found in information retrieval, data mining, and A.I. It is very unlikely that one method as routine as LSI would be the main formula in this mathematical bundle. Also, this issue only addresses semantic similarity, not ranking, which I think is your priority.

Of course document similarity and topic detection are the main ways of returning relevant documents, however there are so many ways to do this and none of them are straightforward. In fact no one has yet found a stable way of applying methods that work very well in digital library collections to web data. The problem with data on the web is that it changes all the time. It's dynamic and unstable.

I keep a blog which deals with computing science methods where I explain these, and topics are based on things that I find in forums like this one, just to clear up any misunderstandings. I have no dealings with SEO, but visit SEO forums to assess how far professionals have come in using search and understanding its techniques. I work in A.I and computational linguistics. Sorry about the long post.

Berkeley's explanation (www.sims.berkeley.edu/~rosario/projects/LSI.pdf)

search science (http://spaces.msn.com/members/search-science)

orion
02-04-2005, 01:40 PM
Thanks, Xan

Finally, someone is talking the truth about LSA and SVD. No news here. Good to have you in the SEWF.

On-topic analysis and co-occurrence theory can be used to explain the above as well as term disambiguation. See you all at the SES NY.

Orion

randfish
02-04-2005, 01:47 PM
xan,

Thank you for joining and contributing. It's important that we have people like you to help us out - your contributions mean a lot and your willingness to share is commendable.

I (and probably many other members of SEW) would love to visit your blog and read some of your writings - would you share it with us?

Also, the alternate methods you mentioned, along with LSI/A all are shooting for the same goal: to use the index of the world wide web to calculate the relationships between words in order to have a better understanding of which concepts are related and which are not.

As I see it from an SEO perspective, our goal is similar (albeit for a different reason). We want to puzzle out which words and phrases are most semantically connected to one another for a given keyword phrase, so that as search engines crawl the web, they see that links to our pages and the content within them is semantically related as per the other information in their database.

However, we have a big advantage. We don't have to use an algorithm to calculate this necessarily, and we're not concerned with computational expensiveness. Why? Because we optimize individually for single keyword phrases - meaning we can devote an hour or 24 to finding the most connected keywords/phrases.

Let me propose an SEO method for discovery.

#1 Search for your keyword phrase @ Google
#2 Take the text from the top 100 search results and put them into rows in a table (remove stopwords)
#3 Analyze the most frequently occurring 1, 2, 3 & 4-word phrases among the documents (discount duplicate entries from a single row/page).
#4 Take the keywords that show up in 5+ entries and conduct a C-Index (Keyword Co-Occurrence) calculation for each.
#5 The results will give you the highest C-Index words/phrases for your particular term.
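
The steps above can be sketched roughly as follows. This is a simplified, hypothetical version: it handles single words only, the page texts and stopword list are made up, and the score n12 / (n1 + n2 - n12) is an assumed co-occurrence ratio, since the post does not spell out the C-Index formula.

```python
from collections import Counter

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "and", "of", "for", "to", "in", "with"}


def doc_terms(text):
    """Set of non-stopword terms on one page (repeats within a page count once)."""
    return {w for w in text.lower().split() if w not in STOPWORDS}


def cooccurrence_scores(pages, keyword, min_pages=5):
    """Rank terms by n12 / (n1 + n2 - n12) against `keyword`.

    n1  = pages containing the keyword
    n2  = pages containing the candidate term
    n12 = pages containing both
    """
    term_sets = [doc_terms(p) for p in pages]
    doc_freq = Counter(t for s in term_sets for t in s)
    n1 = doc_freq[keyword]
    scores = {}
    for term, n2 in doc_freq.items():
        if term == keyword or n2 < min_pages:
            continue
        n12 = sum(1 for s in term_sets if keyword in s and term in s)
        scores[term] = n12 / (n1 + n2 - n12)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical stand-ins for the text of the top search results.
pages = [
    "zoo trips for the family with animals",
    "zoo animals and wildlife tours",
    "wildlife tours in africa",
    "zoo animals feeding schedule",
]

# min_pages is lowered from 5 to 2 only because the sample is so small.
ranked = cooccurrence_scores(pages, "zoo", min_pages=2)
for term, score in ranked:
    print(f"{term}: {score:.2f}")
```

With 100 real result pages you would keep the 5+ threshold from step #4 and extend `doc_terms` to emit 2, 3 and 4-word phrases as well.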

Xan, perhaps you can tell us if this is a good or faulty method.

I wish I was going to see you all at SES (alas, it's far outside my price range).

seobook
02-04-2005, 01:49 PM
your search science link is missing the "members" in its link

http://spaces.msn.com/members/search-science/

AIstudent
02-04-2005, 02:08 PM
Hello everyone,

what joy! I'm a final year student of Artificial Intelligence and Psychology (in Edinburgh) working long hours on my Bachelor thesis - which is about Latent Semantic Analysis. So for once in my life I feel I can contribute some knowledge :-)

First off, the LSA/LSI confusion: As far as I have seen, LSA and LSI are exactly the same thing. The technology is usually called Latent Semantic Indexing if it is used in an information retrieval context and Latent Semantic Analysis if it is used for language modeling or most other applications. I'll go with LSA from now on.

How does it work?
LSA works with a vector space representation of words (and documents). You can imagine every word as a point in space, only that the space is not 3-dimensional but usually has anything between 150 and 500 dimensions. Words which are more "similar" are closer together in this space. What type of "similar" are we talking about? Well, let's look at how LSA vectors are constructed. You start off with a bunch of M (say, 10,000) documents and a dictionary/vocabulary of N (say, 20,000) word-types from a large corpus (usually > 10m words). Now you build an NxM matrix where you count how often each word n (from N) occurs in a document m (from M). This is one hell of a fat matrix you end up with, but it contains some useful information: which types of words usually occur in the same type of documents. In a sense, you now have an M-dimensional vector describing each word in terms of where it usually comes up. The problem is that these vectors are a) large, and b) influenced by "noise" - maybe two words are actually quite similar but just by coincidence they rarely pop up in the same documents.
This is where the strange mysterious beast of Singular Value Decomposition comes in. I don't fully understand it myself (don't tell my supervisor..) but SVD basically "shrinks" the vectors to a smaller size (e.g. from 10,000 dimensions to 100). The resulting (reduced) vector for a word now, in a sense, contains the "concentrated" semantic information about that word. The beauty of it is that after the SVD process two similar words (e.g. "coke" and "pepsi") have similar vectors, even if by coincidence they never occurred together, just because they have many "common friends", e.g. "drink", "cool", "beverage", "soft drink" etc.
Time complexity grows at least in proportion to NxM, if I remember correctly (a full SVD of a dense matrix costs more, roughly NxM times the smaller of N and M).
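
To make the two steps above concrete, here is a minimal sketch in plain Python: it builds the word-by-document count matrix for a toy corpus and then reduces it to 2 dimensions. The corpus, the function names and the use of power iteration in place of a real SVD library are all assumptions made for illustration; a real system would use a sparse SVD package.

```python
import math

# Toy corpus (an assumption): "coke" and "pepsi" never share a document,
# but they share the "common friends" drink and beverage.
docs = [
    "coke drink beverage",
    "pepsi drink beverage",
    "engine wheel",
    "engine wheel",
]
vocab = sorted({w for d in docs for w in d.split()})
idx = {w: i for i, w in enumerate(vocab)}

# N x M count matrix: one row per word, one column per document.
A = [[d.split().count(w) for d in docs] for w in vocab]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_eigenpairs(C, k, iters=500):
    """Top-k eigenpairs of a symmetric matrix by power iteration with deflation."""
    n = len(C)
    C = [row[:] for row in C]          # work on a copy
    pairs = []
    for _ in range(k):
        v = [1.0] * n
        for _ in range(iters):
            w = [sum(C[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0:
                break
            v = [x / norm for x in w]
        lam = sum(v[i] * sum(C[i][j] * v[j] for j in range(n)) for i in range(n))
        pairs.append((lam, v))
        for i in range(n):             # deflate: remove the component just found
            for j in range(n):
                C[i][j] -= lam * v[i] * v[j]
    return pairs

def lsa_word_vectors(A, k):
    """Reduced word vectors U_k * Sigma_k, obtained from the eigenpairs of A * A^T."""
    n = len(A)
    C = [[sum(a * b for a, b in zip(A[i], A[j])) for j in range(n)] for i in range(n)]
    pairs = top_eigenpairs(C, k)
    return [[math.sqrt(max(lam, 0.0)) * v[i] for lam, v in pairs] for i in range(n)]

vectors = lsa_word_vectors(A, k=2)
print("raw    :", cosine(A[idx["coke"]], A[idx["pepsi"]]))
print("reduced:", cosine(vectors[idx["coke"]], vectors[idx["pepsi"]]))
```

On this toy data the raw count vectors for "coke" and "pepsi" have nothing in common, while their reduced vectors point in the same direction, which is the "common friends" effect in miniature.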

Uhh, that was too much maths, what's the bottom line?
LSA calculates a measure of similarity for words based on their occurrence patterns in documents: how often they appear in the same context, or together with the same set of "common friends".

Seeing is believing. Can I try it?
Go to
http://lsa.colorado.edu/ and play around with the applications.

Where can I read up on it?
The papers at
http://lsa.colorado.edu/ are a good start.
They are written by psychologists, which makes them easier to read than those written by computational linguists :-) (use Google Scholar (http://scholar.google.com/scholar?q=%22Latent%20Semantic%20Analysis%22&ie=UTF-8&oe=UTF-8&hl=en&btnG=Search) if you want to get your hands dirty on formulas).

Could Google (or any other engine) use it?
There are technical and legal aspects to this.
Technical first:
Letting search use some LSA information about the keywords, so that it can also consider similar words, is simple. They just need a corpus (big G's 8,058,044,651 web pages should do fine for most purposes), a dictionary file, some standard SVD algorithms (like this (http://tedlab.mit.edu/~dr/SVDLIBC/)), a few hours' time while it calculates, and some minor changes to their search mechanism.
But LSA allows for something much more sophisticated: Just as every word can be represented as a semantic vector, so can every document be condensed into such a vector. This allows a judgement of how similar two documents are, way beyond just counting words. It even works for different languages (with a few tricks).
What's more, in LSA terms a "document" can be as small as a string of just a few words. So, you can compare semantic similarity between documents and search strings (or other pages... or a few sentences copied&pasted from another site...). Think for a second how much you could do with that!
For this use of LSA, for a corpus the size of the WWW, you'd either have to have a REEEAAAAAL big machine, or a new way of doing SVD, or a system which does it for a small core of a few million documents and then "weaves in" all additional documents into the existing vector space. As I said, time and memory complexity are roughly proportional to NxM. Here M, the number of documents, being several billion (and N being at least many 1000's), would be the critical factor.

Now, legal constraints:
As far as I am aware, some aspects of LSA/LSI for information retrieval are patented (Patent Description (http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=/netahtml/search-bool.html&r=1&f=G&l=50&co1=AND&d=ptxt&s1=landauer.INZZ.&s2=dumais.INZZ.&OS=IN/landauer+AND+IN/dumais&RS=IN/landauer+AND+IN/dumais)) to the people who first worked with it (some of whom now work for companies using LSA).
So, my guess is that, if Google (or anyone else) is interested in technology like this, they will either use some related approach which is not covered by the patent (am no lawyer, no clue how easy it is), or get in touch with the patent holders.

Hope this helps,

Tobi


P.S. Hope this is no forum abuse... will graduate soon, am looking for internships & jobs in this field. PM me if you can help

orion
02-04-2005, 02:31 PM
Please feel free to PM with your resume. I'm looking for top AI staff. But in the future please post in the Help section, as this is a bit off-topic.


Orion

millington
02-04-2005, 02:58 PM
I'm new to this so not sure if this is the right place to post, but here goes.

I have a website at www.construction-index.com which has steadily built up over three years to about 3,000 visitors per day via Google search engines (mainly .com and .co.uk). Then yesterday, February 3rd, the number of visitors via Google suddenly dropped overnight to about one third of its normal level (i.e. to only 1,000 visitors a day). Some of my pages continue to come up in the first few returns on the first page of Google returns; but many of my other pages which used to come high on the first page now come on page three or four of Google returns.

Has anyone else had the same experience? Any suggestions as to what might have caused this? Any suggestions for remedial action please?

xan
02-04-2005, 03:04 PM
Thank you for the warm welcome!

#1 Search for your keyword phrase @ Google
#2 Take the text from the top 100 search results and put them into rows in a table (remove stopwords)
#3 Analyze the most frequently occurring 1, 2, 3 & 4-word phrases among the documents (discount duplicate entries from a single row/page).
#4 Take the keywords that show up in 5+ entries and conduct a C-Index (Keyword Co-Occurrence) calculation for each.
#5 The results will give you the highest C-Index words/phrases for your particular term.
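
A rough sketch of steps #2 to #5, in case it helps to see them spelled out. Everything here is an assumption made for illustration: the tiny stopword list, the four made-up "pages", the lowered threshold, and above all the c-index formula, which is taken to be n_ab / (n_a + n_b - n_ab) with all counts being numbers of pages. Check that against whatever definition you actually use.

```python
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "for", "to", "in"}  # tiny illustrative list

def phrases(text, max_n=4):
    """Set of 1- to max_n-word phrases on one page (duplicates on a page collapse)."""
    words = [w for w in text.lower().split() if w not in STOPWORDS]
    return {" ".join(words[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(words) - n + 1)}

def phrase_page_counts(pages, max_n=4):
    """Step #3: in how many pages does each phrase appear?"""
    counts = Counter()
    for page in pages:
        counts.update(phrases(page, max_n))
    return counts

def c_index(pages, a, b, max_n=4):
    """Assumed definition: pages with both / pages with either."""
    sets = [phrases(p, max_n) for p in pages]
    n_a = sum(a in s for s in sets)
    n_b = sum(b in s for s in sets)
    n_ab = sum(a in s and b in s for s in sets)
    either = n_a + n_b - n_ab
    return n_ab / either if either else 0.0

# Made-up stand-ins for the text of the top results.
pages = [
    "mortgage loans and home refinance",
    "home mortgage rates",
    "refinance the mortgage",
    "car loans",
]
counts = phrase_page_counts(pages)
frequent = sorted(p for p, n in counts.items() if n >= 2)  # the post uses 5+ on 100 pages
print(frequent)
print(c_index(pages, "mortgage", "home"))
```

Using a set per page is what implements "discount duplicate entries from a single row/page": a phrase repeated ten times on one page still counts once.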


Ok. I see where you're going with this, and it is a valid method. What you are talking about is using N-grams (which are strings of words in sequence) and calculating term frequency and then idf (inverse document frequency).

This will easily give you a hunch as to what terms are generally being used, but for a thorough analysis, you would have to compile a corpus of sites relevant to yours.
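
For anyone who has not seen term frequency and inverse document frequency written down, here is a minimal sketch. It assumes the plain textbook weighting tf * log(N / df) over single words and a made-up three-document corpus; real systems use many variants, and the same function would work on n-grams if you feed it phrases instead of words.

```python
import math
from collections import Counter

def tf_idf(docs):
    """One {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    # df: number of documents that contain each term at least once
    df = Counter(t for toks in tokenized for t in set(toks))
    weights = []
    for toks in tokenized:
        tf = Counter(toks)
        weights.append({t: (c / len(toks)) * math.log(n_docs / df[t])
                        for t, c in tf.items()})
    return weights

corpus = ["mortgage rates today", "mortgage refinance rates", "mortgage calculator"]
weights = tf_idf(corpus)
print(weights[2])
```

A term that appears in every document ("mortgage" here) gets a weight of zero, so the terms that stand out are the ones that distinguish a page from the rest of the corpus.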

Using semantic fields is a pretty good method to discover which terms are related.

Do remember to always look at it from a technical point of view. Topic detection between sites is carried out in order to retrieve relevant sites, followed by computational linguistic methods to sort again between this collection, then ranking methods are used to order these sites by relevance. That's a very basic model, but the ranking method will again use many computational linguistic methods.

So looking at it from this point of view - it's important to make sure that your site is relevant and offers a large amount of information, as this encourages density early on. I think you all know these basic but very effective methods.

When I rank, I use similar methods and I definitely get rid of all the noise, which will include websites that do not meet a certain threshold.

Sorry, going off topic here!! Basically, yes, it is a valid method.

(p.s: also know which stopword list is best for you to use, you may even want to make your own - wsj is used a fair bit)

xan
02-04-2005, 03:10 PM
your search science link is missing the "members" in its link

http://spaces.msn.com/members/search-science/

thank you seo book. ;)

orion
02-04-2005, 03:27 PM
#3 Analyze the most frequently occurring 1, 2, 3 & 4-word phrases among the documents (discount duplicate entries from a single row/page).
#4 Take the keywords that show up in 5+ entries and conduct a C-Index (Keyword Co-Occurrence) calculation for each.
Just be sure to assign the proper meaning to the computed c-index.

If computing co-occurrence for phrases, be sure to instruct the system to recognize a term sequence as a phrase. Keep also in mind that this will force an ordering element into the retrieved set of documents, thus excluding documents without the target sequence.

Once you have identified the terms you must do an on-topic analysis. I wish you could be at SES NY.

Orion

xan
02-04-2005, 03:34 PM
You're right Orion, but you can get around this by using the tf from single words up to n-grams and then doing an idf score on those.

you can then see where the precision decreases.

orion
02-04-2005, 03:50 PM
True. You also need to clearly specify the query modes in the search (AND, ANY) to get the right raw data.
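
The difference between the query modes, and the ordering effect of phrase matching mentioned a few posts up, can be seen on a toy example. This is only a sketch: the mode names follow the post, but the matching function and the four made-up documents are assumptions for illustration, not how any engine implements it.

```python
def matches(doc, terms, mode):
    """Does `doc` satisfy the query `terms` under the given mode?"""
    words = doc.lower().split()
    terms = [t.lower() for t in terms]
    if mode == "ANY":       # at least one term present
        return any(t in words for t in terms)
    if mode == "AND":       # all terms present, in any order
        return all(t in words for t in terms)
    if mode == "PHRASE":    # all terms present, adjacent and in this order
        n = len(terms)
        return any(words[i:i + n] == terms for i in range(len(words) - n + 1))
    raise ValueError(f"unknown mode: {mode}")

docs = ["home mortgage rates", "mortgage for your home", "home insurance", "car loans"]
query = ["home", "mortgage"]
hits = {m: [d for d in docs if matches(d, query, m)] for m in ("ANY", "AND", "PHRASE")}
for mode, found in hits.items():
    print(mode, len(found), found)
```

Each mode retrieves a different set, so co-occurrence counts taken from one mode cannot be mixed with counts taken from another.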


Orion

orion
02-04-2005, 06:08 PM
One more thing. If you are going to use IDF you would need to know the size of the collection, as any estimate will introduce an error. Personally for this kind of test I would not use IDF as given in any term vector model.
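
To see how a wrong guess at the collection size feeds into IDF, here is a small sketch. It assumes the plain log(N / df) form; the document frequency and the two wrong guesses are illustrative numbers only, and the middle figure is the index size quoted earlier in this thread.

```python
import math

def idf(collection_size, doc_freq):
    """Plain inverse document frequency, log(N / df)."""
    return math.log(collection_size / doc_freq)

doc_freq = 309_000_000          # illustrative: pages containing the term
true_size = 8_058_044_651       # index size quoted earlier in the thread

for guess in (4_000_000_000, true_size, 16_000_000_000):
    error = idf(guess, doc_freq) - idf(true_size, doc_freq)
    print(f"N = {guess:>14,}  idf = {idf(guess, doc_freq):.3f}  error = {error:+.3f}")
```

With this form the error is log(guess / true size) whatever the term, so every IDF value shifts by the same amount; that shift matters most for common terms, whose IDF is small to begin with.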

Orion

xan
02-05-2005, 05:53 AM
Agreed Orion.

If I was you guys, I would cut to the chase and just check semantic fields.

jazar
02-05-2005, 06:21 AM
Still need to figure out what the ~ operator returns ... before wondering if the main algorithm is taking some semantic into account.

Xan, your method suggests that the results are based on co-occurrence. That seems to be the best guess. But then, how do you explain the results for ~mortgage for instance?

You get microsoft website in the top... words like "microsoft" & "corporation" are quite far from "mortgage".

The second question I have is how come the number of results for "~keyword" is often lower than for "keyword". I would have thought that google would search on keyword, and other semantically related keywords too, and therefore return more results.

xan
02-05-2005, 06:27 AM
btw, I'm not too sure if this belongs here, but as everybody is talking about google technology, here is an announcement from google earlier this year:

Moderator Note: I've removed what was a print of this review from Search Engine Lowdown. Please review that article for what was in this original post:
http://www.searchenginelowdown.com/2004/10/web-20-exclusive-demonstration-of.html (it's well worth a read).

In its place, I'll repost my own summary of that article and some related material. For a copy with active links, see the actual blog post: http://blog.searchenginewatch.com/blog/041008-073413

Google Demos Word Clustering

Andy Beal has a nice write-up of Google showing off its word clustering tools at the recent Web 2.0 conference: Web 2.0 - Exclusive Demonstration of Clustering from Google. Jason Calacanis also has an MP3 audio file of the presentation you can listen to.

We've had some hints at such technology before. Google Sets, which was released in 2002, lets you enter some terms and see other terms that may be related to it.

Related Searches came-and-went quickly with Google AdWords and have occasionally popped up in the past on a very small sample of Google users (see an example here).

Neither Google Sets nor Related Searches provide clustering as was demonstrated or as can be seen via Vivisimo (or Vivisimo's recently launched consumer site Clusty). But some of the underlying clustering technology may be used for these.

Also interesting is mention of Google excluding "noisy" data to focus on the key part of a page. It's common that search engines may ignore "stop words" such as "the" when indexing or searching. However, Google's "named entities" would go beyond that to focus on the core content of a page.

Both clustering and named entities have interesting applications to searchers and search marketing. By understanding clusters of search results, it may be easier for Google (and other search engines) to determine pages that don't seem to belong somehow on a particular topic -- in particular, spam pages that given their often artificial nature might stand out more.

Similarly, understanding the key concepts of a page and first ranking pages based on a concept match, then following on an actual word match, might help eliminate some false poor matches.

Marcia
02-05-2005, 06:32 AM
Welcome xan, and I'd like to extend to you greetings, warmth, and appreciation for your contributions to our collective knowledge.

As in any industry, especially in the field of computer science algorithms, it's important to stay very clued up on what's going on and what developments there are in methods. Try searching for white papers and find conferences on IR that release all the up to date papers, like TREC, SIGIR, IJCI, ECIR,... That is where you're most likely to get a clue
We try. ;)

Currently it looks like it's going to be using techniques we've been playing with for 4 years; clustering has been around for a long time in the form of ngrams for one. For such a system (there's more than just word clusters) to be implemented on a scale like google's, lots of testing and time is necessary.
Yes, and we thank you. It is particularly noticeable with the new MS search, and we'd be well advised to look further into it.

xan
02-05-2005, 06:33 AM
Jazar I don't quite understand what you mean by "~mortgage". I'm probably being a dunce!

Marcia thank you for the welcome, glad to be able to share, and gain knowledge myself.

Marcia
02-05-2005, 07:00 AM
Granted that there are more beneficial occupations for our time than spending it watching Google updates, but I think we'd do well to heed Jake's admonitions at this point in time:

As a matter of fact, I'm seeing old sites now go INTO the sandbox.

A lot of SEOs won't like this when it rolls out. Google's devalued links, and looks like they're increasing reliance on LSI... maybe IDF too, but it's too early to tell.
I don't think Jake is sharing that merely to idly pass wind among us to push our buttons; I don't believe that's his style. ;)

FYI, take note, in this thread, of the admonition toward implementations of semantic diversity:

http://forums.searchenginewatch.com/showthread.php?t=3899&page=2&pp=30

I think what Jake is encouraging us toward here is quite sound, relative to current algos; and there's nothing we've got to lose by heeding his words but increased traffic and ROI.

ferret77
02-05-2005, 07:30 AM
I was reading this thread and I couldn't help but interject

I have been doing SEO for a little over three years, which I know isn't as long as some of you, but it's still long enough to notice a pattern

Every time there is a big update/change in the serps it's always <insert latest search theory here>

It's

Hilltop
Themeing
LSI

It's whatever, and in my experience it never is those things

I know talking about it makes you sound more like a real seo "professional", but is there any proof whatsoever of any of these theories

I still have lots of sites with pretty much totally unrelated links kicking butt

would that be proof that it's not LSI?

What proof, I mean examples of serps, shows that links for sites with the same words in them are worth more?

sugarrae
02-05-2005, 08:29 AM
I think what Jake is encouraging us toward here is quite sound, relative to current algos

I've been researching this for a bit and what I'm seeing in the serps coincides with some of what Jake is saying here, as well as stuff I'm sure a lot of us pick up on and don't post here as it gives us an advantage in the engines.

I may not understand all of the "thesis" style posts of some of those in this thread, but I know what I'm seeing in the serps. I may not be able to post 6 paragraphs of technical terms on why, but I'm still confident that I see merit in Jake's comments and plan to test them, as well as a few other theories I've picked up as a result of some of the comments in this thread.

xan
02-05-2005, 08:34 AM
I've been researching this for a bit and what I'm seeing in the serps coincides with some of what Jake is saying here, as well as stuff I'm sure a lot of us pick up on and don't post here as it gives us an advantage in the engines.

I may not understand all of the "thesis" style posts of some of those in this thread, but I know what I'm seeing in the serps. I may not be able to post 6 paragraphs of technical terms on why, but I'm still confident that I see merit in Jake's comments and plan to test them, as well as a few other theories I've picked up as a result of some of the comments in this thread.


sorry! :o

I'll shut up now, but if anybody wants me to explain something I am happy to do so, and I'll stick the contents of the LSA post I wrote on search science. I already wrote a brief explanation of it there, and a full explanation and history of ranking.

Marcia
02-05-2005, 08:47 AM
ferret, I honestly and truly don't care who thinks I'm an expert or not (which I am *not*, nor do I pretend to be, or by any means or stretch of the imagination care who thinks so or not, nor do I <sic> troll for business </sic> in forums as some most certainly do, which is a *fact*) - but there is hardly any proof possible for any *theory* regarding SEO - not that I have, or that I have seen.

If anyone claims to have "scientific proof" of what current algos are about and claims to be able to optimize for them, more power to them. Hats off! I prefer, personally, to think they are fulla beans and don't for a minute hesitate to say so.

As for me, I prefer to remain and be the "Accidental_SEO" and fly by the seat of my pants, functioning mainly by intuitional cognition and happy to do so, since I believe that most "scientific" theory, unless submitted by totally reliable sources, is nothing more than mental masturbation devised to specifically throw the most naive of innocents off track.

Sorry, but that's how it is IMHO and how it looks from where I sit. Cynicism isn't inborn and intrinsic to our nature; it is acquired by means of vital, though painful, life experiences and recognition of reality which, while some may be loath to accept the facts, cannot however be denied.

I just sometimes wonder what some people have against TRUTH, despite their vigilance in suppressing it, by software and otherwise, at times. What is there to attach a name to? I simply don't know, any more than why I should whore out for reasons I know not what - is it, is this, the way of the "system"? God pity us all, in His infinite mercy and graciousness, if this is what we have come to.

ferret77
02-05-2005, 08:48 AM
I'm interested in understanding it, I'm just curious, do you have some real life examples?

My homepage site was recently knocked out of the serps for some of its main phrases

like "website design"

I thought at first maybe it was because of a different algo, and since most of my links are unrelated, then perhaps there is more value placed on related links.

But I run over 100 sites, and many of their positions haven't budged on this update

and since their links are just as unrelated as my homepage's if not more so, I don't see how any of these theories are shown in the serps

If you could post some examples I would really appreciate it

if you can't post any examples .......

Marcia, I'm not really referring to this thread in particular,

And I am not saying that theories are BS or useless

I'm just saying that often people use updates, like the Florida update, to promote themselves, and whatever theory suits them

Which I think really gets in the way of understanding what is really going on.

randfish
02-05-2005, 01:49 PM
Xan,

I'm very happy you decided to contribute and hope that you stick around.

Back on topic, I understand the purpose behind LSA - to have a better understanding of the concepts that word combinations create and therefore be better able to determine the relevance of a particular document based on its text content.

However, from an SEO point of view, the most advanced work in IR research seems to be going in a direction of greater relevancy overall, rather than the prevention/restriction of so-called 'SEO' tactics. For the commercial search engines (in particular Google), there appears to be a dual-focus on both.

For someone who is trying to understand the technology & theories to get a better understanding of how to use them in his/her favor, it's great to have IR scientists like yourself to help out. I'm wondering whether you have any insight into the other side of things - spam recognition/prevention, counter-SEO work, etc.

Thanks Nacho - Broken off to this thread -Can IR Research Teach us Anything re: Spam Prevention or Anti-SEO Tactics? (http://forums.searchenginewatch.com/showthread.php?p=33221)

Nacho
02-05-2005, 02:24 PM
For someone who is trying to understand the technology & theories to get a better understanding of how to use them in his/her favor, it's great to have IR scientists like yourself to help out. I'm wondering whether you have any insight into the other side of things - spam recognition/prevention, counter-SEO work, etc.
Sounds like a great idea, you guys should start another thread on that. :)

Let's keep this thread on the idea of . . . Do Google's latest changes have anything to do with LSA? If yes, why? If no, why?

Thanks!

xan
02-05-2005, 02:37 PM
Hello Randfish,

firstly, thank you for all the praise, you make me blush!

Work is currently being done in information retrieval, user interfaces, databases, information systems, and interaction strategies to improve relevance feedback.

Topic detection has always been a major area of IR. We can't yet do it 100% and maybe we never will; language is unstable and fickle.

Data mining is very important for IR systems to survive, otherwise they will get swamped and be useless. We need to be able to deal with this information overload. It is a priority.


The semantic web is a very interesting area of research at the moment, and you can find information about that on the w3c site: "The Semantic Web provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries."
At the moment semantic web content makes up only .005% of the web. It will grow dramatically.

At the moment documents have no machine readable semantics, so it has been suggested that we should restructure the document content in a representation that could be exploited by machines.

Google are working on this, see Dr. Norvig's comments here:
Semantic web thoughts and google (http://www.alwayson-network.com/comments.php?id=P7480_0_3_0_C)

He states:

"For the most part we throw away the meta tags, unless there's a good reason to believe them, because they tend to be more deceptive than they are helpful. And the more there's a marketplace in which people can make money off of this deception, the more it's going to happen."

Anyway, personalization techniques are a way of trying a short-cut approach, it can't work for long really, it's too laborious for the user.

I hate to tell you this guys, but SEOs are the bane of my life: my search results are all messed up by it and my algo testing comes back with errors and things, so of course, anti-seo techniques are also pretty important. It's not about fighting the SEO market at all. It's about delivering accurate and relevant results to a user or for a system to use as a knowledge base. Good clean SEO is good for me too, because it describes the site well, and the content is good, excellent! Things like the semantic web could really help us both potentially. By educating I believe this can be a happy world where we can all live together.

For spam recognition and prevention, I use vector based models which fall into A.I categories. They are crucial to information retrieval, because what you consider spam may change depending on the task at hand. If I want to pool data dealing with cars, then everything else is spam to me in this sense. Applying the same technique to your mail box can be ok, but there is an error rate. The best method is to never post your email address online. Not helpful, I know, sorry.

Greater relevancy is always the goal, but rank is also a royal pain. PageRank is well past it now, and others have been tested and developed and work well, but they are based around certain topics. If good results emerge from those techniques, why not classify by category and deal with it that way?

you could play with this:

toolkit (http://sourceforge.net/projects/irtools)

and think about how you would improve it.

Word classification is still very effective and is still applied. Corpus linguistics will always be crucial. I think everybody knows about the Boolean model and the Vector Space Model, but not so much about less known techniques such as fuzzy set theoretic models.

Web mining is the area of IR which applies to you guys. IR deals with a lot of different things like digital libraries and databases etc...

uuuurrrrgggg...long post again, sorry! I hope I answered your question Randfish.

jazar
02-05-2005, 04:27 PM
Hard to keep up, you leave a post, and when you come back a few hours later, 20 people have already posted their own, and yours gets lost :rolleyes:

Xan, I am just wondering why when I type "~mortgage" (http://www.google.com/search?hl=en&q=%7Emortgage) in google, I get microsoft ranked #5.

~ returns pages with terms matching or related.

If you look at the most frequent words on the microsoft home page, they are "microsoft", "corporation", etc.. semantically very far from "mortgage".
(c-index for microsoft & mortgage or corporation & mortgage is very low.)

My point is that I need first to figure out how to anticipate search results when I explicitly search for semantically related pages before even thinking about trying to work out if lsa would explain the latest changes in google SERP.

Other points I cannot work out yet:
- why does google return more results for ~car (http://www.google.com/search?hl=en&q=%7Ecar) than for car (http://www.google.com/search?hl=en&lr=&q=car)

- why does google return the same number of results for ~nokia (http://www.google.com/search?hl=en&lr=&q=%7Enokia) and nokia (http://www.google.com/search?hl=en&lr=&q=nokia), but doesn't return the same number of results for phone (http://www.google.com/search?hl=en&lr=&q=phone) and ~phone (http://www.google.com/search?hl=en&lr=&q=%7Enokia) (related to nokia)

I would be more than grateful if anybody could shed some light on these results.

xan
02-05-2005, 05:22 PM
I don't know why ~mortgages returns microsoft. Sorry.
If you type in ~mortgage google you do get lots of references to finance and banking, which seems in line with this theory.

~car returns BMW, auto, automobile,... it has... 34,900,000 results.
car 309,000,000 results.
Nokia 190,000,000
phone 302,000,000
~phone 39,100,000

Curious isn't it, I understand your confusion. Let's assume that semantic fields are used alright. If they are, different terms although related will include different terms which are closer to their representation/meaning. Brands like "nokia" are pulled out as being proper nouns and so results specific to the term are targeted. If you type in ~phone you will see that "mobile", "Telephone", "long distance",... are also highlighted as relevant to your query. These are in the same semantic field.

What is happening is that when you type in "phone" you are asking for documents that contain this term specifically.
When you type in "~phone" you are asking for as you said "pages with terms matching or related" - this means that semantic fields are used to determine the relationships. This would also account for the disparity in the number of results returned.

well...that's my version anyway :)

mcmrob
02-05-2005, 05:22 PM
Other points I cannot work out yet:
- why does google return more results for ~car (http://www.google.com/search?hl=en&q=%7Ecar) than for car (http://www.google.com/search?hl=en&lr=&q=car)

When you use the tilde (~) in front of a keyword, car in this case, Google will return not only the results for "car" but also for the keywords that have a strong connection (are related) to the keyword "car". "Motor", for example, is a word that's connected to "car". As you can see, it's highlighted in the search results even though you did not search for it. They're related. Because of that, Google will also return pages that do not even contain the word "car", but that do contain the word "motor".
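
Maybe a tiny sketch makes the explanation easier to follow. This is only a toy model of the behaviour described above, not Google's implementation: the related-terms table, the three made-up documents and the `search` function are all assumptions for illustration.

```python
# Hypothetical table of related terms; how a real engine derives it is unknown.
RELATED = {"car": {"motor", "auto", "automobile"}}

def search(docs, query):
    """Return docs containing the term, or any related term for a ~query."""
    if query.startswith("~"):
        term = query[1:].lower()
        wanted = {term} | RELATED.get(term, set())
    else:
        wanted = {query.lower()}
    return [d for d in docs if wanted & set(d.lower().split())]

docs = ["used car dealers", "motor show dates", "sewing patterns"]
print(search(docs, "car"))     # only the page that contains "car"
print(search(docs, "~car"))    # also the page that contains "motor"
```

Under this simple model ~car can never return fewer pages than car, which is exactly why the result counts reported above look puzzling.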

It's a little hard to explain.

Edit: Xan beat me to it with a better explanation :-)

jorock
02-05-2005, 07:16 PM
~mortgage

cached version of Microsoft...
"These terms only appear in links pointing to this page: mortgage "

jazar
02-05-2005, 07:46 PM
Sorry, made a mistake. My question was: "why does google return more results for car than for ~car?"

If google return also related results, it should be the other way round, shouldn't it?

I'll try to pull a theory together then.

My first theory would be that when you use "~" in front of your keyword, google will look into its dictionary, find synonyms (collected by its spiders or input manually, I don't really care at this stage) and just rank them using the same formula as usual. Doesn't look like it.

Then, given the results for ~car and car, I build my second theory: google uses two different spaces, not connected, and built separately. One is used to return results for general queries, and the other for "~" queries.

The second space takes more resources to build => smaller than the first one.

Now, when you query ~car, google will first look into its "dictionary". If google finds the word in it, it returns results from the related semantic "field" (I mean a kind of "sub" space, with documents containing keywords & related keywords), information available from the second space only. If it doesn't find the term in the dictionary, it will return unfiltered data, from the first space.

this dictionary is built "manually".
=> this explains why nokia and ~nokia return the same results (Nokia is not in the dictionary).

Now, if I carry on further, why would google need to use a dictionary to decide on filtering the data or not? To speed up the process: pick up the pointer, and go straight to the right "sub" space.

This pointer is not set up dynamically, because otherwise google would not need the dictionary in the first place. The only solution I can think of is that the pointer is pointing to the keyword vector itself: therefore, my theory is that there are as many "space fields" as the number of keywords in the dictionary:

=> ~bulb returns results with "lighting". But ~lighting doesn't return the same results.

Now why is microsoft ranked #5? Click on the cached page (http://216.239.59.104/search?q=cache:VRA9OSkpgNsJ:www.microsoft.com/+~mortgage&hl=en&start=4) (thanks Jorock)
"These terms only appear in links pointing to this page: mortgage".

Looks like an allinanchor cache: only offpage factors are taken into account. What does it mean? It means that it is useless to use related keywords on your page, you only need to focus on your anchor text.

Now, if I type "mortgage" in google, microsoft doesn't appear anywhere. Can anybody point me out a result suggesting that google uses semantic fields to display results in the main SERP?

This theory may be complete fantasy, but at least it explains the results I see.

detlev
02-05-2005, 07:47 PM
~mortgage

The reason Microsoft appears in the results is their relative high PageRank and the term corp appears to be either semantically or synonymonically [sp? a real word?] connected to the term mortgage.

Microsoft is number one for: corp.

cached version of Microsoft...
"These terms only appear in links pointing to this page: mortgage "

I think this is a basic object placed in Google's display code whereby they automatically display the "links pointing to" thingy when the term isn't found in the cache copy of the doc. It's just a throwaway explanation of why the rank is there when the term isn't.

Microsoft is not Top100 for mortgage.

Hope this helps,
-detlev

jazar
02-05-2005, 07:56 PM
I think this is a basic object placed in Google's display code whereby they automatically display the "links pointing to" thingy when the term isn't found

You are right. I was wrong then when I said that on page factors should not be taken into account.

jazar
02-05-2005, 08:07 PM
The reason Microsoft appears in the results is their relative high PageRank and the term corp appears to be either semantically or synonymonically [sp? a real word?] connected to the term mortgage.

If I follow advice and calculate c-index for these two words, they appear to be quite far from each other. Is there another way to identify terms "semantically" connected?

Even if I am wrong saying that "only" offpage factors should be taken into account, I still think here that they get this rank only because of a large number of inbound links containing keywords related to mortgage (high PageRank is just a consequence).

jorock
02-05-2005, 08:13 PM
thanks Detlev,

The only thing I find confusing is, you can exclude the sites with links, and Microsoft is not in the top 100.

~mortgage -inanchor:mortgage
~mortage -allinanchor:mortgage

Maybe corp isn't as strong of a synonym as the others?
Or they have at least one link that says mortgage?

Notice, you see the same " term in links" message on "-inanchor searches" on other sites that still rank, so I guess the default error message was throwing me off.

The reason Microsoft appears in the results is their relative high PageRank and the term corp appears to be either semantically or synonymonically [sp? a real word?] connected to the term mortgage.

Microsoft is number one for: corp.



I think this is a basic object placed in Google's display code whereby they automatically display the "links pointing to" thingy when the term isn't found in the cache copy of the doc. It's just a throwaway explanation of why the rank is there when the term isn't.

Microsoft is not Top100 for mortgage.

Hope this helps,
-detlev

jorock
02-05-2005, 08:16 PM
Here's an interesting search: It produces no results, did I find all the semantically related terms?

Or is it just a coincidence that all the pages that rank for ~sewing contain these words?

~sewing -sewing -patterns -craft -butterick -sew -embroidery

Notice, I didn't exclude any anchors.

If I follow the advice and calculate the c-index for these two words, they appear to be quite far from each other. Is there another way to identify terms that are "semantically" connected?

Even if I am wrong in saying that "only" offpage factors should be taken into account, I still think that they get this rank only because of the large number of inbound links containing keywords related to mortgage (high PageRank is just a consequence).

jazar
02-05-2005, 08:41 PM
looks like it :cool: Not that many keywords ...

found the mortgage ones ~mortgage -refinance -loans -home -lending -finance -mortgage -financial -bank -interest -corp

I cannot believe that there are so few related keywords. The index cannot be built dynamically. I bet you that they have hired a bunch of students to build the list manually for each word.

randfish
02-05-2005, 08:46 PM
If I follow the advice and calculate the c-index for these two words, they appear to be quite far from each other. Is there another way to identify terms that are "semantically" connected?

Excellent question Jazar - I've been wondering about this too...

Does anyone know of another good method/equation for calculating the semantic connectivity of two keywords or two keyword phrases?

jazar
02-05-2005, 08:58 PM
and some more refining:
~loans -rates -credit -finance -loan -financing -loans -lenders

note that
~loans -rates -card -finance -loan -financing -loans -lenders
~loans -rates -cards -finance -loan -financing -loans -lenders

return the same ("~loans -card" and "~loans -credit" return the same as well, but "~loans -card" and "~loans -cards" do not).

jazar
02-05-2005, 09:15 PM
The method seems to be:

Scrape the first result page, and exclude all the bolded keywords in your query (adding "-" in front of them). Repeat the process until you find "did not match any documents." on the page.

That's it, you have got your google suggestion tool :D.
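
For anyone who wants to automate it, here is a minimal sketch of the loop described above. It does not talk to Google: `fetch_bolded` is a hypothetical function you would have to supply yourself, assumed to return the bolded terms found on the first result page, or None when the page says "did not match any documents." The names `build_query` and `collect_related_terms` are made up for this example.

```python
from typing import Callable, List, Optional


def build_query(seed: str, excluded: List[str]) -> str:
    """Build a query such as '~sewing -sewing -patterns'."""
    parts = [f"~{seed}"]
    for term in excluded:
        # Multi-word phrases are quoted so the whole phrase is excluded.
        parts.append(f'-"{term}"' if " " in term else f"-{term}")
    return " ".join(parts)


def collect_related_terms(
    seed: str,
    fetch_bolded: Callable[[str], Optional[List[str]]],
    max_rounds: int = 20,
) -> List[str]:
    """Exclude bolded terms round after round until nothing matches.

    fetch_bolded is an assumption: it takes a query string and returns the
    bolded terms from the first result page, or None when no documents match.
    """
    excluded: List[str] = []
    for _ in range(max_rounds):
        bolded = fetch_bolded(build_query(seed, excluded))
        if bolded is None:  # "did not match any documents."
            break
        new_terms: List[str] = []
        for term in bolded:
            term = term.lower()
            if term not in excluded and term not in new_terms:
                new_terms.append(term)
        if not new_terms:  # nothing new to exclude, so stop instead of looping
            break
        excluded.extend(new_terms)
    return excluded
```

The `max_rounds` cap is only there so a badly behaved scraper cannot loop forever.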

jorock
02-05-2005, 09:22 PM
The method seems to be:

Scrape the first result page, and exclude all the bolded keywords in your query (adding "-" in front of them). Repeat the process until you find "did not match any documents." on the page.

That's it, you have got your google suggestion tool :D.

I thought that was it, I first stumbled upon it with the sewing terms, other times I always got more results, just wasn't digging deep enough. :)

Is there any insight to the number of results returned to see how strongly related terms are as you narrow it down?

Ron Niebrugge
02-05-2005, 09:39 PM
I have optimized my pages around the word “photos”. I could not understand why I suddenly started getting lots of hits from people searching with words like "pictures", "picture" and "pics"; these are words that don’t even appear on the resulting page. This change must explain it. In November, I didn’t get a single hit from the search word "pictures", so far this month it is my third most searched keyword.

hard target
02-05-2005, 09:42 PM
I thought that was it, I first stumbled upon it with the sewing terms, other times I always got more results, just wasn't digging deep enough. :)

Is there any insight to the number of results returned to see how strongly related terms are as you narrow it down?

Is there any evidence that the "~" algorithm is the same as (or at least a part of) the one that google uses for the purpose of finding relatedness / relevance during the ranking process?

jorock
02-05-2005, 10:28 PM
This has been a major factor since Florida, I think they've recently been turning it up.

I'll try to keep this theoretical, but, I've been using it since then. It works.

The answers seem to be in the results pages, look at the top ranking sites and compare the use of ~words.

In my original post in this thread, I tried to suggest, and offered some evidence, that in the datacenters where the allintitle's don't match, the related words aren't done cookin' yet.

I'd like some feedback on that.

It seems to indicate at least a 2 step process.

allintitle:widget software
is different than
allintitle:software widget

Compare rankings for sites that use these words in the datacenters where the allintitle's match, and again in the ones that don't. In most cases there are significant differences in where sites that use the ~ words rank.

jazar
02-05-2005, 11:32 PM
:confused: Have tried with "blue train", "black socks" ... all the data centers I have tried tend to return different results based on the order of the words, which makes sense.

Have you got an example?

jorock
02-05-2005, 11:44 PM
Disclaimer: I originally posted this just to show the update wasn't finished, I think it's the semantics, I asked for feedback, but didn't hear any.

I watch one term, antivirus software

allintitle:antivirus software
if they match:
pandasoftware will be in there.

allintitle:antivirus software
if they don't match:
pandasoftware isn't in there

their title says Panda Software Antivirus.

The regular search still ranks Panda well. The point I'm trying to make is to look at how different the other rankings are: total search results are down and, though I'm not sure if it's related, the filenames are case sensitive in the datacenters where pandasoftware is gone from the allintitle results.

If you use ~words, you will rank better in the ones where it matches.

jazar
02-06-2005, 06:34 AM
ok,

1) allintitle:antivirus software
http://64.233.171.147/search?hl=en&lr=&q=allintitle%3Aantivirus+software
panda is not in the listing (176,000 results)
http://64.233.161.105/search?hl=en&lr=&q=allintitle%3Aantivirus+software
panda is in the listing (124,000 results)

2) allintitle:software antivirus
http://64.233.171.147/search?hl=en&lr=&q=allintitle%3Asoftware+antivirus
panda is in the listing (115,000 results)
http://64.233.161.105/search?hl=en&lr=&q=allintitle%3Asoftware+antivirus
panda is in the listing (123,000 results)

In both cases, the results don't match, but panda is in listing.

http://64.233.161.105/search?hl=en&lr=&q=allintitle%3Aantivirus+software
panda is in the listing.

displays panda listing even though "software antivirus" is in there, not "antivirus software". Seems to work in one datacenter, not the other one.

If you start a query with [allintitle:], Google will restrict the results to those with all of the query words in the title. For instance, [allintitle: google search] will return only documents that have both "google" and "search" in the title.

doesn't sound like google is looking at the order in which the keywords are displayed to match the query.

http://64.233.161.105/search?hl=en&lr=&q=allintitle%3A%22antivirus+software%22
strange: panda is listed on the top, even though nothing is bolded. I would have expected in this case that google would respect the order.

A bit strange. But if I had come across this, I would have thought:
on ..105, number of results is roughly the same whatever the order (~120000 results) => looks like google is more accurate and returns what it is actually expected to return: Same number of results. Still looking at the order to rank results => ranking is different: takes other factors into account such as the content of the page itself.

How do you link that to ~, and to lsa?

xan
02-06-2005, 08:37 AM
looks like it :cool: Not that many keywords ...

found the mortgage ones ~mortgage -refinance -loans -home -lending -finance -mortgage -financial -bank -interest -corp

I cannot believe that there are so few related keywords. The index cannot be built dynamically. I bet you that they have hired a bunch of students to build the list manually for each word.

It's not that there are so few keywords in that semantic field, there are many, too many - so how would you use that method to find the most related documents?

Semantic fields are always computed using things like wordnet or another type of machine readable thesaurus/dictionary. I don't think students would be a realistic solution for this, or actually any human input as it would take much too long. Think about all the words in the english language let alone in all those foreign languages that Google covers.

(anyway, a lot of students are very competent these days)

"~" is not an algorithm. Its a command. An algorithm is somewhat more complex than that thank god or I would be out of a job. I quote from the columbia encyclopaedia "The software that instructs modern computers embodies algorithms, often of great sophistication." An algorithm is a recursive process for solving a problem. It has a given number of steps, logical procedures.

Or is it just a coincidence that all the pages that rank for ~sewing contain these words?
~sewing -sewing -patterns -craft -butterick -sew -embroidery .... No.


A google suggestion tool would have to run on the principle of synonymy and semantic fields. How else do you propose to find keywords which are related?

Has anyone thought about the size of the documents? If you are calculating idf and tf on documents of different length this is immediately flawed as you are effectively penalizing documents which are shorter.

You have to normalize the values. This means that each term is given a weighting between 0 and 1. In order to do this you have to use tf*idf. After this you would normally use a vector space similarity measure. After this you are really entering into proper information retrieval territory. Different types of probabilistic models can be used. There are a lot of different ways here to determine document similarity, precision and recall effectiveness.
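
As a rough illustration of the normalization step, here is a small sketch of tf*idf weighting with length normalization followed by a cosine measure. It is a textbook-style toy under stated assumptions (length-relative tf, idf as log(N/df), unit-length vectors), not a claim about what any search engine does. The function names `tfidf_vectors` and `cosine` are made up for this example.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return one unit-length {term: weight} vector per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # tf is relative to document length, so short documents are not penalized.
        raw = {
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in raw.values()))
        # After dividing by the vector length every weight lies between 0 and 1.
        vectors.append({t: (w / norm if norm else 0.0) for t, w in raw.items()})
    return vectors


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity of two unit-length sparse vectors (a dot product)."""
    return sum(w * b.get(term, 0.0) for term, w in a.items())
```

With this weighting, a short page and a page twice as long that use the same words in the same proportions get identical vectors, which is the point of normalizing.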

Some other ways would be:
Average Absolute Document Frequency, Average Absolute Query Frequency, Genetic Algorithms , Neural networks, Fuzzy Set, Inference Networks,...

Ranking by term frequency is still Extended (Weighted) Boolean Retrieval, unless you use vector space models, cosine measures, the dice or jaccard methods, and I am sure I forget to mention many.

Basically, I have to say that it is much more complex than LSA/LSI or semantic fields. I personally would use these as the smallest variables to work with; they give me a score to use and a starting point but are definitely not the end product of my work. There are many methods and Google definitely uses a very complex collection of computational linguistic algorithms and systems. What the method is? I don't know. I don't work quite in that area or on such a large scale, which is the beauty of large search engines like that.

jazar
02-06-2005, 10:51 AM
You are thinking from the other side of the barrier Xan. I am building my theory based on my observations. Not based on my understanding of IR systems or algorithms.

google suggestion tool would have to run on the principle of synonymy and semantic fields. How else do you propose to find keywords which are related.

Google has already done the work for me. I don't plan to build my own IR system, just use what google returns for me. And then I'll try to figure a simplified system to anticipate these results.

I don't do engineering here, but reverse engineering, which is the opposite. I don't have to worry too much about tf*idf, etc., since it doesn't help me at the moment to explain the results I see. I understand that this is what makes the wheel turn, but I am only interested in the direction the car is taking.

hard target
02-06-2005, 11:32 AM
You are thinking from the other side of the barrier Xan. I am building my theory based on my observations. Not based on my understanding of IR systems or algorithms.



Google has already done the work for me. I don't plan to build my own IR system, just use what google returns for me. And then I'll try to figure a simplified system to anticipate these results.

I don't do engineering here, but reverse engineering, which is the opposite. I don't have to worry too much about tf*idf, etc., since it doesn't help me at the moment to explain the results I see. I understand that this is what makes the wheel turn, but I am only interested in the direction the car is taking.

Right, as Xan points out the theory is very complicated, and Google's algorithm is for all intents and purposes a black box. Fortunately, a black box for which you can provide inputs and analyze outputs and establish patterns...
I am sure that some understanding of IR algorithms is beneficial, but there is quickly a point of diminishing returns. This is really not an "anti-science" position (if anybody would construe it that way); it's just that the "art and science of reverse engineering" seems to be more crucial to SEO.

randfish
02-06-2005, 12:12 PM
I don't do engineering here, but reverse engineering, which is the opposite. I don't have to worry too much about tf*idf, etc., since it doesn't help me at the moment to explain the results I see. I understand that this is what makes the wheel turn, but I am only interested in the direction the car is taking.

I think you're unwise not to think about tf*idf. It is that understanding that can help you see why your competitors might be ranking ahead of you. Look at Orion's old thread on term weighting...

Xan,
Did you get a chance to see my previous questions? Is there another way (besides C-Indices) to calculate the co-occurrence of two terms/phrases in a search engine's index?

xan
02-06-2005, 12:37 PM
Sorry if I got too in depth there guys.

I understand what you are trying to do but was concerned with the use of idf scores without normalizing them.

"Is there another way (besides C-Indices) to calculate the co-occurrence of two terms/phrases in a search engine's index?" by Randfish

- if you want to look for terms next to each other in an index, then it's n-grams, to which you also have to assign weights. Then do a search to pull them out, I guess. The tf*idf gives you an accurate weighting of each term and you can also apply it to a cluster of terms. I am not sure I answered your question.

Reverse engineering is a daily task really, and is also important for us too (especially when we make mistakes). I think what I meant was that there is only so much to learn from it as the amount you can do is limited if your knowledge is limited.

By all means using semantically related terms is likely to help you. There is no way to measure semantics or predict relevance or ranking either. Doing all the right things though will always help. Keep your content relevant, you can use semantics, link to relevant and clean sites to ensure we keep a certain level of quality, and promote yourself as you would in person.

jazar
02-06-2005, 01:03 PM
I don't think about tf*idf in the scope of this thread, or c-index.

If I have to determine the keywords I want to use on a site, and how I want them to be distributed across the pages, yes, I'll worry about the frequency of keywords, and the way they are distributed across the site.

But finding out how the algorithm behind ~ works, and whether this understanding can lead us to some conclusions about whether or not google uses some semantic factors in the main index, is another story.

You agree with me that there must be another way to calculate how semantically close 2 terms are, to explain why "mortgage" & "corp" are considered as synonyms by google. And explanations given about term weighting don't give me the answers.

I just don't see at the moment how tf*idf or c-index calculations could answer my questions, and help me set up my tests. I may be wrong, but please point me in the right direction if there is any sign showing that I should worry about this part of the algorithm in the context of this thread.

jazar
02-06-2005, 01:06 PM
There is no way to measure semantics or predict relevance or ranking either

Isn't it what SEO is all about ... ?

xan
02-06-2005, 01:24 PM
I guess that is what SEO is to you, but I simply state that there is no magic bullet or exact science. I would hope SEO is about optimising sites to be as good as they can be, given the tools and knowledge available.

"algorithm behind ~ " - please, it's a command which invokes a method.

"You agree with me that there must be an other way to calculate how semantically close 2 terms are, to explain why "mortgage" & "corp" are considered as synonyms by google. And explanations given about term weighting don't give me the answers."

No, it's not about how semantically close words are because this changes constantly depending on the context and the nature of the query if you like. Imagine 2 circles overlapping. The area where they overlap is considered to be the correct words for that context - if you see what I'm trying to say. Each of the circles may represent a subject area.

tf*idf and the c-index idea will simply help you see how competitors are arranging their content. In your world, your target is to rise in the rankings, so your only concern is the competition ranking above you.

randfish
02-06-2005, 01:35 PM
xan,

Is there an equation for calculating n-grams in search engines? I wasn't quite clear on whether this was an alternate formula to the C-Index calculation formula.

xan
02-06-2005, 01:39 PM
Nope, no equation. just use a tf script and modify it to take in more than 1 word.
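
A minimal sketch of what "a tf script modified to take in more than 1 word" could look like. It counts n-word sequences instead of single words; the function name `ngram_tf` and the whitespace tokenization are assumptions for illustration only.

```python
from collections import Counter
from typing import Dict, List


def ngram_tf(tokens: List[str], n: int = 2) -> Dict[str, float]:
    """Relative frequency of each n-word sequence in a tokenized document."""
    # Slide a window of n tokens over the document.
    grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:  # document shorter than n words
        return {}
    counts = Counter(grams)
    return {gram: count / len(grams) for gram, count in counts.items()}
```

Because word order is kept, "antivirus software" and "software antivirus" are counted as different bigrams.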

Nacho
02-06-2005, 01:56 PM
No, it's not about how semantically close words are because this changes constantly depending on the context and the nature of the query if you like. Imagine 2 circles overlapping. The area where they overlap is considered to be the correct words for that context - if you see what I'm trying to say. Each of the circles may represent a subject area.
I hope this helps paint a visual picture.

http://www.ihispanic.com/sewf/search-overlap.gif
<added>Dimensions of circles are not meant to be accurate sizes, but just an example of how kw1 and kw2 overlap.</added>

jorock
02-06-2005, 02:11 PM
Is there any evidence that the "~" algorithm is the same as (or at least a part of) the one that google uses for the purpose of finding relatedness / relevance during the ranking process?

The ~ seems to issue an OR search behind the scenes. I don't think it's a separate algorithm. The OR is fed with the synonyms.
There is definitely an order of importance to the results, meaning, some terms are stronger synonyms than others.

The results in the main search for OR are very similar to the ~ results in many cases, similar, not exact matches, ordering of words is important, and weaker synonyms tend to throw the results off.

So, to agree with Xan, there is an advantage to finding "an order" to how semantically related each term actually is.

jazar
02-06-2005, 02:19 PM
Nice picture Nacho, it translates into a picture what the equation c = n12/(n1 + n2 - n12) means.

The question is: if you draw the circle for the keywords "mortgage" & "corp", you will see that the orange area looks quite small compared to the red & yellow ones.

But "~mortgage" suggests that "mortgage" & "corp" are semantically connected.

Could you please draw me a picture Nacho to explain me why it is the case :-)?

Now, as long as I don't get a nice picture of what is going on with mortgage & corp, I don't see any reason why I should worry about whether LSI is involved in the new update. I don't have the keys to figure out how it works in the results where semantics is explicitly used as a factor (using the operator ~) in the first place!
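
For reference, the c-index equation quoted in this post, c = n12/(n1 + n2 - n12), is easy to compute once you have the three counts. In the sketch below n1 and n2 are assumed to be the result counts for each term alone and n12 the count for both terms together; the function name `c_index` and the numbers in the usage note are made up, not real search counts.

```python
def c_index(n1: int, n2: int, n12: int) -> float:
    """Co-occurrence index c = n12 / (n1 + n2 - n12).

    n1, n2: documents containing term 1 and term 2 respectively.
    n12: documents containing both terms (the overlap of the two circles).
    """
    union = n1 + n2 - n12  # documents containing at least one of the terms
    if union <= 0:
        return 0.0
    return n12 / union
```

With made-up counts, `c_index(100, 50, 25)` gives 0.2: the overlap is a fifth of the combined area of the two circles. A small overlap relative to two large circles, as with "mortgage" and "corp", gives a value close to 0.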

jorock
02-06-2005, 02:23 PM
ok,

1) allintitle:antivirus software
http://64.233.171.147/search?hl=en&lr=&q=allintitle%3Aantivirus+software
panda is not in the listing (176,000 results)
http://64.233.161.105/search?hl=en&lr=&q=allintitle%3Aantivirus+software
panda is in the listing (124,000 results)

2) allintitle:software antivirus
http://64.233.171.147/search?hl=en&lr=&q=allintitle%3Asoftware+antivirus
panda is in the listing (115,000 results)
http://64.233.161.105/search?hl=en&lr=&q=allintitle%3Asoftware+antivirus
panda is in the listing (123,000 results)

In both cases, the results don't match, but panda is in listing.

http://64.233.161.105/search?hl=en&lr=&q=allintitle%3Aantivirus+software
panda is in the listing.

displays panda listing even though "software antivirus" is in there, not "antivirus software". Seems to work in one datacenter, not the other one.



doesn't sound like google is looking at the order in which the keywords are displayed to match the query.

http://64.233.161.105/search?hl=en&lr=&q=allintitle%3A%22antivirus+software%22
strange: panda is listed on the top, even though nothing is bolded. I would have expected in this case that google would respect the order.

A bit strange. But if I had come across this, I would have thought:
on ..105, number of results is roughly the same whatever the order (~120000 results) => looks like google is more accurate and returns what it is actually expected to return: Same number of results. Still looking at the order to rank results => ranking is different: takes other factors into account such as the content of the page itself.

How do you link that to ~, and to lsa?

1. I think it all comes down to "exact matches" are favored during the build.
So pages that just target the main term in links and on page rank better there. When they normalize everything and finish calculating the synonyms, the sites that use them rank higher.

2. Total number of results returned are down, they haven't included the pages that match the "related" ~ terms.

I think the word order "normalization" is part of the build process, and so are the case sensitive filenames.
I use this to see the datacenters that are only half done, in those datacenters, sites that make use of ~ words rank lower, when it's done, they rank higher.

It's tough to see now, because it's almost done, we may have to wait for the next build. This happens about once per month for the past 4 months.

They've slowly been increasing the importance of synonyms for the past few months.

IMO, it's not an algorithm change, they just keep turning that knob higher ;)

jorock
02-06-2005, 02:35 PM
nice picture Nacho, translates in a nice picture what the equation c = n12/(n1 + n2 - n12) means.

The question is: if you draw the circle for the keywords "mortgage" & "corp", you will see that the orange area looks quite small compared to the red & yellow ones.

But "~mortgage" suggests that "mortgage" & "corp" are semantically connected.


Could you please draw me a picture Nacho to explain me why it is the case :-)?

Now, as long as I don't get a nice picture of what is going on with mortgage & corp, I don't see any reason why I should worry about whether LSI is involved in the new update. I don't have the keys to figure out how it works in the results where semantics is explicitly used as a factor (using the operator ~) in the first place!

They're connected, but not as strongly as the other terms.
There is an order to it, some terms are stronger than others.

Not sure what to tell you, other than try it.

jorock
02-06-2005, 02:47 PM
I'm not saying this is LSI, or how it actually computes the related terms, I can't prove or disprove what they actually use, I'm just saying they're part of the algorithm.
Though, I'm pretty sure it's computers determining it, and not a bunch of students ;)

Nacho
02-06-2005, 02:56 PM
Thank you Jazar! :)
now, as long as I don't get a nice picture of what is going on with mortgage & corp, . . .
This is how I believe the picture would paint:
http://www.ihispanic.com/sewf/mortgage-corp.gif
. . . and don't see any reason why I should worry about whether LSI is involved in the new update.
I don't see why you should worry about LSI at all. What I'm getting from this thread is a confirmation that it is too computationally expensive for any search engine to implement such a technique on 8 billion documents.

jazar
02-06-2005, 03:03 PM
ahah, ok, thanks again Nacho.

Will have a look at the results this month jorock, and check whether using related keywords (using the process described above) improves ranking ... and check the allintitle stuff to trigger measures.

But I don't know, I feel a bit like someone who is told to put his hand on his head, to say a magic word 3 times, and to spin on himself 3 times to get cured from a flu... Well, if I can repeat that 20 times, and if it works more than 95% of the times, I agree that I would be dumb not to use it.

But it is a bit frustrating ...

jazar
02-06-2005, 03:11 PM
oh, another drawing challenge Nacho

http://www.google.com/search?hl=en&lr=&q=%7Enokia+-nokia
Nokia doesn't seem to be connected to anything.

http://www.google.com/search?hl=en&lr=&q=%7Ephone
But "phone" seems to be connected to Nokia.

How would you draw that?

jorock
02-06-2005, 03:18 PM
ahah, ok, thanks again Nacho.

Will have a look at the results this month jorock, and check whether using related keywords (using the process described above) improves ranking ... and check the allintitle stuff to trigger measures.

But I don't know, I feel a bit like someone who is told to put his hand on his head, to say a magic word 3 times, and to spin on himself 3 times to get cured from a flu... Well, if I can repeat that 20 times, and if it works more than 95% of the times, I agree that I would be dumb not to use it.

But it is a bit frustrating ...

Lol, point taken.

But I think the discussion has led to something a bit more accurate than that.
You don't need to wait a month, add a related word to your page and see what happens when it gets reindexed. Play around with how you place them, look at sites, especially spammers, and see how they use them, and you'll find your answers.

This should probably be in its own thread; arguing the existence of LSI or what they actually use to determine the related terms is different than this.
Though, I'm definitely interested if somebody wants to tackle that. ;)

I'm pretty sure some form of semantics being in place and using the ~ to find them is old news, and was fairly well documented after Florida, it just got lost in all the noise around that.

Robert_Charlton
02-06-2005, 04:13 PM
One search I've been watching since Florida is ~mattresses. Note that Lava Beds National Monument comes up something like #3.

Obviously, on regular searches, Google is not saying that "mattresses" and "beds" are synonyms. But I'd bet that, given two mattress pages, otherwise identical, except that one contained the word "bed" and the other didn't, the one with "bed" would rank higher.

I'd doubt that just "bed" would outrank just "mattress" if the tilde were not used.

jazar
02-06-2005, 04:22 PM
Orion has kindly left a summary of the scientific methodology on another thread - it starts like this:

1. Gather observations of a phenomenon.
2. Based on the observations, formulate a hypothesis to consistently explain the phenomenon.

You are moving to step 2 Robert, before validating step 1 ...

jorock
02-06-2005, 05:13 PM
Phrases are synonyms too, you have to be careful as you narrow down the terms, stripping out one term takes away the phrase too.

~beds -beds -bed -mattress -bedding -bedfordshire
shows "bedroom furniture"

~beds -beds -bed -mattress -bedding -bedfordshire -bedroom
furniture is gone.
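
If you are scraping the bolded terms, one way to keep phrases such as "bedroom furniture" together is to merge bold runs that are separated only by whitespace. The sketch below assumes the result snippets mark matched terms with plain <b> tags, which is an assumption about the page markup; the class and function names are made up for this example.

```python
from html.parser import HTMLParser
from typing import List


class BoldPhraseParser(HTMLParser):
    """Collect bolded text, merging adjacent <b> runs into one phrase."""

    def __init__(self) -> None:
        super().__init__()
        self.in_bold = False
        self.phrases: List[str] = []   # completed phrases, lowercased
        self._current: List[str] = []  # words of the phrase being built

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            self.in_bold = True

    def handle_endtag(self, tag):
        if tag == "b":
            self.in_bold = False

    def handle_data(self, data):
        if self.in_bold:
            self._current.extend(data.split())
        elif data.strip():
            # Non-bold, non-whitespace text ends the current phrase.
            self._flush()

    def _flush(self) -> None:
        if self._current:
            self.phrases.append(" ".join(self._current).lower())
            self._current = []

    def close(self) -> None:
        super().close()
        self._flush()  # keep a phrase that runs to the end of the snippet


def bolded_phrases(html: str) -> List[str]:
    parser = BoldPhraseParser()
    parser.feed(html)
    parser.close()
    return parser.phrases
```

A snippet where "Bedroom" and "furniture" are bolded next to each other comes out as the single phrase "bedroom furniture", so it can be excluded as a phrase rather than word by word.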

xan
02-06-2005, 05:20 PM
Thank you Jazar! :)

This is how I believe the picture would paint:
http://www.ihispanic.com/sewf/mortgage-corp.gif

I don't see why you should worry about LSI at all. What I'm getting from this thread is a confirmation that it is too computationally expensive for any search engine to implement such a technique on 8 billion documents.

Nicely done :)

Robert_Charlton
02-06-2005, 05:25 PM
Orion has kindly left a summary of the scientific methodology on another thread - it starts like this:

1. Gather observations of a phenomenon.
2. Based on the observations, formulate a hypothesis to consistently explain the phenomenon.

You are moving to step 2 Robert, before validating step 1 ...

jazar - I have a college background in theoretical math and physics... long left behind... but it's given me some clues about the scientific method. I don't think that the language I was using ("I'd bet..." "I'd doubt") would be confused with the language of a scientific hypothesis. ;)

Rigorously testing a hypothesis often is not possible in SEO anyway, and the scientific method often runs into some limitations here. It is important, though, to separate opinion from fact.

Let's say that I was just tossing out an observation and some thoughts that might be useful for others sitting around the table to use in building their own theories.

jazar
02-06-2005, 06:05 PM
ok, sorry robert, didn't want to sound patronising in any way.

Phrases are synonyms too, you have to be careful as you narrow down the terms, stripping out one term takes away the phrase too.

Same for ~mortgage, removing cards is the same as removing credit. The co-occurrence factor is high within the ~mortgage "circle". Is it worthless to calculate the co-occurrence factor outside of the circle then?

Everyman
02-06-2005, 06:55 PM
I think Google's algorithm works like this:

http://www.google-watch.org/gifs/code3.gif

Mike Grehan
02-06-2005, 07:42 PM
Guys,

I don't have time to read the whole thread here. But if you want to know about latent semantic indexing, I wrote about it (for those who have a copy of my second edition eBook) in the how search engines work chapter.

That was three years ago. And I was inspired, more recently, to go look at my research again after MSN launched.

Susan Dumais may be one of the most important researchers in this field.

And keywords that we live on right now... may never be the same keywords again, to everyone!

http://lsi.research.telcordia.com/lsi/papers/execsum.html

But, excuse me if this was covered earlier in the thread. I'm on holiday in Venice, Italy, for the carnival and don't have time to read it all just now.

Back with more later.

Cheers.

Mike.

xan
02-06-2005, 07:59 PM
She is a great lady and her stuff is always good. A microsoft chick! I will see her again in April at SEM, and perhaps at SIGIR. I encourage you to look at her papers, very very good.

Of course in this area there is also Chen, Horvitz, Jurafsky, Hearst, the wonderful sir Brill, van Rijsbergen,....

If computer science was a football team, they'd all be in mine ;)

Have a great holiday.

Mike Grehan
02-06-2005, 08:11 PM
She is a great lady and her stuff is always good. A microsoft chick! I will see her again in April at SEM, and perhaps at SIGIR. I encourage you to look at her papers, very very good.

Of course in this area there is also Chen, Horvitz, Jurafsky, Hearst, the wonderful sir Brill, van Rijsbergen,....

If computer science was a football team, they'd all be in mine ;)

Have a great holiday.

I AM having a great holiday.

But I think your references may be taking this thread off topic again.

However, if it's pure information retrieval we're talking about... Then one of the masters (along with Salton) lives and teaches only 90 minutes drive from where I live:

http://www.dcs.gla.ac.uk/Keith/Preface.html

Now you're talking information retrieval ;-)

xan
02-06-2005, 09:21 PM
Hehehe.. this man (C. J. van RIJSBERGEN) is one of my favorite scientists. Baeza-Yates remains in pole position.

Yes, this is off topic.

Back to LSI and semantics.

Nacho
02-06-2005, 11:21 PM
Baeza-Yates remains in pole position.
Ricardo Baeza-Yates, another great Hispanic superstar in the world of IR and yes, search too!

In his book, Modern Information Retrieval, co-written with Berthier Ribeiro-Neto, he clearly states how "Latent Semantic Indexing is an approach introduced in 1988". This section is well worth taking a look at for those who have not done so yet.

jorock
02-07-2005, 12:24 AM
ok, sorry robert, didn't want to sound patronising in any way.



Same for ~mortgage, removing cards is the same as removing credit. The co-occurrence factor is high within the ~mortgage "circle". Worthless to calculate the co-occurrence factor outside of the circle then?

Definitely not worthless to know them. In many cases, the phrases are very strongly related.

example
~antivirus = virus scan

Just saying for the "simpler" screenscraper spec, make sure you can distinguish phrases.

jorock
02-07-2005, 12:42 AM
One search I've been watching since Florida is ~mattresses. Note that Lava Beds National Monument comes up something like #3.

Obviously, on regular searches, Google is not saying that "mattresses" and "beds" are synonyms. But I'd bet that, given two mattress pages, otherwise identical, except that one contained the word "bed" and the other didn't, the one with "bed" would rank higher.

I'd doubt that just "bed" would outrank just "mattress" if the tilde were not used.

The "Lava Beds National Monument" ranks well for beds, but doesn't rank for any of the synonyms

This is good info, it probably shows it's just a tie breaker, at least for competitive phrases.

The questions are,...
if the word "mattress" was on the page, would it rank higher for beds. ;)

~beds = mattress

search10
02-07-2005, 12:57 AM
~games includes "chess" but no other specific game. The top 100 results for "games" yields no sites that feature chess, while card and puzzle games are represented.

How it was decided that only chess merits inclusion in ~games may be a good parlor game, but while ~ has been around quite some time and should not be ignored, there isn't even slight evidence it is particularly in play these past few days.

Robert_Charlton
02-07-2005, 02:30 AM
The "Lava Beds National Monument" ranks well for beds, but doesn't rank for any of the synonyms

This is good info, it probably shows it's just a tie breaker, at least for competitive phrases.

The questions are,...
if the word "mattress" was on the page, would it rank higher for beds. ;)

~beds = mattress

Yes, it's interesting that "Lava Beds National Monument" doesn't come up higher for ~beds... and it might indeed rank higher for that search if it contained "mattress" or "mattresses" on the page. That's of course when searching using the tilde... not necessarily so for default searches, though that's why many of us have been thinking about the tilde since Florida.

I remember in my tilde explorations noticing that "~kitchen" brought up "food." I've optimized for several "kitchen" related topics where "food" wouldn't really be contextually appropriate, but I remember asking myself if working "food" into the page would help.

My guess is that the AdWords Keyword tool Broadmatch suggestions might in fact be a better source for other terms to include.

Robert_Charlton
02-07-2005, 02:37 AM
A follow up thought... as I read the LSI papers, I get the sense that, if something like LSI is used as a weighting factor, what would end up being rewarded by this factor would be a proximity to the norm.

Do those more versed than I in this area feel this is so? If not, what's a more helpful way of looking at it?

jazar
02-07-2005, 05:32 AM
Just saying for the "simpler" screenscraper spec, make sure you can distinguish phrases.

good point!

~games includes "chess" but no other specific game.
the student who is in charge of the word games is fond of chess :D, and despises all the other games.

My guess is that the AdWords Keyword tool Broadmatch suggestions

Have you noticed any similarities between Broadmatch and what ~ returns?

xan
02-07-2005, 09:26 AM
As I suggested before, it is very, very unlikely that LSI/LSA is being used to weight any of this, as it is well known in the research community that LSI alone is flawed for several reasons:

The concept space is not understandable by humans.

The information is all numbers without semantic meaning.

Performance: the SVD algorithm is O(N^2 k^3), where N is the number of terms plus documents, and k is the number of dimensions in the concept space.
k will be small, from 50 to 350, but N grows rapidly as the number of terms and the number of documents increase. This makes the SVD algorithm unfeasible for a large, dynamic collection (like the ones a search engine deals with).

There is no general consensus on the optimal number of dimensions in a concept space. See Dumais (TREC), Deerwester, etc... all the findings are different.

Performing an SVD is simply too time consuming to do on a regular basis, and much too expensive because of this.

We don't know how many updates we can perform before precision and recall performance degrades (unacceptable to Google).

Deerwester S., Dumais S., Furnas G., Landauer T. and Harshman R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, vol. (41/6): 391-407.
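
For anyone who has not seen LSI in action, here is a minimal sketch of the idea in the Deerwester et al. paper, on made-up data. The term counts, the three toy documents and every function name are assumptions for illustration only. It finds the concept space by power iteration on A^T A, which gives the same document coordinates as a truncated SVD for a matrix this small; a real system would use a proper SVD routine, and nothing here says what any search engine actually runs.

```python
import math

# Toy term-document counts (hypothetical data): rows are terms, columns are documents.
# d0 = mattress shop page, d1 = bed shop page, d2 = Lava Beds park page.
A = [
    [2, 0, 0],  # mattress
    [1, 2, 1],  # bed
    [1, 1, 0],  # sleep
    [0, 0, 2],  # lava
    [0, 0, 1],  # monument
]
n_docs = len(A[0])

# Gram matrix G = A^T A; its eigenvectors are the right singular vectors of A.
G = [[sum(row[i] * row[j] for row in A) for j in range(n_docs)]
     for i in range(n_docs)]

def top_eigenpair(M, iters=500):
    """Power iteration: dominant eigenvalue and eigenvector of a symmetric matrix."""
    n = len(M)
    v = [1.0 / (i + 1) for i in range(n)]
    for _ in range(iters):
        w = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    lam = sum(v[i] * sum(M[i][j] * v[j] for j in range(n)) for i in range(n))
    return lam, v

def concept_space(M, k):
    """Return k-dimensional document coordinates (singular value times v_i[j])."""
    M = [row[:] for row in M]
    n = len(M)
    coords = [[] for _ in range(n)]
    for _ in range(k):
        lam, v = top_eigenpair(M)
        for j in range(n):
            coords[j].append(math.sqrt(lam) * v[j])
        # Deflate so the next pass finds the next-largest eigenpair.
        M = [[M[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return coords

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

docs = concept_space(G, k=2)
sim_shops = cosine(docs[0], docs[1])  # mattress shop vs bed shop
sim_lava = cosine(docs[0], docs[2])   # mattress shop vs Lava Beds
print(round(sim_shops, 2), round(sim_lava, 2))
```

With only two concept dimensions kept, the two shop pages end up much closer to each other than either is to the Lava Beds page. Note that the coordinates themselves are just numbers, which is the "not understandable by humans" point above.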

I am not saying it isn't used. I'm saying it definitely isn't used alone. Seeing it as the sole way of determining semantic relatedness between words is unrealistic.

WordNet is often used as part of a method, as are other machine-readable dictionaries. Latent would refer to the shortest measure of weights between 2 words.

The Lesk algo will use surrounding words to define the semantic class of a word in that context.
The Resnik measure is based on a concept hierarchy.
The Jiang-Conrath measure is based on the shortest path between concepts.
The Hirst-St-Onge measure covers the similarity between words in WordNet, not restricted to nouns.
Banerjee-Pedersen uses the words to the left and right of the target which are known to WordNet.
Pedersen opts for supervised learning methods.
Quillian uses shared words in dictionary definitions.
Niwa and Nitta use content vectors based on co-occurrence found in large corpora.
Agirre and Rigau use a similarity measure based on conceptual density to work out semantic relatedness in nouns.
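
To make one of these concrete, here is a rough sketch of a simplified Lesk-style approach: pick the sense whose dictionary gloss shares the most words with the surrounding context. The two glosses, the stopword list and the function names are invented for this example; a real implementation would pull glosses from WordNet or another machine-readable dictionary.

```python
# Hypothetical mini-dictionary: word -> {sense label: gloss}. Not taken from WordNet.
SENSES = {
    "bed": {
        "furniture": "a piece of furniture with a mattress used for sleep or rest",
        "geology": "a layer of rock or lava deposited on the ground",
    }
}
STOPWORDS = {"a", "an", "the", "of", "or", "on", "for", "with", "in", "and", "to", "is"}

def tokens(text):
    """Lowercase, split on whitespace, drop stopwords."""
    return {w for w in text.lower().split() if w not in STOPWORDS}

def simplified_lesk(word, context):
    """Return the sense whose gloss overlaps most with the context words."""
    ctx = tokens(context)
    best_sense, best_overlap = None, -1
    for sense, gloss in SENSES[word].items():
        overlap = len(ctx & tokens(gloss))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

print(simplified_lesk("bed", "buy a new mattress and bed for better sleep"))
print(simplified_lesk("bed", "the lava bed is a layer of volcanic rock"))
```

The first context shares "mattress" and "sleep" with the furniture gloss, the second shares "lava", "layer" and "rock" with the geology gloss, so the same word lands in two different semantic classes depending on its neighbours.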

... The point of that list is to show that there are many ways to measure semantic relatedness, and there are many that I haven't even listed. LSI is the most basic form, although revolutionary when introduced by the wonderful Susan Dumais.

I just think that thinking of LSI as Google's way of measuring term relatedness is short-sighted, with all due respect.

The best thing that you can do is describe your business and the reason for having it there (e.g. N provides quality biscuits to its customers. They are baked in our ovens ...).

Using links to related sites and so on is perfectly correct, and a clean well presented and coded site is beautiful for an index


;)

hard target
02-07-2005, 10:41 AM
...The best thing that you can do is describe your business and the reason for having it there (e.g. N provides quality biscuits to its customers. They are baked in our ovens ...).


;)
Well, that just sounds as if it's coming from Google's adv. department - just write relevant content, don't do any SEO, and you will be at the top... yeah, right.
The fallacy of this argument is in assuming that - just because Google's algorithm is (or might be) based on so much theoretical knowledge - it works perfectly (or even just acceptably).
But it doesn't, does it? So, the purpose of SEO is to find what the algorithm really does (as precisely as possible), not what its claimed noble purpose is (placing the most relevant content at the top of the SERPs), and then manipulate the content so that it seems most relevant to Google (while still being acceptable to humans).

So, how does one do that?
1. By thoroughly understanding the algorithms you mentioned (as I am sure you do) and then running a number of simulations giving different weights to different algos - something so brain- and time-consuming that it is unrealistic to expect from an SEO business.
2. By superficially understanding the main concepts of IR and then reverse engineering, starting from the output, noticing patterns etc...
But to say "the best you can do is describe your business...." - I just don't buy that. What then is the reason for having SEO in the first place?

xan
02-07-2005, 11:15 AM
Well, that just sounds as if it's coming from Google's adv. department - just write relevant content, don't do any SEO, and you will be at the top... yeah, right.
The fallacy of this argument is in assuming that - just because Google's algorithm is (or might be) based on so much theoretical knowledge - it works perfectly (or even just acceptably).
But it doesn't, does it? So, the purpose of SEO is to find what the algorithm really does (as precisely as possible), not what its claimed noble purpose is (placing the most relevant content at the top of the SERPs), and then manipulate the content so that it seems most relevant to Google (while still being acceptable to humans).

So, how does one do that?
1. By thoroughly understanding the algorithms you mentioned (as I am sure you do) and then running a number of simulations giving different weights to different algos - something so brain- and time-consuming that it is unrealistic to expect from an SEO business.
2. By superficially understanding the main concepts of IR and then reverse engineering, starting from the output, noticing patterns etc...
But to say "the best you can do is describe your business...." - I just don't buy that. What then is the reason for having SEO in the first place?


I'm not an SEO. I just mean that if your site is clean that's a big plus. The only thing I am doing here is trying to point you in a more realistic direction, that's all. Notice the semantically related words in my example. As for what is SEO for? To get a favorable ranking using legitimate methods? I'm not from Google advertising. I have nothing to do with advertising. I'm a scientist, a researcher. Do what you like. I'm just sharing some knowledge.

My purpose for being here? My colleagues don't believe our business and SEO can ever exist harmoniously side by side. I think SEOs and webmasters can perhaps help, because sites being presented a certain way helps us. It's a bit like I have set out to prove it. A bet if you like. We'll see if I can show that it is possible.

hard target
02-07-2005, 11:43 AM
... I'm not from Google advertising. I have nothing to do with advertising...


Sorry if it seems that I implied that you were - absolutely not my intention. And your comments and references are valuable.
It is just that I didn't agree with one particular statement. It just seems that following your recommendation only would result in "weak" SEO. Yes, I noticed your semantically connected example. But this is basically the same answer as the one to the ever-repeating novice question - "I have a site selling widgets; I need more content - how do I create it?" And the answer is almost invariably - "create a page about the history of widgets, other uses of widgets ...." - which really means "create content semantically connected to widgets". There is obviously nothing wrong with that. But SEOs need to go a couple of steps further - optimize the clean site so that it is still clean but has the flavor of semantic connectedness that Google and/or other SEs prefer - now, how best to do that is obviously subject to (this lively) debate.
I don't believe in "Create it [clean site] and they [SE] will come".

jorock
02-07-2005, 01:05 PM
~games includes "chess" but no other specific game. The top 100 results for "games" yields no sites that feature chess, while card and puzzle games are represented.

How it was decided that only chess merits inclusion in ~games may be a good parlor game, but while ~ has been around quite some time and should not be ignored, there isn't even slight evidence it is particularly in play these past few days.

Sites featuring chess make it in the top 200.

Sites featuring many of the other synonyms are all over the top 100.

~games -chess -gaming -cheats -games -gamer -software -activities -demos
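
The exclusion trick in that query can be sketched as a tiny helper: each time a new related word turns up in the results, add it to the exclusion list and search again. The function name is made up for this example, and it only builds the query string; running the search and spotting the next related word is still done by hand.

```python
def build_tilde_query(seed, found_terms):
    """Build '~seed -term1 -term2 ...' so already-known related terms are excluded."""
    exclusions = " ".join(f"-{term}" for term in found_terms)
    return f"~{seed} {exclusions}".strip()

# Start with the bare tilde query, then exclude each related word as it is found.
found = []
print(build_tilde_query("games", found))
for term in ["chess", "gaming", "cheats"]:
    found.append(term)
    print(build_tilde_query("games", found))
```

When a round brings up no new related word, the exclusion list is the set of terms the ~ operator associates with the seed.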

The working theory is it's a tie breaker, and not the sole reason for rankings.

games - chess significantly reduces the number of results returned too.??
Any insight here?

jazar
02-07-2005, 01:26 PM
Xan, have you got integration testers working for you? They will tell you that they are not interested (at least not during working hours) in knowing how clever the code is. They just know that the application doesn't behave the way it should - they can reproduce this behaviour, and ask you to fix it against functional requirements.

functional requirements for search engines are quite simple: provide for any query the most relevant result to the user. As an integration tester (or SEO ...), my challenge is to find ways to break this application, and to show that I can reproduce a test case where functional requirements are not met.

My training gives me the keys to understand roughly what concepts are behind it, but I am not really interested in the beauty of the code, or the algorithms used. All I need are functional requirements. I leave box testing to engineers.

It's interesting to understand what box testing is, but software will always behave differently in a real life environment. That's why, whatever Google does, and regardless of the amount of effort they put into improving the quality of the application, there will always be ways to hack it.

Now, you must make the distinction between people who are only interested in using hacking techniques to spam the SERPs, and others who use them wisely just to keep an edge on their competitors.

Most spammers will not care about getting an education. The people you feed with your references are already educated SEOs :-), and they already behave well and optimise their sites properly, providing good content ...

It is a nice effort to try educating the mob, but as long as you don't provide a simplified tutorial introducing the complex concepts, you will only build a community of people where most of them "think" they are in the know, and start using words & concepts they don't even understand themselves to try justifying their SEO title (before I get a return on this, I am not targeting anybody here :).

randfish
02-07-2005, 01:30 PM
My colleagues don't believe our business and SEO can ever exist harmoniously side by side.
Xan,

I believe that ethical SEOs and search engineers can absolutely get along. I think there is a vast amount of knowledge to be shared between the groups that will help to make the Internet a better place overall.

Let me give some examples:

1. Engineers making SEOs understand that keyword density and keyword stuffing are not valuable, and that term weight is a far better measure of a page's connection to a term/phrase.

2. SEOs helping engineers understand good methods to stop blog spam (rather than nofollow), including separate indexes for blogs, tags that indicate a software type of blog, patterns that show blog spam, etc.

3. Engineers helping SEOs understand why well-written, natural language is optimal for SEO thanks to advanced sentence structure analysis and its use to tag spam pages.

4. SEOs helping engineers to see blatant spamming/scamming techniques including the 301 hijacking, new techniques for cloaking, re-directs, etc.
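
To illustrate point 1, here is a small sketch comparing raw keyword density with tf-idf, one common term-weighting scheme. The three toy pages and the function names are made up, and tf-idf is used only as a stand-in for "term weight"; nothing here claims to be the formula any search engine uses.

```python
import math

def keyword_density(term, doc_tokens):
    """Share of the page's words that are the given term."""
    return doc_tokens.count(term) / len(doc_tokens)

def tf_idf(term, doc_tokens, corpus):
    """Term frequency scaled down for terms that appear on many pages."""
    tf = doc_tokens.count(term) / len(doc_tokens)
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

# Hypothetical three-page collection.
corpus = [
    "cheap loans cheap loans the best loans".split(),
    "the history of commercial lending and loans".split(),
    "the lava beds national monument".split(),
]
page = corpus[0]

for term in ["loans", "cheap", "the"]:
    print(term, round(keyword_density(term, page), 3), round(tf_idf(term, page, corpus), 3))
```

On the first page "loans" has the highest density, but "cheap" gets the higher weight because it is rarer across the collection, and "the" scores zero because it is on every page. Repeating a word that everybody else also uses buys very little.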

I don't feel at odds with any search engineer. Whenever I take on a project, I always ask, what can I do to make this site deserving of the top 10 positions in the search engines, then I make the necessary changes and additions, make the site 'worthy' and useful to searchers, and begin optimization.

In the long term, search engineering will advance enough so that the quality of a site's content is deserving of top ranks, add it to the websites, and promote them for the world (and the search engines) to see and judge.

Many of you know that I optimize for a site (avatarfinancial.com) in the commercial lending industry - a sector rife with corruption, exorbitant sliminess and lots of dishonest people. I work with Avatar because they're honest people, honest businessmen and fairly priced for what they offer. They won't scam people out of their money, they won't lie and say they can fund when they can't, just to get a down payment. They are direct lenders, so they don't add broker fees and can work with anyone (broker or borrower) who comes to them. When someone submits a loan on the site, my boss calls them up right away and if we can't help them, she refers them to a reputable company in the industry who can. It feels good to work for them because I know that searchers who find our site are faring much better than the searchers who get to our competitors. I'm proud to do the work I do, just like the search engineers are proud to provide great results to people searching online. To me, SEOs & engineers are two peas in a happy little Internet pod.

Don't worry, I know there's exceptions and bad eggs, but that's true in every industry. You just have to rely on the fact that over the long run, honest, useful, strong companies will survive and flourish.

jazar
02-07-2005, 01:36 PM
Another observation: a "UK only" search returns ~games -games -cheats -gamers -activities -demos -chess -software -gaming

It doesn't seem to be related to the initial pool of pages you are running the search on.

jorock
02-07-2005, 01:43 PM
Another observation: a "UK only" search returns ~games -games -cheats -gamers -activities -demos -chess -software -gaming

It doesn't seem to be related to the initial pool of pages you are running the search on.

If I understand you correctly...

It leads back to it being a separate build: the related words are predetermined for the live search, and some sort of dictionary is built that they use till they build it again.

I apologize in advance if this is not what you mean.

jazar
02-07-2005, 01:45 PM
Beautiful post, randfish.

I checked the site you are working for and noticed that http://www.avatarfinancial.com/rddetail.php?ANEWSID=16 has got broken meta tags (hyperlinks in them). Is it on purpose? I have never seen anybody adding hyperlinks in description tags before.

jazar
02-07-2005, 01:47 PM
It leads back to it being a separate build: the related words are predetermined for the live search, and some sort of dictionary is built that they use till they build it again.

Yes, something like that. And I would not be surprised if this dictionary is actually built manually by a bunch of students (ok, just kidding, I'll stop).

jorock
02-07-2005, 01:54 PM
Yes, something like that. And I would not be surprised if this dictionary is actually built manually by a bunch of students (ok, just kidding, I'll stop).

This "separate build" is the point I've been trying to make since my first post. I didn't think of checking other countries.

It takes almost a week to finish, would this help explain the "too expensive" factors?

Again, I'm not saying it's any particular algorithm, just trying to prove related terms are weighted.

search10
02-07-2005, 02:10 PM
Related terms are weighted, and have been for quite some time. There isn't anything new this week about that, even if the degree has been upped, which I doubt.

I'm sure people who have noticed this used to have oddball results like the Lava Beds ones ranking well for beds searches. Other examples are cities or unrelated government organizations that have a keyword as part of their name. These used to rank well, but are generally nowhere these days.

jazar
02-07-2005, 02:29 PM
Related terms are weighted, and have been for quite some time. There isn't anything new this week about that, even if the degree has been upped, which I doubt.

doesn't bring any water to the problem here :-(

the index of related keywords seems to be exclusively english (only one available : try ~lit (=bed) on google.fr, only return english stuff).

If I read your post microsoft, and take it as granted, I am going to ask my French client to start spending their precious time using related keywords to optimise their site. But since Google cannot even provide French related keywords in the first place through the ~ operator, I doubt that it would be useful to do that.

what do you think?

jorock
02-07-2005, 02:53 PM
doesn't bring any water to the problem here :-(

the index of related keywords seems to be exclusively english (only one available : try ~lit (=bed) on google.fr, only return english stuff).

If I read your post microsoft, and take it as granted, I am going to ask my French client to start spending their precious time using related keywords to optimise their site. But since Google cannot even provide French related keywords in the first place through the ~ operator, I doubt that it would be useful to do that.

what do you think?

Does the ~ work in any other languages?

Probably rules out the students doing it, lol.

jazar
02-07-2005, 03:22 PM
doesn't look like it anyway. I have heard that they had asked our student Nacho to do the spanish one, but he was too busy with his drawings.

jorock
02-07-2005, 03:47 PM
Related terms are weighted, and have been for quite some time. There isn't anything new this week about that, even if the degree has been upped, which I doubt.

I'm sure people who have noticed this used to have oddball results like the Lava Beds ones ranking well for beds searches. Other examples are cities or unrelated government organizations that have a keyword as part of their name. These used to rank well, but are generally nowhere these days.

So, you think it's part of quality control too?
Interesting.

The ~ and related words were introduced around Florida, would that explain some of the ranking drops after that? The sites had no synonyms?

Not to get into a Florida discussion and go way off topic, I believe it was a backlink over-optimization filter, but if it was combined with the related words, that would make it more devastating. ( Like it was )

sweat
02-07-2005, 03:47 PM
I just read this entire thread and though parts flew over my head on first read, I got the basic gist and have some feedback.

I'm looking for a new job involving investment communications, investment marketing, or investment web site content. An asset management firm would be a good fit for me, hence I'm looking at mutual funds. Yesterday, on Google, I did a search for "mutual fund jobs" and "mutual fund careers". For both search terms, I would have expected to see a lot of job/career sites and pages.

Instead, the first 10-20 results were littered with results for non-employment sites and pages. Although there were some employment sites/pages, I was surprised at how many that weren't.

It seems to me that if LSA were being used, sites like JobsintheMoney.com or at least something like the employment section for Fidelity would have comprised the first ten results.

I'm seeing an increased lack of relevancy for other key phrases as well. IMO, if it is being used, there is a very low weighting being attached to it in Google's algo.

jorock
02-07-2005, 04:20 PM
I just read this entire thread and though parts flew over my head on first read, I got the basic gist and have some feedback.

I'm looking for a new job involving investment communications, investment marketing, or investment web site content. An asset management firm would be a good fit for me, hence I'm looking at mutual funds. Yesterday, on Google, I did a search for "mutual fund jobs" and "mutual fund careers". For both search terms, I would have expected to see a lot of job/career sites and pages.

Instead, the first 10-20 results were littered with results for non-employment sites and pages. Although there were some employment sites/pages, I was surprised at how many that weren't.

It seems to me that if LSA were being used, sites like JobsintheMoney.com or at least something like the employment section for Fidelity would have comprised the first ten results.

I'm seeing an increased lack of relevancy for other key phrases as well. IMO, if it is being used, there is a very low weighting being attached to it in Google's algo.

Maybe you were getting results from the datacenters that weren't done yet.

I'm seeing at least 3 jobs pages in the top 10 for both terms...
Some of them have the words in alt tags and things, high PR probably lets them rank for just about any other term along with mutual funds.

mutual funds jobs
mutual fund careers

I agree it's not perfect, but mutual fund(s) by itself has a lot of natural "brute force" optimization. High PR sites with quality link text to match from related sites. ( localrank )

With exact match "", it's much more accurate.

Debunked
02-07-2005, 05:08 PM
Not to get into a Florida discussion and go way off topic, I believe it was a backlink over-optimization filter, but if it was combined with the related words, that would make it more devastating. ( Like it was )

I want to kill the next person who says over-optimization!!

Isn't that an oxymoron?

If you had a backlink problem of having too many, then it isn't OPTIMIZED.

Sorry, nothing personal jorock, I just wish someone came up with a different way of saying this....

Back on the tilde~ look at the cached results and you will see that it has to do with an inbound link with that word but then has the LSA derived words on the page. interesting.....

orion
02-07-2005, 05:18 PM
At this SEW thread http://forums.searchenginewatch.com/showthread.php?p=33490#post33490 I mention some limitations and drawbacks of LSI and current promising work in other areas. I forgot to include IS (Information Space) but that's coming.

Cheers


Orion

jorock
02-07-2005, 05:38 PM
I want to kill the next person who says over-optimization!!

Isn't that an oxymoron?

If you had a backlink problem of having too many, then it isn't OPTIMIZED.

Sorry, nothing personal jorock, I just wish someone came up with a different way of saying this....

Back on the tilde~ look at the cached results and you will see that it has to do with an inbound link with that word but then has the LSA derived words on the page. interesting.....

It could be 2 scenarios here...

google's default message is "links to the page" when they can't find the term, but it's actually on there, in alt tags, meta's or title.

Or, it's one possible scenario of implementation, goes back to bakedjake's post I think, ( related words ) on page, target phrase in links.

I understand on the OOP thing, bad optimization is more like it, but sites that didn't optimize at all were hit.
www.madstage.com/Companies/Strollers.html
ranked for strollers, dropped during Florida, came back when they lightened up on the filters.
( All their links said strollers )

to stay on the ~ thread...
It's back in the rankings for strollers, with no related words on the page. That might rule out "related words" being used for quality control, like Microsoft said above, unless it's an anomaly, like the lava beds.

jorock
02-07-2005, 06:57 PM
We should stop using algorithm names, except in theory, we can't prove what it is they are using.
There's a lot of evidence they aren't using LSI, I think Xan and Orion have made that pretty clear.

But, there's a lot of evidence they are using some form of "related words" as factors in the algorithm.

Can the evidence be explained by something other than "related words"?

xan
02-07-2005, 07:11 PM
"Xan, have you got integration testers working for you? They will tell you that they are not interested (at least not during working hours) in knowing how clever the code is. They just know that the application doesn't behave the way it should be - they can reproduce this behaviour, and ask you to fix it against functional requirements."

Yes I do. They return full documentation on every test, and of course they do understand how the thing works, although not in precise detail, because we work closely together and we have to - they never test without me or a senior researcher.
(ironically Large Scale Integration is LSI)

"whatever google does, and regardless the amount of effort they put in improving the quality of the application, there will be always ways to hack it."

I wouldn't be so sure. There are other ways of dealing with these things.

"functional requirements for search engines are quite simple: provide for any query the most relevant result to the user."

This pretty much rounds up what the research community has been trying to do since the 1950's.

I know the difference between spammers and legitimate SEO.

"It is a nice effort to try educating the mob, but as long as you don't provide with a simplified tutorial introducing to complex concepts, you will only build a community of people where most of them "think" they are in the know, and start using words & concepts they don't even understand themselves to try justifying their SEO title (Before I get a return on this, I am not targetting anybody here ."

Well then maybe my science babble is already a factor in my bet. I never intended to insinuate that you knew nothing. I thought I already did simplify the concepts as well. Others seemed to grasp it, some very well.

You are right about SEOs using terms they don't understand, but basing their business on a method that is painfully out of date is a shame. Being on top of the new methods helps people gain insight into how things get done.

I mean when did you read the PageRank algorithm paper?

xan
02-07-2005, 07:13 PM
Guys,

I do realise I'm not an SEO and you might think I have no business being here.

I can keep quiet and watch so you can discuss things amongst you without me butting in every time.

If you want to know something, ask. Happy to help.

orion
02-07-2005, 07:15 PM
Thanks, jorock.

I was away from Internet access over the weekend when this thread evolved into many dissimilar lines of thought.

In addition to LSI being computationally expensive and having many drawbacks (for instance, it fails with terms having different meanings), one can see that what works in a computer lab is often not ready for prime time on the commercial Web.

I think many above are mistaking the general phenomenon of synonymity association or word association in general with LSI. There are many metrics that account for associations which do not require LSI at all. On-topic analysis, for instance, can be used for term associations, co-occurrence and disambiguation, and is not that computationally expensive.

About Information Space (IS) and LSI, here is a good reference
http://trec.nist.gov/pubs/trec9/papers/newby-t9.pdf


Orion

rustybrick
02-07-2005, 07:15 PM
Xan, it's seriously an honor to have you spend time here. So butt in all you want.

jorock
02-07-2005, 07:20 PM
Guys,

I do realise I'm not an SEO and you might think I have no business being here.

I can keep quiet and watch so you can discuss things amongst you without me butting in every time.

If you want to know something, ask. Happy to help.

I appreciate your feedback, and don't mind the corrections on terminology.
We've presented a lot of evidence, though it's spread pretty thin across all the posts.

Do you think we can rule in or out, the use of some form of "related terms"?

Maybe you can help us with the scientific process of what to do with the information.

I'd like to prove it, or disprove it, and move on.

xan
02-07-2005, 07:35 PM
Thank you guys.

Much appreciated.

Related terms - yes. Just not LSI, that's all. As I explained before, it's a more complex method. Using semantically relevant terms will help, like links in and out help. Try having a look at the non-related terms either side of target terms. Look at the brief explanation of the different methods and see if any make sense in your case.

Personally, I like to use all the patterns in the text and use statistical methods. Semantic relations are like a virtual hunch if you like.

My fellow lab colleagues watching me write this think I have entered the lion's den and will be eaten alive. I am happy to announce that I am alive and well, and that this community is intelligent and perfectly open to cooperation.

I think my bet might come off.

jazar
02-07-2005, 07:44 PM
I do realise I'm not an SEO and you might think I have no business being here.

I started as an integration tester, and it was kind of a war between testers & engineers. Go to the big corps like Microsoft or Motorola: they make sure of that, because it is the key to success. Make sure that both sides challenge each other, and you obtain the best results.

Now, if I saw something strange that I could repeat but could not quite build a test case for, I would go upstairs and ask one of the engineers who had worked on the code to give me some brief clues about what was going on... not to hand me theory books.

Unfortunately, we cannot quite go upstairs and ask a Google engineer who is working on the ~ operator to tell us how it works, and whether there is any connection between these results and the results shown in the main index.

So we ask you guys :D , please provide us with your own interpretations of the observations, instead of directing to papers & other references.

xan
02-07-2005, 07:50 PM
Not to sound repetitive and heavy going, I still think that it's important to stay on top of the new developments.

PageRank was around for a little while. Might have been good to read it back in 1998. The research community is full of little Brins and Pages. See if you can spot them :)

jorock
02-07-2005, 08:07 PM
I'll try to isolate things and stick to specific questions.

Do you think it makes sense that the "related words" are a separate build?

They are predetermined and put in some sort of dictionary?
( It's not live calculations of related words, just a lookup? )

The evidence I have of a separate build is...

During the build...

1. The allintitle's don't match
allintitle:antivirus software
pandasoftware.com is gone, its title says "panda software antivirus"

2. filenames are case sensitive
(antivirus will be highlighted, but Antivirus isn't)

3. Total search results for any query are down by a large percentage.

4. Sites that make heavy use of synonyms don't rank well until the above factors "normalize"

5. If the "dictionary" is in the ~ command...
The results are the same for the ~ command in different countries.
They are English only.

6. I don't think the "dictionary" itself changes very often; it is just the rankings for sites based on it that are the separate build.
( The ~ terms for antivirus have been the same for 6 months )

Please forgive any terminology or spelling errors in advance ;)

jorock
02-07-2005, 08:18 PM
Not to sound repetitive and heavy going, I still think that its important to stay on top of the new developments.

PageRank was around for a little while. Might have been good to read it back in 1998. The research community is full of little brin's and Page's. See if you can spot them :)

I plan on reading everything everybody has posted, I'm just trying to rule things in or out before this thread gets any more out of control than it is ;)

jazar
02-07-2005, 08:20 PM
PageRank was around for a little while. Might have been good to read it back in 1998. The research community is full of little brin's and Page's. See if you can spot them

Quite ambitious Xan. Have you got a revolutionary algorithm to sell? Is this forum really the best place to spot new talents do you think? mmmmh ... don't think so. I would not go to a brothel to find miss world.

jazar
02-07-2005, 08:28 PM
The evidence I have of a seperate build is...

you forgot this one: same results for different pools of pages (example: google.com & google.co.uk / UK pages only).

xan
02-07-2005, 08:29 PM
Quite ambitious Xan. Have you got a revolutionary algorithm to sell? Is this forum really the best place to spot new talents do you think? mmmmh ... don't think so. I would not go to a brothel to find miss world.

I don't sell, I research. I am not looking for talent. I don't care about miss world. There are many promising scientists out there all publishing.

I'm not sure what you mean really.

jorock
02-07-2005, 08:30 PM
you forgot this one: same results for different pools of pages (example: google.com & google.co.uk / uk pges only).

Thanks. I'll add it in.

jazar
02-07-2005, 08:58 PM
Sorry Xan, didn't come out right.

The research community is full of little brin's and Page's. See if you can spot them

I meant, when Brin & Page started thinking about selling their algorithm (and they were researchers too), they didn't post an ad on a forum; they went to meet people from AltaVista, Lycos, etc. to ask if they were ready to invest in their PageRank algorithm.

And if I remember, AltaVista & co. failed to recognise the value of PageRank then. I don't think I would spot the new Page & Brin today, especially if they don't come to introduce me to their findings (and I don't see why they would) ... so ... I don't really feel concerned by all of that.

OK, what do you think of jorock's observations?

xan
02-08-2005, 06:18 AM
The two Google founders were Stanford University graduate students in computer science in 1995.

Yahoo! founder David Filo suggested the two grow the service themselves by starting a search engine company. (oh the irony!). It was Andy Bechtolsheim, a founder of Sun Microsystems who gave them their first check.

Of course they didn't post an ad on a forum; no researcher needs to, as the research community is tight enough to breed its own babies. The investments happen through networking. All research needs funds. I'm not interested in a company; I am happy as a researcher, and would be content to leave a legacy for others to work on :)

I did make my reasons for being here clear; it's only fair that I should, after all. What I meant by looking for the next Brins and Pages was to watch for new papers and patents.

Ok...next!!!

I am not sure what you mean by "build" really.

During the build...

1. The allintitle's don't match
allintitle:antivirus software
pandasoftware.com is gone, it's title says "panda software antivirus"

I have no idea, I haven't been watching this. To me it makes sense to have Panda Software Antivirus - don't know, not sure I understand - sorry!

2. filenames are case sensitive
(antivirus will be highlighted, but Antivirus isn't)

I am seeing both antivirus software and Antivirus Software highlighted. (I would not make it case sensitive personally)

3. Total search results for any query is down by a large percentage.

No idea what it was before, sorry!

4. Sites that make heavy use of synonyms don't rank well until the above factors "normalize"

Not sure at all what you mean - again, sorry!

5. If the "dictionary" is in the ~ command...
The results are the same for the ~ command in different countries.
They are English only.

Not seeing the synonyms in German or French. Most likely because the base algos are very different for each language. English is a very basic language, grammar-wise, so it's been much easier to research with and develop. Chances are the foreign language system isn't complete. I know we have enough trouble with it. We have a whole other lab working on that part. I wouldn't say it's a dictionary method.

6. I don't think the "dictionary" itself changes very often, just the rankings for sites based on it is the seperate build.
( The ~ terms for antivirus have been the same for 6 months )

Dictionaries and straight pattern matching never really change. They're stable. If anything changes, it's the method.

That's all I can say.

xan
02-08-2005, 06:28 AM
Thanks, jorock.

I was away from Internet access over the weekend when this thread evolved into many disimilar lines of thoughts.

In addition to being computationally expensive and the many drawbacks of LSI (for instance, it fails with terms having different meanings), one can see that often what works on a computer lab not necessarily is ready for prime time on the commercial Web.

I think many above are mistaking the general phenomenon of synonymity association or word association in general with LSI. There are many metrics that account for associations which do not require LSI at all. On-topic analysis, for instance, can be used for both term associations, co-occurrence and for disambiguation and is not that computational expensive.

About Information Space (IS) and LSI, here is a good reference
http://trec.nist.gov/pubs/trec9/papers/newby-t9.pdf


Orion

I agree with you Orion, except on one point: LSI doesn't work in a test environment (lab) either.

And you guys, this will help you decipher the language in scientific papers ;)

Have a look, you'll like it. (http://smurman.best.vwh.net/soga/misc/research.html)

artax
02-08-2005, 10:10 AM
I personally don't think LSI is used in Allegra, or that the LSI knob is turned up.

First, I think Google uses mostly positive factors when indexing a page. Like if the keyword is in the title tag, multiply the title-weight score with X. Google also uses negative factors, like Spam Detection Thresholds reducing the overall score of a keyword dramatically.

I think if Google would use LSI it would use it as a positive factor.

I have seen one of my websites plummet from the top 3 for a keyword to around position 600, where the competition is only about 2000 pages. If it is a positive factor, then it would suggest that 597 sites have a better LSI factor/score. And *that* I find hard to believe.

Artax

general
02-08-2005, 10:30 AM
Xan- is this close to what you mean:
1) I search the top 50 SERPs for the phrase "poster frames"
2) I document 7 words to the left and 7 words to the right of each occurrence of "poster frames" in all 50 pages.
3) I statistically look for patterns/frequency of these contiguous words excluding the words "poster" and "frames" and stop words
4) I come up with new semantically related words like "prints", "movie", "design", "furniture", "display", "posters" that repeat frequently
5) cross reference these words against synonyms in WordNet
6) cross reference against clustering search engines (e.g. vivisimo, iclusty) for key phrase "poster frames"
7) look for "chunker" patterns (e.g. prepositional phrases)
8) write good relevant text/copy incorporating these related words and chunking patterns
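
Steps 2 to 4 above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tiny stop-word list, the two sample pages and the function names are invented here, and nothing in it reflects how any search engine actually computes related terms.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; real lists differ by corpus.
STOP_WORDS = {"a", "an", "and", "the", "of", "for", "to", "in", "on", "with", "is", "are"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def context_terms(pages, phrase, window=7, stop_words=STOP_WORDS):
    """Count words found within `window` tokens of each occurrence of `phrase`."""
    target = tokenize(phrase)
    n = len(target)
    counts = Counter()
    for page in pages:
        tokens = tokenize(page)
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] != target:
                continue
            left = tokens[max(0, i - window):i]
            right = tokens[i + n:i + n + window]
            for word in left + right:
                # Skip stop words and the words of the target phrase itself.
                if word not in stop_words and word not in target:
                    counts[word] += 1
    return counts


# Invented sample pages standing in for the top-ranked documents.
pages = [
    "We sell poster frames for movie prints and display cases.",
    "Custom poster frames: a display design for movie posters and prints.",
]
related = context_terms(pages, "poster frames")
print(related.most_common(5))
```

Changing `window` is the "7 words either side" knob; words that keep turning up across many pages are the candidates to cross-check in steps 5 and 6.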

Anything close to what you mean? Also, do you think SEs chunk for grammatical patterns, as found in the snippets?

I've been waiting for over a year to get into this kind of dialogue with someone who knows something!

jorock
02-08-2005, 11:07 AM
I agree with you orion, except on one point: LSI doesn't work in a test environment (lab) either.

And you guys, this will help you decifer the language in scientific papers ;)

Have a look, you'll like it. (http://smurman.best.vwh.net/soga/misc/research.html)

Lol, now I can at least have the correct terminology for excuses, thanks.

xan
02-08-2005, 11:13 AM
Well done general, that's the idea, definitely. Now whether you look at 7 words either side or a different number is something you have to establish for yourself. Try different ones and see where you get a decent collection.

As for the patterns, like prepositional phrases, if the query is very similar, then yes.

Be careful with stopwords; all lists are different. The standard list is WSJ (Wall Street Journal), but IMHO it needs updating. Mine change all the time depending on the corpus.
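
To make the "depends on the corpus" point concrete, here is a minimal sketch of one common heuristic: treat any word that appears in most documents of a collection as a stop word for that collection. The document-frequency cutoff, the function name and the sample documents are assumptions made for illustration, not the method described in the post.

```python
import re


def corpus_stop_words(documents, min_doc_fraction=0.8):
    """Return words that occur in at least `min_doc_fraction` of the documents."""
    doc_freq = {}
    for doc in documents:
        # Count each word once per document (document frequency).
        for word in set(re.findall(r"[a-z]+", doc.lower())):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    threshold = min_doc_fraction * len(documents)
    return {word for word, df in doc_freq.items() if df >= threshold}


# Invented mini-corpus about posters.
docs = [
    "the frame holds the poster",
    "the poster is on the wall",
    "a frame for the print",
    "the display of the poster",
]
strict = corpus_stop_words(docs)        # only words in >= 80% of documents
loose = corpus_stop_words(docs, 0.75)   # words in >= 75% of documents
print(strict, loose)
```

With the looser cutoff, "poster" itself becomes a stop word in this poster-heavy collection, which is the kind of corpus dependence a fixed list cannot capture.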

I don't like the clustering in the clustering engines out there. Not very accurate. Mind you for your purposes it is ideal, because you need a path to go down I guess, and that's not a bad one.

Forgive me, because sometimes I forget your aims are different from mine.

Welcome jorock!

jorock
02-08-2005, 11:50 AM
I am not sure what you mean by "build" really.

During the build...

1. The allintitle's don't match
allintitle:antivirus software
pandasoftware.com is gone, it's title says "panda software antivirus"

I have no idea, I havn't been watching this. To me it makes sense to have Panda Software Antivirus - don't know, not sure I understand - sorry!

2. filenames are case sensitive
(antivirus will be highlighted, but Antivirus isn't)

I am seeing both antivirus software and Antivirus Software highlighted. (I would not make it case sensitive personally)

3. Total search results for any query is down by a large percentage.

No idea what it was before, sorry!

4. Sites that make heavy use of synonyms don't rank well until the above factors "normalize"

Not sure at all what you means - again, sorry!

5. If the "dictionary" is in the ~ command...
The results are the same for the ~ command in different countries.
They are English only.

Not seeing the synonyms in german or french. Most likely because the base algos are very different for each language. English is a very basic language, grammarwise, so its been much easier to research with and develop. Chances are the foreign language system isn't complete. I know we have enough trouble for it. We have a whole other lab working on that part. I wouldn't say its a dictionary method.

6. I don't think the "dictionary" itself changes very often, just the rankings for sites based on it is the seperate build.
( The ~ terms for antivirus have been the same for 6 months )

Dictionaries and straight pattern matching never changes really. Its stable. If anything changes its the method.

That's all I can say.

By "during the build" I mean the update. When results are unstable across different datacenters.

I'm just trying to point out consistency in the various datacenters.
There seems to be 2 fairly close matching sets of results during the update.
The allintitle's, result numbers down, are just indicators of the 2 different sets of results.


Check the various datacenters:
www.mcdar.net

Search for allintitle:antivirus software

Compare the datacenters where pandasoftware is in the top 10 for allintitle:antivirus software and the ones where it isn't.

1. The total results for any given query will vary drastically between the two.
2. Rankings are different.
( You don't need to see the past, the different result numbers are listed. )
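
The comparison described above can be done by hand, but a small script makes the grouping explicit. A minimal sketch, assuming you have recorded each datacenter's reported total and top URLs yourself; the datacenter names, the second and third domains and all figures below are invented placeholders, not real observations.

```python
def split_by_domain(results_by_datacenter, domain, top_n=10):
    """Split datacenters into those listing `domain` in the top N and those that don't."""
    with_domain, without_domain = [], []
    for name, data in results_by_datacenter.items():
        if domain in data["top_urls"][:top_n]:
            with_domain.append(name)
        else:
            without_domain.append(name)
    return sorted(with_domain), sorted(without_domain)


def average_total(results_by_datacenter, names):
    """Average the reported result count over the named datacenters."""
    if not names:
        return None
    return sum(results_by_datacenter[n]["total_results"] for n in names) / len(names)


# Hypothetical hand-recorded observations for "allintitle:antivirus software".
observations = {
    "dc-a": {"total_results": 1200000, "top_urls": ["pandasoftware.com", "example.org"]},
    "dc-b": {"total_results": 1100000, "top_urls": ["example.org", "pandasoftware.com"]},
    "dc-c": {"total_results": 400000, "top_urls": ["example.org", "example.net"]},
}

present, absent = split_by_domain(observations, "pandasoftware.com")
print(present, average_total(observations, present))
print(absent, average_total(observations, absent))
```

If the two groups really are two builds, the average totals and the rankings should differ consistently between them, not just on one query.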

to stick to observations...
The only thing I'm pretty sure this indicates is a 2 step process to the update.

I've seen this same thing for the past 4 months.

Note: The further along the update is, as it is now, the more datacenters have the allintitle's matching and higher result counts for queries. The 2 different result sets become harder to isolate.

I just want you to see the 2 different sets of results.

jazar
02-08-2005, 12:15 PM
1) I search the top 50 SERPs for the word "poster frames"
2) I document 7 words to the left and 7 words to the right of each occurence of "poster frames" in all 50 pages.
3) I statistically look for patterns/frequency of these contiguous words excluding the words "poster" and "frames" and stop words
4) I come up with new semantically related words like "prints", "movie", "design", "furniture", "display", "posters" that repeat frequently
5) cross reference these words against synonyms in WordNet
6) cross reference againast clustering search engines (i.e vivisimo, iclusty) for key phrase "poster frames"
7) look for "chunker" patterns (i.e- prepositional phrases)
8) write good relevant text/copy incorporating these related words and chunking patterns

ah, looks nice, even though ... It is clear that much additional work will be required before a complete understanding of the phenomenon occurs... :)

- why specifically 7 words on the left and right of each occurrence?
- why not take the keywords provided by the ~ operator directly instead of setting up your own process for digging out statistically related keywords (other than to provide a "universal method" for finding related keywords independent of the search engine)?
- Is there any proof that this method will produce ranking results for the selected keyword or key phrase?

orion
02-08-2005, 12:40 PM
Originally Posted by xan
I agree with you orion, except on one point: LSI doesn't work in a test environment (lab) either.

I disagree with you Xan.

LSA does work in a lab environment. Examples are discourse passages, text grading, work in the psychology field, etc. It also works in a lab environment in which small, very small, noise-free TREC collections are used.

The problem with LSA comes when it is applied to indexing documents, hence the common term LSI. When we deal with large collections, as commercial search engines do, it gets debunked. It simply cannot be applied successfully to large collections full of commercial noise.

Another problem inherent in LSA (or, if preferred, LSI) from its infancy is polysemy. Terms whose meaning depends on context are a problem, so it does not work for ambiguous terms, as pointed out by Prof. Bruce Croft (CS Chair, Univ Mass) (see the LCA paper).

There are many other reasons why LSI can fail. Despite all the hype, to me it is just another tool in the IR toolbox, no more, no less. There are other tools which succeed where LSI fails.
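
For readers who have not met the formulation, the textbook description of LSI (a general statement from the IR literature, not anything confirmed about any search engine) factors a term-document matrix with a truncated singular value decomposition and compares queries and documents in the reduced space. The symbols are the usual ones and are introduced here only for illustration: X is the t-by-d term-document matrix, k is the number of retained dimensions, q is a query's term vector, and the reduced vector for document j is the j-th row of V_k.

```latex
X \approx X_k = U_k \Sigma_k V_k^{\top},
\qquad
\hat{q} = \Sigma_k^{-1} U_k^{\top} q,
\qquad
\operatorname{sim}(\hat{q}, \hat{d}_j)
  = \frac{\hat{q} \cdot \hat{d}_j}{\lVert \hat{q} \rVert \, \lVert \hat{d}_j \rVert}
```

Both concerns raised in the thread are visible here: the decomposition has to be recomputed or patched as the collection changes, and each term gets a single position in the k-dimensional space regardless of how many senses it has.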

Orion

general
02-08-2005, 01:01 PM
Response to Jazar above:

The use of 7 words to the left and right of the target key phrase is arbitrary... I find that if I use fewer I do not get good expansion of related key phrases. I assume [total assumption] that going too far away from the key phrase is something Google would not do when they "tag" the location of each word on a page and calculate associated words [associated by location]. Just assumptions; as Xan says, "tweak to find your own sweet spot".

I use the ~ for very loose guidance. Again, this is very unscientific, but the first thing I know is that Google bought Applied Semantics, whose paradigm was totally based on WordNet... and they would not waste hundreds of millions to throw WordNet out the window.... I have found major discrepancies between ~ results and WordNet results. I feel more confident leaning toward WordNet synonyms, thinking that this is the underpinning of Google's dictionary. Also, why would Google publicize via ~ their exact dictionary/basis for the public to spam against? Also, a lot of ~ synonyms are out of sense [sense = the different meanings a word can have].

I have no scientific proof that our higher rankings [and sometimes low rankings] are due to methodology depicted above... but just plain logical assumption that the SE's must go in this direction of looking at words contiguous to the target phrase to interpret the meaning of the target phrase via association. Is "java" an island, or is it a software? or is it a coffee? Logically, if an SE marks the locations for every word on page, it will have markings for surrounding words to "java" and look to these contiguous words to interpret what its meaning is. What about all the spammers who use the same exact keyword phrase in the anchor text 1000 times over... aren't these days over? Having good cohesive sentences with words of associative meanings is the best way for a SE to try to interpret true context?
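
The "java" example can be made concrete with a toy disambiguator that scores each sense by how many of its cue words appear near the target word. It is a sketch of the general idea of using neighbouring words only: the cue lists, the window size and the function name are all invented for illustration, and no claim is made that a search engine works this way.

```python
import re

# Hypothetical cue words per sense, invented for illustration only.
SENSE_CUES = {
    "island": {"indonesia", "jakarta", "travel", "volcano", "island"},
    "programming": {"code", "software", "class", "compiler", "applet"},
    "coffee": {"cup", "roast", "beans", "brew", "espresso"},
}


def guess_sense(text, target="java", window=5, sense_cues=SENSE_CUES):
    """Pick the sense whose cue words overlap most with words near `target`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    scores = {sense: 0 for sense in sense_cues}
    for i, token in enumerate(tokens):
        if token != target:
            continue
        # Words up to `window` positions to the left and right of the target.
        nearby = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        for sense, cues in sense_cues.items():
            scores[sense] += sum(1 for word in nearby if word in cues)
    best = max(scores, key=scores.get)
    # Return None when no cue word was seen at all.
    return best if scores[best] > 0 else None


print(guess_sense("Compile the Java class with a compiler"))
print(guess_sense("A cup of dark roast Java from fresh beans"))
print(guess_sense("I like Java"))
```

A page that repeats the bare keyword with no supporting context gives such a scorer nothing to work with, which is the intuition behind writing cohesive sentences around the target phrase.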

xan
02-08-2005, 01:03 PM
What I meant was that using LSI for this type of task has failed in labs as well. TREC is notorious for using small collections. I work on the HARD track and the QA track. It's a knowledge base, basically. We tested LSI way back simply to show it didn't work on dynamic corpora.

Of course using small stable data sets show that it can work, however we now rarely deal with those anymore.

I also wrote a long post about the shortcomings of LSI and alternative methods. It's now on my blog as well, below my AIML thing.

"LSI is used to address 3 problems: synonymy and polysemy (ambiguity). It arranges words into a concept space. Given all of the concepts retrieved, a set of documents can be retrieved. It also overcomes synonyms and noise."

Your best bet is Baeza-Yates for reference; he is the godfather. Few use LSI, unless it's part of something specific on a very stable corpus.

SIGIR is better than TREC imo because its the most popular and so attracts a different standard.

I think we're arguing the same end of the stick :)

orion
02-08-2005, 01:51 PM
Agreed.

I know Baeza-Yates, have had some communication with him, and have had his book since it was first published. I often quote it. I convinced my business partner to finally get his own copy.

Another problem with LSI is that it cannot grasp fractal semantics, regardless of document lengths.

Orion

xan
02-08-2005, 01:57 PM
Hehe...so basically we agree that LSI on dynamic web corpora is pants.

I also recommend Jurafsky and Hearst, really good. Any respectable computational scientist should have Baeza-Yates.

dejaone
02-08-2005, 02:38 PM
Brin and Page tried LSI in their first attempt to improve the precision of search engine results (which is documented in one of their papers on the Stanford website). It didn't work well, so they introduced PageRank. A variant of LSI is still part of the Google search engine. With 8 billion web documents, I doubt they use the original LSI algorithm, but more likely a statistical model that approximates the essence of LSI. Computational power won't be the reason that stops search engines from using certain algorithms. In large-scale systems theory (part of control theory), a complex system (represented by mathematical models) can be broken down into small, manageable pieces.

LSI plays an important role in calculating relevance and correlation. Based on my analysis, however, LSI isn't the major force behind the Feb. 2 SERP change. My theory is that Google introduced a few criteria to evaluate a site as a whole and factored those parameters into individual pages. I won't elaborate on this since it's kind of off the thread already.

orion
02-08-2005, 02:53 PM
I doubt they use the orinigal LSI algorithm, but likely a statistical model to proximate the essence of LSI.
Could you provide specific evidence or computational examples of this?

Orion

dejaone
02-08-2005, 03:51 PM
Just speculation. An example would be hard to find. For a gigantic database like a search engine index, there's no way around heuristic algorithms or statistical methods. I feel LSI would provide slightly better results than what Google offers now on some user queries.

One query I used to look at relevance is "increase hits". The term is mostly used by SEOs to refer to web traffic. The top 10 listings on Google include pages talking about "tuition increase hits student". Those two types of pages fall into two different clusters of page content.

orion
02-08-2005, 04:14 PM
Just spectation. The example would be harder to find. For gigantic database like search engine index database, there's no way to get around without heuristic algorithm or statistical methods. I felt LSI would provide a little better results than what google offer now on some of user queries.

One query I used to look at the relevance is "increase hits". The term is mostly used by SEO to refer to web traffic. Top 10 listings on google have pages talking about "tuitition increase hits student". Those two types of pages fall into two different clusters of page conent.

So mere speculation. You are mistaking query-triggered clustered results with clusters identified from non-queried static collections via LSI.

Your example can perfectly be explained in terms of on-topic analysis. No need for LSI at all.

When one uses very ambiguous queries LSI can fail. Here is where on-topics makes a difference. On-topic analysis allows one to pin-point different clusters of retrieved documents, from the immediate top N ranks, and from what we call the "bulk".

The phenomenon of different clusters triggered by queries can be observed in any search engine, even one that does not use LSI. Still, just because a cluster is triggered in one engine does not mean it has to be triggered in another engine for the same query terms.

In general, On-Topic Analysis can be carried out regardless of the database's nature and size, and whether or not LSI is used.

The following comment is not aimed at anyone in particular. My goal is to help as many SEOs in the industry as possible to become better educated on AI/IR: it is time for SEOs/SEMs to get a bit more educated on AI/IR issues. As I said to my business partner, if I get more scientists involved in this forum I have accomplished something positive.

Orion

xan
02-08-2005, 04:50 PM
My view exactly Orion.

Instead of fighting SEO manipulation all the time, why not get those who are legitimate to work with us. It certainly would make my job easier.

If things progress as they are, do you think a degree in cmp sci will be necessary??

dejaone
02-08-2005, 04:55 PM
I'm trained and experienced in mathematics, software engineering and information systems, not much of a marketer. Any attempt to explain the behavior of search engines is a black-box approach: we look at the outputs of a system to speculate about the internal mechanism of the box. The same outcome can be implemented and explained in different ways.

Even if Google implements LSI in the indexing process, that doesn't mean it will serve user queries without further processing.

Search engines do have roots in the IR field, but IR won't be able to fully explain how search engines work. For instance, IR is concerned with both precision and recall, while search engines are primarily concerned with precision (again, personal impression).

jazar
02-08-2005, 05:14 PM
So mere speculation

Is there anything which is actually not speculation here?

I quite like this way of thinking Dejaone. To make your partner reach the orgasm, you don't need to have a diploma in biology in order to explain the mechanism behind it. But a good sensitivity & basic knowledge of how it works will indeed help you find your way and please your partner.

I would go even further. If you try to calculate too much, you risk losing your sensitivity, may no longer read what your partner is expecting, and ... never reach the orgasm again.

Just replace partner with google and you get the picture (would be nice if nacho could draw a picture, I don't think Orion has done this one yet :D ).

xan
02-08-2005, 05:39 PM
SE's use precision and recall.


Precision = the fraction of retrieved documents that is actually relevant.

Recall = the fraction of relevant documents actually retrieved in answer to a search request.

IR systems find documents that are relevant to a user query.
IR uses mainly heuristics, as language is not an exact science.

IR encompasses data mining, web mining, digital libraries, data warehousing, ... Anything to do with a large virtual box of documents that needs sorting.

So yes, SE's have not only roots in IR but they are IR systems.

Precision and recall is how an IR system is evaluated.
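
A small worked example may help, using the standard definitions: precision is the share of retrieved documents that are relevant, and recall is the share of relevant documents that were retrieved. The document IDs and the function name are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set judged against a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Guard against empty sets to avoid division by zero.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Invented judgments: the system returned 4 documents, 5 are truly relevant.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d5", "d6", "d7"]
p, r = precision_recall(retrieved, relevant)
print(p, r)  # 0.5 0.4
```

Note that recall needs the full list of relevant documents, which is exactly what nobody has for the open web.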


Intro to IR (http://www.sics.se/~jussi/Undervisning/IRI_vt04/Overview.html)

dejaone
02-08-2005, 05:45 PM
Those who didn't say they have a Ph.D. may still have a Ph.D. in the field. Those who didn't say they're working on some project may still have developed billion-dollar software systems.

Introducing IR to SEO is healthy; thinking that IR is search engines is misleading.

jazar. Try what you're thinking on google. :)

xan
02-08-2005, 05:52 PM
Well search engines are information retrieval systems like it or not.

Many other areas of computer science meet in this field as well.

I don't understand the PhD part though.

Jazar ... matey ... as friendly as you have become with Big G, it has no cognition or sensory abilities, it understands maths and logic.

jazar
02-08-2005, 06:30 PM
never heard of the G point Xan? type G point in google, it is all chinese for him.

dejaone
02-08-2005, 06:33 PM
It's well known that there are four components of an information system: software and hardware, the problem it tries to solve, the users of the system, and the process of implementing the system. The similarity between IR systems and search engines ends at the software and hardware that run the systems.

The documents in search engines are different from the ones in typical IR systems, which are articles from peer-reviewed journals or newspapers. IR systems don't have to worry too much about the quality of documents. In a search engine, however, documents could be anything. Presenting relevant information isn't enough for a search engine; the results should also be of reasonable quality.

For typical IR systems, we don't have to worry about the impact of an update (system release or deployment) on users.

xan
02-08-2005, 07:51 PM
It's well known that there're four componets of a information systems - software and hardware, the problem it try to solve, users of the system and the process of the implementing the system. The similarity between IR systems an search engines end at software and hardware that run the systems.

the documents in search engines are different from the ones in typical IR systems which are articles from peer-viewed journals or newspapers. IR systems don't have to worry too much about the quality of document. However, they could be anything in a search engine. Presenting relevant information isn't enough for a search engine. they should have reasonable quality.

For IR typical IR systems, we don't have to worry about the impact of update (system release or deployment) on users.


Search engines are definitely information retrieval systems. The fact that they are in an environment which is dynamic makes the task of retrieving the correct documents much harder. I think it is very, very important for users of any IR system to have excellent results. What about medical databases with patient records in them? Collections like those the ACM houses are much easier to retrieve from because they are all in a particular format.
MEDLINE is also an excellent example of a large IR system in action.
Some systems, in my opinion, need to be even more precise because they are paramount, like medical records or military data. So that this can happen, they are marked up.

You're right to say a SE is a black box system, most systems are.

This is the family IR belongs to:

Operational information retrieval
Experimental information retrieval
Information retrieval definition
Data retrieval systems
Automatic document classification
Cluster based retrieval
Retrieval effectiveness
Document clustering
Automatic classification
(nabbed for tutorial notes)

I see what you mean about "the impact of update". You refer to the dynamic corpus. I agree, that's the challenge. But this does not make SEs a separate entity in the field. Weather databases, image retrieval, ... these are also dynamic, but they aren't as commercial.

W. Lancaster 'An information retrieval system does not inform (i.e. change the knowledge of) the user on the subject of his inquiry. It merely informs on the existence (or non-existence) and whereabouts of documents relating to his request.'

search engine -- (a computer program that retrieves documents or files or data from a database or from a computer network (especially from the internet)) [wordnet]

A program that indexes documents, then attempts to match documents relevant to a user's search requests. [princeton]

Wikipedia:

"Web Search Engines such as Google and Lycos are amongst the most visible applications of Information retrieval research."

dejaone
02-08-2005, 10:12 PM
I'd say a search engine is much more than IR. Even looking at the software part, a traditional IR system doesn't have a crawler. Part of SEO is getting your pages promptly crawled. I don't know of any IR theory that can explain the behavior of the crawler. From a broad perspective, the Web was perceived as the fourth medium (in addition to print, radio and TV) in 1995. Search engines are part of the Web.

We don't know whether SE will be a separate field; it's too early to tell. Google's VP of technology said 95% of search technology hasn't been developed yet.

dannysullivan
02-09-2005, 05:37 AM
Precision and recall is how an IR system is evaluated.

But it's not the best way to evaluate search engines, in my view.

We don't know what the "relevant documents" are to begin with. Relevant because they have the words on the page? Relevant because they don't use the words but links containing those words point at them? Relevant because while they lack the words we searched for, they are about what we want? Which definition do we go with?

In a lab, which is where we'd see people like TREC (http://trec.nist.gov/) try to evaluate systems, maybe getting all into precision versus recall makes sense.

Dealing with web search, it doesn't. You've got an environment where anyone can dump in content in a variety of formats, where people are both purposely and accidentally misleading search engines and every searcher has a subjective view of what they'd consider relevant for any particular query.

Search engines are IR systems, but using any kind of traditional IR metrics makes little sense to me. Instead, I've argued that you need a battery of non-traditional tests to try and determine this.

The What is relevancy? (http://forums.searchenginewatch.com/showthread.php?t=3606) thread we had here has a lot more info on that and how people have been trying to measure web search relevancy.

xan
02-09-2005, 10:17 AM
Agree with you Danny.

In labs we do use precision and recall, and that's one way of assessing results. We have lots of problems with exact ways of measuring success. The same goes for my field, which uses the Turing test. The problem with recall is that it assumes that there is an ideal data set.
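For readers new to the two metrics, here is a minimal sketch of how precision and recall are computed for a single query. The document IDs are made up for illustration. Note that recall needs the full set of relevant documents to be known in advance, which is exactly the "ideal data set" assumption mentioned above.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: documents the system returned
    relevant:  documents judged relevant (the "ideal" set recall depends on)
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document IDs, for illustration only.
returned = ["d1", "d2", "d3", "d4"]
judged_relevant = ["d2", "d4", "d7", "d8", "d9"]
p, r = precision_recall(returned, judged_relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

On the open web nobody can enumerate `judged_relevant`, so the recall figure is the part that breaks down first.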

"Because the engines perform their search on overlapping (but different) subsets of the web collected at different points in time, evaluation of search engines poses significant challenges to the traditional information retrieval methodology."
(Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval )

The main problem with search engines is that they return documents that are relevant to a user query, not answers to user questions.

SIGIR is probably one of the best conferences for IR, and as you say, together with TREC, relevancy is always a discussed issue. Metrics are used:

Evaluating IR systems in a dynamic env. (http://www.dcs.bbk.ac.uk/webDyn2/proceedings/barilan_evaluating_IR.pdf)

But automated classification, relevance sets (content of the retrieved results rather than by looking for pre-specified relevant pages), recall and precision based on binary relevance, a discount factor applied to the relevance scores, etc ... are also proposed solutions.

I suppose at the end of the day it's about what the user wants from a search engine and how well it is provided. Non-SEO's do not seem to complain about results much. I would imagine that a good test might be to get SEO's to judge results and non-seo's as well.

Traditional IR systems do have crawlers to update their index. Search won't be a separate field in my opinion, just like neural network systems will always be part of A.I.

dejaone
02-09-2005, 01:15 PM
AI vs. Expert Systems is a better analogy for IR vs. search engines.

randfish
02-09-2005, 01:16 PM
I suppose at the end of the day it's about what the user wants from a search engine and how well it is provided. Non-SEO's do not seem to complain about results much. I would imagine that a good test might be to get SEO's to judge results and non-seo's as well.

Xan,
I'm not sure about this - I hear regular users complain about search results all the time, but they have no public forum in which to do so where IR researchers might hear about it. They also have no idea of the possibilities of web search, and with nothing to compare it against, they simply put up with the results they get. If you were to show modern web searchers (which is now a majority of the US and other first world countries' populations) the kind of search technology that will hopefully exist in 5-10 years and have them compare it to current results, they would certainly be disappointed.

SEOs, like many techies, have intimate knowledge about the world of search, and therefore, higher expectations in general. Combine this with their bias, and you can see why complaints from the community are overbearing sometimes.

Even without the built-in SEO bias, however, many technically adept folks on the web are often disappointed. You'll notice the complaints of bloggers towards search as they grow more and more technically and web-adept.

Luckily, we have people like yourself who are working to fix this problem. And people like me and my colleagues who are (when we're at our best) working to make great content easy for you to deliver.

orion
02-09-2005, 03:03 PM
I used to explain the difference between SE and IR systems to my grad students with similarity and difference analogies. Here is one:

Like humans, some animals have 2 legs, but not all animals with 2 legs are human.

All SE are IR systems but not all IR systems are search engines.

So how do we reduce the technology divide between the two? There is one way: using query-sensitive similarity measures.

I have with me a copy of a relatively recent PhD thesis (2002) from a brilliant fellow in which query-sensitive clustering is used to address the so-called Cluster Hypothesis.

In general, using query-sensitive techniques (on-topic is one) involves the user and his/her perception of relevancy. This approach has advantages over traditional IR as currently implemented in lab settings. It is a promising approach.

Orion

xan
02-09-2005, 04:23 PM
Hey,

you're right randfish, users can't really address their thoughts like you guys can. Perhaps they aren't satisfied either. I have no idea really on that front. I do know that no one in research is satisfied anyway :)

I have worked in IR for quite some time, but I don't work directly in search. I work in a slightly different area, which uses many IR techniques in dynamic environments.

Orion, I like your post. I also have an analogy for students who have trouble with logic problems:

"If you call a dog's tail a leg, how many legs does the dog have?"

and I leave it like that for a while, we come back to it at the end of the session:

"4. calling a tail a leg doesn't make it one."

dejaone
02-09-2005, 05:30 PM
"A Search engine is a computer system". The statement is logically correct and it's also meaningless. The statement "a search engine is an IR system" (logically correct) does offer insights from ONE perspective, but it also misses many important aspects of the whole picture of the search engines.

xan
02-09-2005, 05:54 PM
"A Search engine is a computer system". The statement is logically correct and it's also meaningless. The statement "a search engine is an IR system" (logically correct) does offer insights from ONE perspective, but it also misses many important aspects of the whole picture of the search engines.

That's because it's a specific area of IR. Same as "genetic algorithm" isn't perfectly defined by "Artificial intelligence".

dannysullivan
02-10-2005, 06:34 AM
Non-SEO's do not seem to complain about results much. I would imagine that a good test might be to get SEO's to judge results and non-seo's as well.
I definitely hear regular searchers complain, but not as loudly or as often.

One reason is that for many of them, it doesn't occur to them that there's any forum they should go to.

A second reason is that I've seen some research in the past suggesting that they blame themselves. The search failed because they don't search very well.

Personally, I think a bigger reason you don't hear this so much is because often they find stuff that's good enough. It's not the best -- but they aren't an expert in an area, so they take what they get and are relatively happy.

I have this canaries in a coal mine metaphor I've long used when it comes to search health. SEOs and librarians are the canaries. If anything goes wrong in the search engine coal mine, they know it first.

Librarians, because they are often experts in areas, use search regularly and understand if something seems wrong.

But SEOs intimately know the area they want to rank well for. Yes, some who are new may judge everything based on, "My site's not there, this sucks." But others will know the difference between their site not showing up and the results just generally not listing a variety of sites that should be there.

It's unfortunate that the group that knows search engines the best have this self-interest conflict that naturally makes people want to dismiss their concerns. I don't mean that negatively -- it's just a fact that if you're an SEO, and you point something out, someone's going to immediately assume you're upset because you personally don't rank well.

hardball
02-10-2005, 08:29 AM
I definitely hear regular searchers complain, but not as loudly or as often.

Most businesses know that customers complain with their feet before opening their mouths. I've heard anecdotally that for every one person who complains, ten have just stopped being a customer. I wouldn't expect SE loyalty to be much different. If I can't find something obvious like a domain name (doh!), am I to feel like a twit or do I just go elsewhere?

eZeB
02-10-2005, 12:38 PM
General -- I've been doing something similar for the last couple of months -- looking at the clustering is interesting. I take all the browser visible text from the top 12 - 15 pages, ESPECIALLY directory pages, and also use ~ search, Google Suggest and Overture.

I use statistical patterns in the text overall, and don't pay attention to the surrounding 7 words. Eliminate adverbs, adjectives, articles etc. Make a short list, and calculate both c-index and a straight co-occurrence %. If a certain word appears on 65% of the pages with my keyword phrase then I want that word on my page, semantically related or not. I take a broad view and include words if in doubt -- extra words on the page don't do any harm.
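The "straight co-occurrence %" step described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the stop list, the tokenizer and the sample page texts are all made up, since the post does not say which words were dropped or how the text was processed. Only the 65% threshold comes from the post.

```python
import re
from collections import Counter

# Assumed stop list; the post does not say exactly which words were dropped.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "is", "for"}


def page_share(pages, min_share=0.65):
    """Return {word: share of pages containing it} for words at or above min_share."""
    doc_freq = Counter()
    for text in pages:
        words = set(re.findall(r"[a-z']+", text.lower())) - STOP_WORDS
        doc_freq.update(words)  # each page counts a word at most once
    total = len(pages)
    return {w: n / total for w, n in doc_freq.items() if n / total >= min_share}


# Hypothetical page texts standing in for the copied browser-visible text.
pages = [
    "Eating disorder treatment and recovery resources",
    "Binge eating disorder: treatment options",
    "Eating disorder referral and treatment center",
    "Support for people with an eating disorder",
]
shortlist = page_share(pages)
print(sorted(shortlist))
```

Counting each word once per page (rather than every occurrence) is what makes this a share-of-pages figure and not a raw frequency.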

A couple of observations (my experience only) :

1. the ~ search is unreliable and/or incomplete. Often it turns up words with very low c-index, and high c-index words are always omitted. Dozens of examples. Perhaps G has partially disabled the ~ search much as they have with the site command?

2. G is very sensitive to related keywords on the page. Too many and you go to the bottom.

3. Link text still rules. LOTS of top 10 sites show NO evidence of LSI or related words.

4. Before this update, LSIish SEO would get sites from #50 to #10, but if sites were already in the top 10 it wouldn't get them any higher. Now it will.

general
02-10-2005, 01:27 PM
Thanx eZeB for the observations.... please, if anyone else has "objectively" used related or semantically related words with success, let's hear it. The operative word is "objectively": you have not just sprinkled words on a page that you think are LSI'ish, but have a system for determining what words to use, how to apply them, and subsequently how to measure the method's success.

eZeB- I agree with using both words that are "semantically" related and "related by usage" [but not necessarily semantically related]. These "related by usage" words would be generated by (or result from) 3D vector analysis or an LSI algorithm, and therefore become semantically related. In my SEO world of "poster frames" the word "movie" is closely related, as in "poster frames for movie posters". "movie" has no semantic or synonym'ish relationship in a real-world dictionary, but research shows me it is "related by usage" and therefore probably in Google's Semantic Dictionary. I agree with you on throwing out articles and adverbs, but can go either way on "adjectives". The example above shows how "movie" is used as an adjective.... but a lot of adjectives are also very universal, almost like Stop Words.

When you say "browser visible text" are you referring to just the text in the SERP page, or text on the URL page less the metas?

What are you seeing in the link text these days-- keywords or related or both?

randfish
02-10-2005, 01:49 PM
2. G is very sensitive to related keywords on the page. too many and you go to the bottom.

4. Before this update, LSIish SEO would get sites from #50 to #10, but if sites were already in the top 10 it wouldn't get them any higher. Now it will.

eZeB - These are very dramatic, and somewhat conflicting (in my mind) statements. Can you explain this a little further and tell us why they aren't mutually exclusive?

eZeB
02-10-2005, 03:44 PM
General -- it is interesting to see the patterns of usage. Orion is the expert, not me! What I find is that words like HOME, LINKS, RESOURCES are found almost everywhere, which doesn't imply semantic connectivity. However, for a search like "eating disorder" it seems to me it is connected, since someone is going to search for "eating disorder resources".

Orion, I have read and re-read ALL of your stuff and you have mentioned this before and I respectfully disagree. Always interested in what you have to say tho...

General -- by browser visible text I mean text displayed in the browser on each of the top 10 or 15 pages. Although, I did read somewhere that the META keywords and description are sometimes used to give a 'general impression' of the site.

Randfish -- I have enjoyed your other postings. In the months before this update I tried numerous formulas with my short list of words. I found that when I stuff too many high c-index words in hot areas on the page, the site tanks.

Experimenting with 5 or 6 different sites, it appeared to me as though a site at #7 or #8 wouldn't budge any higher no matter what I did, whereas adding words from my shortlist gave other sites a big boost. After this update, and careful placement of words, a site that hovered in the #7 - 10 range is #5 out of 1.5 million.

Another site, also in the #7 - 10 range fell to #32 because I put too many words from my list. I removed many of the words and it is #3 out of 9.5 million with this update.

orion
02-10-2005, 04:11 PM
Hi, ezeb

A c-index is a normalized co-occurrence. For instance, for a 2-term query in AND mode it gives the fraction of retrieved documents in which the two queried terms co-occur, anywhere in the documents and without regard for order and proximity.

Co-occurrence suggests that the terms could be related; whether or not the two terms are synonyms is irrelevant. When the terms are indeed synonyms, co-occurrence is a measure of synonymity.

To confirm a semantic association between terms you need to conduct an on-topic analysis for the terms and the document(s).

Orion

orion
02-10-2005, 04:32 PM
eating = k1, n1 = 31,700,000
disorder = k2, n2 = 20,100,000
eating disorder = k12, n12= 4,470,000
c12-index = 94 ppt

Sounds well related to me.
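The post gives the counts and the result but not the arithmetic. A minimal sketch that reproduces the 94 ppt figure is below. Both the formula, n12 / (n1 + n2 - n12), and the reading of "ppt" as parts per thousand are inferred from the posted numbers and are not stated in the post, so treat them as assumptions.

```python
def c_index(n1, n2, n12):
    """Co-occurrence index in parts per thousand (ppt).

    n1, n2: result counts for each term searched alone
    n12:    result count for both terms in AND mode
    Assumed formula (inferred from the posted figures): n12 / (n1 + n2 - n12),
    i.e. the share of documents containing either term that contain both.
    """
    return 1000 * n12 / (n1 + n2 - n12)


# Counts as posted for "eating" and "disorder".
eating_disorder = c_index(31_700_000, 20_100_000, 4_470_000)
print(round(eating_disorder))  # 94
```

Subtracting n12 in the denominator avoids counting the documents that contain both terms twice.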

For a 3-term query in AND (FINDALL) you need to compute 4 different c-indices, which give you different information. If you want to convert this to a two-term query you need to declare two of the terms in EXACT (use quotes) within the FINDALL query mode. This, however, introduces a degree of ordering (and bias) in the computation.
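The post does not say which four c-indices are meant. One plausible reading, sketched here purely as an assumption, is the three pairwise indices plus one index for all three terms together, with the three-term denominator obtained by inclusion-exclusion. The term names and counts below are invented for illustration.

```python
from itertools import combinations


def ppt(both, union):
    """Share of `union` documents that are in `both`, in parts per thousand."""
    return 1000 * both / union


def three_term_c_indices(n, n_pair, n_all):
    """Assumed reading of the '4 c-indices': three pairwise plus one triple.

    n:      {term: count for the term alone}
    n_pair: {(term_a, term_b): AND count}, keys in sorted term order
    n_all:  AND count for all three terms
    """
    out = {}
    for a, b in combinations(sorted(n), 2):
        both = n_pair[(a, b)]
        out[(a, b)] = ppt(both, n[a] + n[b] - both)
    # Inclusion-exclusion: documents containing at least one of the three terms.
    union = sum(n.values()) - sum(n_pair.values()) + n_all
    out[tuple(sorted(n))] = ppt(n_all, union)
    return out


# Hypothetical counts, not taken from any search engine.
counts = {"a": 1000, "b": 800, "c": 500}
pairs = {("a", "b"): 200, ("a", "c"): 100, ("b", "c"): 50}
indices = three_term_c_indices(counts, pairs, n_all=20)
for terms, value in indices.items():
    print(terms, round(value, 1))
```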

Additional info will be presented at SES NY and possibly in other activities after the SES.

Orion

PS Results for Google. Correction: FINDALL is AND, I mistyped ANY for AND. One letter typo makes a huge difference. Sorry about that. Already corrected.

eZeB
02-10-2005, 05:05 PM
Thanks Orion -- always very interesting and somewhat humbling! I see your reply about 3 word phrases and recall the paper in which you give the equations.

What do you think about this tool -- http://www.googleduel.com ? Quite an interesting app. The results are quite different from the c-index. They aren't measuring connectivity but straight co-occurrence, is that right?

general
02-10-2005, 06:08 PM
eZeb- at the risk of your competitors looking in, I did a quick search of the top 200 resulting websites, over 1200 occurrences of the phrase "eating disorder", searching 9 words to the left and 9 words to the right of the target key phrase "eating disorder". Here are the results; let me know if they look related or synonymous:

1) binge= 522 occurrences; most frequent use is 444 occurrences left of target phrase by 1 word
2) people= 152 occurrences; most frequent use is 102 occurrences left of target phrase by 3 words
3) eating= 171 occurrences; most frequent use is 59 right of target phrase by 3 words
5) disorder= 127; 43 @L4
6) CENTER= 47; 12 @R1 AND 18 @R3
7) people= 43
8) treatment= 37
9) nervosa= 33
10) bulimia= 30
11) disorders= 30
12) referral= 27
13) causes= 27
14) person= 26
15) recovery= 25
16) anorexia= 25

I don't see "RESOURCES"; let me know what you think.
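The left/right window tally described above could be reproduced along these lines. This is a sketch, not the tool used in the post: the tokenizer and the sample sentence are assumptions, and the `L1`/`R3` labels simply mirror the "@L4" and "@R1" notation in the list.

```python
import re
from collections import Counter


def window_counts(text, phrase, width=9):
    """Count words within `width` positions left (L) or right (R) of each phrase hit."""
    tokens = re.findall(r"[a-z']+", text.lower())
    target = phrase.lower().split()
    k = len(target)
    totals, by_offset = Counter(), Counter()
    for i in range(len(tokens) - k + 1):
        if tokens[i:i + k] != target:
            continue
        for d in range(1, width + 1):
            if i - d >= 0:  # d words to the left of the phrase
                w = tokens[i - d]
                totals[w] += 1
                by_offset[(w, f"L{d}")] += 1
            if i + k - 1 + d < len(tokens):  # d words to the right of the phrase
                w = tokens[i + k - 1 + d]
                totals[w] += 1
                by_offset[(w, f"R{d}")] += 1
    return totals, by_offset


# Short sample text, for illustration only.
sample = ("What is binge eating disorder? People with binge eating disorder "
          "often eat an unusually large amount of food.")
totals, by_offset = window_counts(sample, "eating disorder")
print(totals["binge"], by_offset[("binge", "L1")], by_offset[("people", "L3")])
```

Because windows around nearby hits overlap, the same word can be counted once per hit, which is why totals can exceed the number of times the word appears in the text.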

general
02-10-2005, 06:15 PM
Orion- Help us simpletons

"poster frames"= 460,000
"movie"= 186,000,000
"movie" and "poster frames"= 621,000

621,000/186,460,000 = .003 (not very related??)

Did I do this correctly?

eZeB
02-10-2005, 06:50 PM
General -- I go to the page and CTRL A and then CTRL C so I get all the text and then process it without regard for proximity.

I have some excellent results, an evolving methodology, and I intuit a relationship between a number of factors but am having difficulty sorting it out.

Looking at the top 15 sites, (I excluded 4 for one reason or another) leaves 11 sites, and 8 of those have the word "resource" or "resources." That leads me to think it is an important word.

HOWEVER, looking at the sites quickly now, I see my select all technique includes link text and therefore does not differentiate between the sense of resources, as in links, which i don't want, and the sense of "eating disorder resources" which I want and is related to eating disorders in some way.

Why don't c-indexes stem?

eZeB
02-10-2005, 07:01 PM
General -- I see you got a pretty good list and I think I have all of those -- but there are a whole bunch of others with very high c-indexes, (depression - c-index=57.47) but I will have to re-calculate the c-index on the 2 word phrases after Orion's post. Back to the drawing board.

"People" doesn't seem to fit in ???

orion
02-10-2005, 07:12 PM
Orion- Help us simpletons

"poster frames"= 460,000
"movie"= 186,000,000
"movie" and "poster frames"= 621,000

621,000/186,460,000 = .003 (not very related??)

Did I do this correctly?

Hi, general.

C-indices can be used to determine the correct combination of terms and phrases in a document or database.

Use symmetry to target the required phrases as single keyphrases. Note that each phrase is declared in EXACT within an AND mode. The correct query target for paid ads and SERPS in documents is

k1= "movie poster", n1=2,620,000
k2= "poster frames", n2 = 462,000
k12 = "movie poster" "poster frames", n12= 103,000
c12-index = 35 ppt

Fairly connected. Right? Note the symmetry applied around the term poster. Symmetry must be applied on a case-by-case basis. It does not always work with poorly related concepts.

More on this at SES NY. Sorry.

Orion

general
02-10-2005, 07:14 PM
which makes them different from people with anorexia (another eating disorder in which the person does not eat).

Anorexia Nervosa Bulimia Nervosa Depression What is binge eating disorder? People with binge eating disorder often eat an unusually

What is binge eating disorder? People with binge eating disorder often eat an unusually

People with binge eating disorder often eat an unusually large amount of food and

be struggling with a common eating disorder called binge eating disorder. What Is Binge Eating Disorder? Lots of people find

For people with binge eating disorder, at first food may provide sustenance or comfort, but

programs are helpful for some people affected by binge eating disorder, children and teens should not begin a diet or

step organizations residential treatment centers churches, temples and synagogues eating disorder web sites Ask for people you can talk with

that probably affects millions of Americans. People with binge eating disorder frequently eat large amounts of food while feeling a

It goes on forever, 152 occurrences in top 200 google results... you can use the search tool to find these websites

general
02-10-2005, 07:29 PM
You are right, I loosened my parameters and came up with the following results:


binge eating disorder, Binge eating disorder, National eating disorder, sleep-related eating disorder, local eating disorder, common eating disorder, another eating disorder, disorders eating disorder, professional eating disorder, International eating disorder, child’s eating disorder, eating disorder Referral, eating disorder Information, eating disorder Recovery, eating disorder Awareness, eating disorder Meetup, eating disorder Center, eating disorder Survivors, eating disorder Eating, eating disorder symptoms, eating disorder Resources, eating disorder treatment

"Resources" does eventually show up [note: it is case sensitive, not that the search engine will care]

dannysullivan
02-11-2005, 06:06 AM
Note to those entering this thread not from the beginning. There are other threads discussing the recent Google changes from the perspective of non-LSI/LSA issues. For a guide, see What's Going On With Google: Feb. 2005 Update (http://forums.searchenginewatch.com/showthread.php?t=4153)

orion
02-11-2005, 01:46 PM
Some posts in this thread deal with keyword co-occurrence (KC) issues, not with LSI, and are a bit off-topic. We could split this thread, but it is a bit long. I answered some questions but feel future posts on KC are better served in the corresponding Keywords Co-Occurrence (http://forums.searchenginewatch.com/showthread.php?p=34140#post34140) thread. Let this serve as a pointer and lead for KC questions, so please let's try to stay on-topic.

As Danny mentioned, there are other threads that revisit many of the above issues and that are not related to LSI.


Orion

2much
02-12-2005, 10:07 PM
I'm confused.

"Play around with the tilde queries. "baby clothes", "infant clothes", "infant apparel" leads to some interesting results."

I got the top 30 results and used a tool that shows me comparisons. The cross-over is very small (only a couple of large sites, bizrate being one of them).
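The post does not name the comparison tool. A rough equivalent, sketched here with invented URLs on the reserved `.example` domain, reduces each result to its host and measures how many hosts two result lists share. Comparing hosts rather than full URLs is an assumption on my part; it lets different pages from one large site count as the same result.

```python
from itertools import combinations
from urllib.parse import urlparse


def host(url):
    """Reduce a result URL to its host so pages from one site compare equal."""
    h = urlparse(url).netloc.lower()
    return h[4:] if h.startswith("www.") else h


def overlap(results_by_query):
    """Return {(query_a, query_b): (shared hosts, shared / union)}."""
    hosts = {q: {host(u) for u in urls} for q, urls in results_by_query.items()}
    out = {}
    for a, b in combinations(hosts, 2):
        shared = hosts[a] & hosts[b]
        union = hosts[a] | hosts[b]
        out[(a, b)] = (shared, len(shared) / len(union) if union else 0.0)
    return out


# Hypothetical result lists; real ones would hold the top 30 URLs per query.
results = {
    "baby clothes": ["http://www.a.example/baby", "http://shop.b.example/", "http://www.c.example/"],
    "infant clothes": ["http://a.example/infant", "http://www.d.example/"],
    "infant apparel": ["http://www.e.example/", "http://www.a.example/apparel"],
}
pairs = overlap(results)
for queries, (shared, share) in pairs.items():
    print(queries, sorted(shared), round(share, 2))
```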

If LSI is part of the new algo, and is already live as some people are claiming, wouldn't it be applied with these 3 queries, as essentially they mean the same thing?

Adding the tildes to these queries does seem to show a compilation of the "top" authority sites in the "category". They seem to have combined a few, but still the overlap isn't huge.

I would love some more ideas so I can get "un-confused" , as this is puzzling me.

Nacho
02-12-2005, 10:57 PM
Don't worry Marcela, LSI is not confirmed and practically impossible to be part of the new algo, and the ~ results is just a command operator for proximity but nothing to do with LSI.

xan
02-13-2005, 10:33 AM
Agreed Nacho. I wouldn't waste too much time on it.

Michael Martinez
02-14-2005, 12:02 PM
Don't worry Marcela, LSI is not confirmed and practically impossible to be part of the new algo, and the ~ results is just a command operator for proximity but nothing to do with LSI.

There is increasing evidence for LSI or something like it in Google's service. Try this as a query: Ask a vague question, get an answer. Don't go to Google answers. Just browse the pages that come up in the SERPs.

Examples:

http://www.google.com/search?num=20&hl=en&lr=&newwindow=1&safe=off&c2coff=1&q=where+can+I+buy+something%3F

http://www.google.com/search?num=20&hl=en&lr=&newwindow=1&safe=off&c2coff=1&q=how+do+I+help+my+mother%3F

http://www.google.com/search?num=20&hl=en&lr=&newwindow=1&safe=off&c2coff=1&q=where+is+the+color+blue%3F

That last example is a little like tripping on acid, I suppose, but you get some interesting discussions of the color blue.

http://www.google.com/search?num=20&hl=en&lr=&newwindow=1&safe=off&c2coff=1&q=how+large+is+a+house%3F

http://www.google.com/search?num=20&hl=en&lr=&newwindow=1&safe=off&c2coff=1&q=do+dinosaurs+eat+mice%3F

Obviously, this question could be taken one of two ways: as a joke (since it is in the present tense and we know there are no dinosaurs today) or as a poorly constructed query about the behavior of the dinosaurs (when they lived). Google took it as a request for humor concerning dinosaurs and mice.

Of course, you can come up with plenty of examples where Google doesn't have anything relevant to offer:

http://www.google.com/search?num=20&hl=en&lr=&newwindow=1&safe=off&c2coff=1&q=Where+does+the+duck+quack%3F

Yes, the keywords "duck" and "quack" are found in the SERP results, but there is nothing really relevant to the question (which is grammatically sound but is, in the general community repository, a nonsense question). It would make a good title for a children's book, and if there were such a book, a reasonable person would expect it to come up in the SERPs. So far as I know, there is no such book.

http://www.google.com/search?num=20&hl=en&lr=&newwindow=1&safe=off&c2coff=1&q=is+sound+my+mother%3F

Here, there is no answer to the question, but Google attempts to provide pages which associate "sound" and "mother" in a meaningful way. That is, it isn't simply serving up pages that use both words. It is serving up pages which use both words meaningfully.

Whatever it is that they are doing, it still needs to be refined. For example, if you want to learn more about efforts to clone mastodons, don't use the following query:

http://www.google.com/search?num=20&hl=en&newwindow=1&c2coff=1&safe=off&q=replicating+a+mastodon&spell=1

The most relevant result in the top ten is the (as of this writing) seventh entry, which addresses DNA replication but only makes reference to a link which includes the word "mastodon" twice and leads to a site about mastodons.

On the other hand, if you were to use this query:
http://www.google.com/search?num=20&hl=en&lr=&newwindow=1&safe=off&c2coff=1&q=cloning+a+mastodon

You'd get immediately relevant results.

So, Google is doing something with the analysis of text (including text embedded in link URLs) and they are improving their performance substantially. A year ago, I wasn't seeing results like these. I find myself more and more frequently typing in questions at Google and getting relevant answers (a service Altavista pioneered years ago, but which seems to have fallen into the shadows).

As far as Google is concerned, the search is no longer simply about matching sets of keywords. Order of the words in the query is no longer important, as Google is now identifying concepts. They HAVE to be using something like LSI, even if it's not that sophisticated, in order to identify and match concepts.

Nacho
02-14-2005, 12:30 PM
They HAVE to be using something like LSI, even if it's not that sophisticated, in order to identify and match concepts.
I'm sorry Michael, but neither your examples nor anything I've seen so far is enough evidence to prove Google has adopted LSI.

Michael Martinez
02-14-2005, 12:49 PM
I'm sorry Michael, but neither your examples nor anything I've seen so far is enough evidence to prove Google has adopted LSI.

I expected as much in the way of response, since it is not possible to demonstrate from external evidence what Google does. We can only document behaviors and infer possible reasons.

However, my point is that Google has now moved beyond simple keyword and phrase matching.

They are identifying concepts and matching concepts. Anyone who reads their Web server logs should begin seeing a shift in the kinds of query strings which bring referrals to them from Google. The shift will depend on how extensive and flexible your content is. That is, the more you rely on spam and doorway pages, the less flexible your content is, and the less of a shift you should see.

Concept spamming will probably become a big topic by the end of this year, especially if Yahoo! and MSN follow in Google's footsteps (but it may take them at least another year to roll out comparable technology -- by which time, Google should be ready with the next generation).

zamolxes
02-14-2005, 12:55 PM
I agree - I'm amazed how fast this rumor has spread (LSI and the last update) without any real evidence! Nothing I have seen in this thread or anywhere else can really prove a substantial google LSI change. People love to speculate and come up with apocalyptic theories after every major Google update and they often are totally unfounded.
As written by others, the serps relevance has not really improved and the ~ queries are no different from months ago.

ferret77
02-14-2005, 12:58 PM
are you serious they just look like search results

the blue one returns a bunch of pages with the word blue on them

same with the buy something one

"Concept spamming will probably become a big topic by the end of this year,"

what?

Michael Martinez
02-14-2005, 01:40 PM
are you serious they just look like search results

What should one expect when running a query at Google?

the blue one returns a bunch of pages with the word blue on them

No, all the pages are DISCUSSING the color blue. Random results would look substantively different. Google identified the concept of "the color blue" in my query and provided me with a list of discussions about the color blue.

same with the buy something one

Right. I asked where I could buy something, and Google listed major merchandise providers. Not affiliate pages. Not sociological papers on people's shopping habits. Not random essays and journal entries on personal shopping trips. Not stories where some character says, "Where can I buy something?"

The point of these examples is that Google did not simply try to match keywords. Google identified concepts (where it could) and then served up results which were related to those concepts.

As I demonstrated, the system is not perfect, but it is vastly improved over a year ago.

"Concept spamming will probably become a big topic by the end of this year,"

what?

As people in the SEO community begin to recognize what Google is doing, they will begin developing content that targets concepts.

I have some ideas on how that might be done today, but I'm not going to share them. In a few months, if I have time to do some experimenting, I may share some insights into how the average Webmaster can use Google's concept engine to their advantage. I have no doubt SEO people will start using it to THEIR advantage, too.

orion
02-14-2005, 01:43 PM
They are identifying concepts and matching concepts...

Concept spamming will probably become a big topic by the end of this year...

In my view, the issue is not whether search engines are implementing concept recognition or conceptual matching/mapping. It is how it is implemented. LSI is only one implementation, with many pros and cons. There are plenty of other methods that can deliver, without the drawbacks of LSI. I mentioned this in another SEWF thread (http://forums.searchenginewatch.com/showthread.php?p=34422#post34422), but I feel this quote needs to be revisited carefully:

In the Local Context Analysis (http://citeseer.ist.psu.edu/cache/papers/cs/2875/http:zSzzSzwww.cs.umass.eduzSz~xuzSzlca.pdf/xu00improving.pdf) paper and discussed in the LCA thread (http://forums.searchenginewatch.com/showthread.php?t=2030), Bruce Croft (Distinguished Professor and Chair, Department of Computer Science and Director, Center for Intelligent Information Retrieval University of Massachusetts, Amherst) writes


2.2 Dimensionality Reduction

"….Despite the potential claimed by its advocates, retrieval results using LSI so far have not shown to be conclusively better than those of standard vector space retrieval systems. As with term clustering, word ambiguity is also a problem with dimensionality reduction techniques. If a query term is ambiguous, terms related to different meanings of the term will have similar reduced representations. This is equivalent to adding unrelated terms to the query."

While this was discussed in the context of query expansion and term discovery, it also applies to the above thoughts. LSI has been under criticism since its inception. It works well for individual documents and short collections, but fails miserably with gigantic collections filled with commercial noise.



Orion

Michael Martinez
02-14-2005, 08:00 PM
While this was discussed in the context of query expansion and term discovery, it also applies to the above thoughts. LSI has been under criticism since its inception. It works well for individual documents and short collections, but fails miserably with gigantic collections filled with commercial noise.


Orion

Well, I'm not going to get back into the question of whether Google is using LSI, as I have nothing new to contribute to that discussion. But the World Wide Web is hardly "filled with commercial noise". According to the recent studies I have looked at, the majority of content continues to be informational, non-commercial in nature (although the meaning of "informational" is not well-defined -- I suppose that is intended to include everything from your neighbor's description of her doggie's special bark to the typical teenage blog-speak random thought page).

orion
02-14-2005, 09:41 PM
FYI

Note I used the term "gigantic collections", not the Web (WWW). Note also the plural in "collections".

"Collections" refers to repositories, in this case search engine repositories, and yes, the big ones such as Google, MSN and Yahoo! are filled with commercial noise.

Orion

Michael Martinez
02-14-2005, 10:34 PM
FYI

Note I used the term "gigantic collections", not the Web (WWW). Note also the plural in "collections".

You said "gigantic collections filled with commercial noise".

Perhaps I should have asked what you mean by "commercial noise", but since you qualified "gigantic collections" with "filled with commercial noise", I was merely pointing out that the Web (or whatever the search engines have captured of it) is not "filled with commercial noise".

MOST of their content is NON-commercial in nature.

Unless your definition of commercial noise is something other than business Web pages selling services and products. A great deal of "business site" content consists of press releases, staff biographies, company histories, FAQs, and other informational pages. But there are also extensive archives of mailing list discussions, newsgroup discussions, and forum discussions, as well as term papers, reviews (of books, music, movies, tv shows, web sites, people, etc.), feature articles, statistical tables, etc., etc.

Universities, libraries, government agencies, non-profit organizations, churches, and other groups (and individuals) make available huge collections of documents which don't include any "commercial noise" (if I understand your use of the term correctly).

Now, if you are using "commercial noise" more broadly than I have inferred, then it would be helpful to understand what you are referring to, as I really have no wish to engage in a pointless disagreement where we end up realizing we are talking about apples and oranges.

orion
02-15-2005, 12:47 PM
Fair and good question.

"Major Google Changes: Latent Semantic Analysis?" is the title and topic of this thread. The LSI implementation as it relates to SERPs is the main concern of posters in this thread. The term "gigantic collections" was not meant to be taken for the Web, as you, Sir, wrongly inferred.

It is true that commercial search engines have sections of their collections with little or no commercial noise (churches, non-profits, gov sites, etc.). However, when you do a generic search from the default user interface of Google, MSN, or Yahoo!, you get results in which non-commercial results are tainted and mixed with commercial results. That is one of many reasons why an LSI implementation is a futile exercise. Full of noise, yes. That you can remove this noise by executing command searches or by querying a specific section of the engine (e.g., Google Scholar) is a different thing.

Now, since this thread is about whether or not Google is implementing LSI (or LSA if you wish), let's please stick to the subject. If you feel you have a point on other issues, please feel free to open a new thread for that purpose.

Orion

xan
02-15-2005, 03:31 PM
alright. Back on topic - Answer to the question: No.

bethabernathy
02-25-2005, 10:18 PM
No drops with my sites, just increases. :confused: and :)

jes1111
03-17-2005, 12:09 PM
:eek: I've read this thread with great interest and appreciation for the contributors. And, in trying to work out why my site dropped off the results altogether, I've been experimenting. My site is about vacations. So I tried some tilde searches.

Apparently, Google believes that "Florida" is a synonym of "beach" (search for ~beach). That might be logical, since "Palm Beach, Florida" would be a common phrase. So how come "Florida" is not a synonym of "palm"? Does this indicate that the synonym tables are actually built by hand? And that Google has been paid by Florida's Tourism Office to associate "Florida" with "beach" but not "palm"?

Clearly Google is doing something with "related words", even if the pedants insist that it's not actually LSI/LSA. Actually, mapping the results of all my tilde searches, I cannot conclude that the "association matrix" is machine-built. As others have said: too computationally expensive to have got such (generally) accurate associations. And yet there are some very odd associations. According to Google:


French and Italian are synonyms of Portuguese
buy is a synonym of rent
sales is a synonym of rentals

So perhaps a machine-built list subsequently edited by (inaccurate or biased) humans?
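
For what it is worth, a purely machine-built list can also produce this kind of lopsided result. The sketch below is an assumption-laden toy, not a description of how Google builds its tilde lists: the seven phrases, the Jaccard score, the 0.5 cutoff, and the function names are all hypothetical. It counts how often two words share a phrase and keeps only the pairs that score above the cutoff.

```python
from collections import Counter
from itertools import combinations

# Hypothetical phrases standing in for a crawled collection.
PHRASES = [
    "palm beach florida",
    "miami beach florida",
    "daytona beach florida",
    "beach vacation florida",
    "palm tree care",
    "palm pilot review",
    "palm springs california",
]


def association_scores(phrases):
    """Jaccard score per word pair: shared phrases / phrases containing either word."""
    doc_freq = Counter()
    pair_freq = Counter()
    for phrase in phrases:
        words = set(phrase.split())
        doc_freq.update(words)
        pair_freq.update(frozenset(p) for p in combinations(sorted(words), 2))
    return {
        pair: shared / (sum(doc_freq[w] for w in pair) - shared)
        for pair, shared in pair_freq.items()
    }


def related(word, scores, threshold=0.5):
    """Words whose association with `word` meets the (assumed) cutoff."""
    return sorted(
        other
        for pair, score in scores.items()
        if word in pair and score >= threshold
        for other in pair - {word}
    )


scores = association_scores(PHRASES)
print("florida ->", related("florida", scores))
print("palm    ->", related("palm", scores))
```

In this toy data "florida" appears next to "beach" in every phrase that contains either word, while "palm" spends most of its time in unrelated phrases (trees, handhelds, California), so "florida" clears the cutoff for "beach" and not for "palm". No hand editing is needed to get that asymmetry, though it does not rule hand editing out either.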

compar
09-11-2005, 04:00 PM
I realize that this is an old thread and I was just referred to it by a member on another forum.

I recently had a site returned from being in the sandbox -- it was sandboxed last January after having been well placed in the SERPs for a couple of years.

So please explain this in light of the LSI/LSA theory and the "~" search.

If you search on 'order medication online' my site wwwdotyourfriendlypharmacydotcom comes up #1 in the SERPs. But if you insert a "~" anywhere in the search string then you can't find my site in the top 50.

I agree that the "~" search appears to give all the words that Google is associating with the words in my keyword phrase, but if I'm number one due to LSI/LSA why don't I appear prominently in the "~" search?