Search Engine Watch
SEO News

Old 02-04-2005   #41
seobook
I'm blogging this
 
Join Date: Jun 2004
Location: we are Penn State!
Posts: 1,943
seobook is a name known to all
Quote:
Originally Posted by graywolf
So in a nutshell it's looking for the most unique pages
not necessarily the most unique page...but placing extra weight on a page which is well defined as relevant to the search query by its contents outside of the specific matching word.

Quote:
Originally Posted by graywolf
with the most links from pages with similar topics?
the related page stuff is more Hilltop related.


some of the semantic analysis that is done on the page content may also be done on the linkage data.

if most of your links are exact-match keyword-rich links with few variations or synonyms, then that linkage profile may not rank as well as a site that has an equal quality and quantity of links with more naturally mixed anchor text.

if your site name also contains your primary keywords and most of your links contain your site name or that specific keyword from it, then you could end up not only ranking badly for your primary keywords, but also ranking poorly for your site name.

for good karma, mix your anchor text
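a quick way to sanity-check your own profile: the rough Python sketch below assumes you already have a list of inbound anchor-text strings pulled from whatever backlink report you use (the function name and example numbers are made up for illustration).

Code:
from collections import Counter

def anchor_profile(anchors, target_phrase):
    """Summarize how concentrated a link profile's anchor text is.

    anchors       -- list of anchor-text strings (hypothetical backlink-report export)
    target_phrase -- the exact-match keyword phrase you are worried about over-using
    """
    counts = Counter(a.strip().lower() for a in anchors)
    total = sum(counts.values())
    exact = counts[target_phrase.strip().lower()]
    return {
        "total_links": total,
        "exact_match_share": exact / total if total else 0.0,
        "distinct_anchors": len(counts),
        "top_anchors": counts.most_common(5),
    }

# example: a profile dominated by one exact-match phrase
print(anchor_profile(
    ["cheap widgets", "cheap widgets", "cheap widgets", "Acme Widget Co", "click here"],
    "cheap widgets",
))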
__________________
The SEO Book
seobook is offline   Reply With Quote
Old 02-04-2005   #42
graywolf
Member
 
Join Date: Jul 2004
Posts: 13
graywolf will become famous soon enough
Quote:
but placing extra weight on a page which is well defined as relevant to the search query by its contents outside of the specific matching word.
So if a page was about apples, it would also expect to find words like trees, pies, and/or fruit?

Revisiting the unique issue, if you were to take a competitor's page and add enough extra stop words and different versions of his words that stem back to the same thing, would that make his page "look worse" from the algo's point of view?
graywolf is offline   Reply With Quote
Old 02-04-2005   #43
seobook
I'm blogging this
 
Join Date: Jun 2004
Location: we are Penn State!
Posts: 1,943
seobook is a name known to all
Quote:
Originally Posted by graywolf
So if a page was about apples, it would also expect to find words like trees, pies, and/or fruit?
those could potentially fit well. keep in mind that there is also another sense of apple, with computers, Mac OS X, iMac, etc.
Quote:
Originally Posted by graywolf
Revisiting the unique issue, if you were to take a competitor's page and add enough extra stop words and different versions of his words that stem back to the same thing, would that make his page "look worse" from the algo's point of view?
I doubt it. it's spread across many, many pages, and in well-developed communities a single page may not have much effect on the other pages unless it helps them trip a duplicate-content filter; but to do that you might need more PageRank than the page you are trying to delist.

it is not about the most unique page. it is about which page is the best match.
__________________
The SEO Book
seobook is offline   Reply With Quote
Old 02-04-2005   #44
Adam C
Courses for horses
 
Join Date: Jun 2004
Location: London
Posts: 49
Adam C will become famous soon enough
Quote:
Originally Posted by bakedjake
As a matter of fact, I'm seeing old sites now go INTO the sandbox.
Have seen similar movements.


I remember when the ~ operator was first released. In the same month a search for "search engine optimiSation" started returning results with "search engine optimiZation" spellings. I took it at the time that there was some kind of mild implementation of the ~ in the main index. I could of course be completely wrong, as is often the case.

Last edited by Adam C : 02-04-2005 at 08:00 AM.
Adam C is offline   Reply With Quote
Old 02-04-2005   #45
glengara
Member
 
Join Date: Nov 2004
Location: Done Leery
Posts: 1,118
glengara has much to be proud of
I've noticed some pages that target "SEO" terms are now appearing for "search engine optimization" ones.

AFAIK there are no links/anchor text using the full phrase, so if it's down to some sort of LSI, it seems to trump even anchor text.
glengara is offline   Reply With Quote
Old 02-04-2005   #46
Adam C
Courses for horses
 
Join Date: Jun 2004
Location: London
Posts: 49
Adam C will become famous soon enough
Just to clarify, what I was talking about was when the word "optimization" was first bolded in searches for "optimisation".
Adam C is offline   Reply With Quote
Old 02-04-2005   #47
hard target
Member
 
Join Date: Feb 2005
Posts: 14
hard target is on a distinguished road
Quote:
Originally Posted by bakedjake
... and you're not writing pages synonymous with your term that don't contain the term you're targetting, you're going to be in a world of hurt within the next 90 days.
Wouldn't this imply that when comparing the following two searches:
1. ~keyword -keyword
2. keyword

the high-ranking sites in #2 should also rank high in #1? This doesn't seem to be the case in the small sample I looked at; SERPs are still dominated by pages containing "keyword".
Did I misunderstand your post or does it have to do with timing --- "next 90 days"?
hard target is offline   Reply With Quote
Old 02-04-2005   #48
seobook
I'm blogging this
 
Join Date: Jun 2004
Location: we are Penn State!
Posts: 1,943
seobook is a name known to all
Quote:
Originally Posted by hard target
Wouldn't this imply that when comparing the following two searches:
1. ~keyword -keyword
2. keyword

the high-ranking sites in #2 should also rank high in #1? This doesn't seem to be the case in the small sample I looked at; SERPs are still dominated by pages containing "keyword".
Did I misunderstand your post or does it have to do with timing --- "next 90 days"?
many of the most relevant documents will also happen to naturally have occurrences of the keyword, so subtracting all of them out of the first set of search results (as in #1) may end up making that set significantly different from #2
__________________
The SEO Book
seobook is offline   Reply With Quote
Old 02-04-2005   #49
hard target
Member
 
Join Date: Feb 2005
Posts: 14
hard target is on a distinguished road
Quote:
Originally Posted by seobook
many of the most relevant documents will also happen to naturally have occurrences of the keyword, so subtracting all of them out of the first set of search results (as in #1) may end up making that set significantly different from #2
Agreed - so bakedjake's original statement seems a bit radical, doesn't it? Or did I misunderstand it?

BTW, I have great respect for bakedjake's posts in this and other forums; I just want to make sure I correctly understand what the post meant.
hard target is offline   Reply With Quote
Old 02-04-2005   #50
xan
Member
 
Join Date: Feb 2005
Posts: 238
xan has a spectacular aura about
Hi, I'm a computer scientist (PhD). I thought I would stick my nose in and clear some things up:

LSA is not new at all; it has been used, as someone previously stated, since the '90s. It's expensive, but it has been proven to work well. LSI is LSA - it's just another name for the same technique.

LSI is used to address three problems: synonymy, polysemy (ambiguity), and noise (punctuation and the odds and ends that make processing a pain). It arranges words into a concept space; given all of the concepts retrieved, a set of documents can be retrieved.

Newer methods have since been introduced, such as building an associative network from a corpus (Guy Denhiere). It is incremental, which is more plausible, and it takes higher-order co-occurrences into account when constructing word similarity. It can use different units of context, whereas LSA uses the paragraph.

LSA represents the meaning of a word as a vector, which is then used to calculate word similarity. It's not exactly rocket science, but it has proven effective and is still used. The text here is treated as linear. However, any new semantic representation means running the whole thing again.
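To make that concrete, here is a minimal sketch of the idea in Python - a toy corpus, raw term counts and an arbitrary dimension count, so treat it as an illustration of the mechanics rather than anything like a production system:

Code:
import numpy as np

# Toy corpus: each "document" is just a short string.
docs = [
    "apple pie recipe with fresh fruit",
    "apple trees and fruit orchards",
    "mac os x runs on the imac",
    "apple imac computers and mac software",
]
vocab = sorted({w for d in docs for w in d.split()})

# Term-document count matrix (N terms x M documents).
A = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        A[vocab.index(w), j] += 1

# SVD and rank-k reduction: keep only the k strongest "concepts".
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
word_vecs = U[:, :k] * s[:k]          # each row is a word in concept space

def similarity(w1, w2):
    v1, v2 = word_vecs[vocab.index(w1)], word_vecs[vocab.index(w2)]
    return float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))

print(similarity("fruit", "pie"))     # words sharing contexts should score higher
print(similarity("fruit", "imac"))    # compare against a word from the other "apple" sense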

Other methods exist for calculating word similarity, which I will not go into in detail but will briefly explain:

SRCR (sparse random context representation) - 2002
Each word is assigned a random vector, which is then updated with the vectors of co-occurring words.

WAS (word association space)
Words that appear in similar contexts are placed close together in the space.

LSA/LSI has major drawbacks:

The information is all numbers with no explicit semantic meaning, so it's hard to debug.

It uses the SVD algorithm, which is O(N^2 k^3), where N is the number of terms plus documents and k is the number of dimensions in the concept space. If the corpus is unstable and grows rapidly, it's infeasible; the SVD algorithm is unusable for a large, dynamic collection.

It's hard to choose the number of dimensions for the concept space. Nobody knows the optimal number to use.

Precision/recall improves and then decreases after hitting an optimal point, so, if you like, it's unstable.

Using SVD on a large collection which is dynamic is horrendously expensive.

LSI is slow because it uses a matrix method, Singular Value Decomposition, to create the concept space.

Popular methods include graph-based clustering and classification, statistics-based multivariate analyses (as well as latent semantic indexing: multi-dimensional scaling, regressions), artificial neural network-based computing (backpropagation networks, Kohonen self-organizing maps), and evolution-based programming (genetic algorithms).

As you can see, Google has quite a choice of methods, and I doubt that LSI would be the best one considering the task at hand. The Google algorithm is complex and uses many methods found in information retrieval, data mining, and A.I. It is very unlikely that one method as routine as LSI would be the main formula in this mathematical bundle. Also, this issue only addresses semantic similarity, not ranking, which I think is your priority.

Of course document similarity and topic detection are the main ways of returning relevant documents; however, there are so many ways to do this and none of them are straightforward. In fact, no one has yet found a stable way of applying methods that work very well in digital library collections to web data. The problem with data on the web is that it changes all the time. It's dynamic and unstable.

I keep a blog that deals with computing science methods, where I explain these; the topics are based on things I find in forums like this one, just to clear up any misunderstandings. I have no dealings with SEO, but I visit SEO forums to assess how far professionals have come in using search and understanding its techniques. I work in A.I. and computational linguistics. Sorry about the long post.

Berkey's explanation

search science

Last edited by xan : 02-04-2005 at 02:32 PM.
xan is offline   Reply With Quote
Old 02-04-2005   #51
orion
 
 
Join Date: Jun 2004
Posts: 1,044
orion is a splendid one to behold
Thanks

Thanks, Xan

Finally, someone is telling the truth about LSA and SVD. No news here. Good to have you in the SEWF.

On-topic analysis and co-occurrence theory can be used to explain the above, as well as term disambiguation. See you all at SES NY.

Orion

Last edited by orion : 02-04-2005 at 02:47 PM.
orion is offline   Reply With Quote
Old 02-04-2005   #52
randfish
Member
 
Join Date: Sep 2004
Location: Seattle, WA
Posts: 436
randfish is a name known to all
xan,

Thank you for joining and contributing. It's important that we have people like you to help us out - your contributions mean a lot and your willingness to share is commendable.

I (and probably many other members of SEW) would love to visit your blog and read some of your writings - would you share it with us?

Also, the alternate methods you mentioned, along with LSI/LSA, are all shooting for the same goal: using the index of the world wide web to calculate the relationships between words in order to better understand which concepts are related and which are not.

As I see it from an SEO perspective, our goal is similar (albeit for a different reason). We want to puzzle out which words and phrases are most semantically connected to one another for a given keyword phrase, so that as search engines crawl the web, they see that the links to our pages and the content within them are semantically related according to the other information in their database.

However, we have a big advantage. We don't necessarily have to use an algorithm to calculate this, and we're not concerned with computational expense. Why? Because we optimize individually for single keyword phrases - meaning we can devote an hour or 24 to finding the most connected keywords/phrases.

Let me propose an SEO method for discovery.

#1 Search for your keyword phrase @ Google
#2 Take the text from the top 100 search results and put them into rows in a table (remove stopwords)
#3 Analyze the most frequently occurring 1, 2, 3 & 4-word phrases among the documents (discount duplicate entries from a single row/page).
#4 Take the keywords that show up in 5+ entries and conduct a C-Index (Keyword Co-Occurrence) calculation for each.
#5 The results will give you the highest C-Index words/phrases for your particular term.

Xan, perhaps you can tell us if this is a good or faulty method.
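
If it helps, here is a rough sketch of what steps #2-#5 might look like in code. I'm assuming the C-Index is a simple co-occurrence ratio (pages containing both terms divided by pages containing either) - the exact formula may well differ - and that the page text has already been fetched, tokenized and stopword-stripped:

Code:
from collections import Counter

def ngrams(tokens, n):
    """All n-word phrases in a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def candidate_phrases(pages, min_pages=5, max_n=4):
    """Steps #3-#4: phrases of 1-4 words appearing in at least min_pages results."""
    doc_freq = Counter()
    for tokens in pages:                       # pages = list of token lists (step #2)
        seen = set()
        for n in range(1, max_n + 1):
            seen.update(ngrams(tokens, n))     # duplicates within one page count once
        doc_freq.update(seen)
    return {p for p, df in doc_freq.items() if df >= min_pages}

def c_index(term_a, term_b, pages):
    """Step #5: co-occurrence ratio of two terms across the result pages."""
    texts = [" ".join(tokens) for tokens in pages]
    has_a = {i for i, t in enumerate(texts) if term_a in t}
    has_b = {i for i, t in enumerate(texts) if term_b in t}
    both, either = len(has_a & has_b), len(has_a | has_b)
    return both / either if either else 0.0

You would then rank every candidate phrase by its c_index against the original keyword phrase and keep the top scorers as your most "semantically connected" terms.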

I wish I was going to see you all at SES (alas, it's far outside my price range).
randfish is offline   Reply With Quote
Old 02-04-2005   #53
seobook
I'm blogging this
 
Join Date: Jun 2004
Location: we are Penn State!
Posts: 1,943
seobook is a name known to allseobook is a name known to allseobook is a name known to allseobook is a name known to allseobook is a name known to allseobook is a name known to all
your search science link is missing the "members" part of the URL

http://spaces.msn.com/members/search-science/
__________________
The SEO Book
seobook is offline   Reply With Quote
Old 02-04-2005   #54
AIstudent
 
Posts: n/a
LSA - some bottom lines

Hello everyone,

What joy! I'm a final-year student of Artificial Intelligence and Psychology (in Edinburgh) working long hours on my Bachelor's thesis - which is about Latent Semantic Analysis. So for once in my life I feel I can contribute some knowledge :-)

First off, the LSA/LSI confusion: as far as I have seen, LSA and LSI are exactly the same thing. The technology is usually called Latent Semantic Indexing when it is used in an information retrieval context and Latent Semantic Analysis when it is used for language modeling or most other applications. I'll go with LSA from now on.

How does it work?
LSA works with a vector space representation of words (and documents). You can imagine every word as a point in space, except that the space is not 3-dimensional but usually has anything between 150 and 500 dimensions. Words which are more "similar" are closer together in this space. What type of "similar" are we talking about? Well, let's look at how LSA vectors are constructed. You start off with a bunch of M (say, 10,000) documents and a dictionary/vocabulary of N (say, 20,000) word-types from a large corpus (usually > 10m words). Now you build an NxM matrix where you count how often each word n (from N) occurs in a document m (from M). This is one hell of a fat matrix you end up with, but it contains some useful information: which types of words usually occur in the same types of documents. In a sense, you now have an M-dimensional vector describing each word in terms of where it usually comes up. The problem is that a) these vectors are large, and b) they are influenced by "noise" - maybe two words are actually quite similar but just by coincidence they rarely pop up in the same documents.
This is where the strange, mysterious beast of Singular Value Decomposition comes in. I don't fully understand it myself (don't tell my supervisor...) but SVD basically "shrinks" the vectors to a smaller size (e.g. from 10,000 dimensions to 100). The resulting (reduced) vector for a word now, in a sense, contains the "concentrated" semantic information about that word. The beauty of it is that after the SVD process two similar words (e.g. "coke" and "pepsi") have similar vectors, even if by coincidence they never occurred together, just because they have many "common friends", e.g. "drink", "cool", "beverage", "soft drink" etc.
Time complexity is roughly proportional to NxM, if I remember correctly.

Uhh, that was too much maths, what's the bottom line?
LSA calculates a measure of similarity for words based on occurrence patterns of words in documents and on how often words appear in the same context or together with the same set of "common friends"

Seeing is believing. Can I try it?
Go to http://lsa.colorado.edu/ and play around with the applications.

Where can I read up on it?
The papers at http://lsa.colorado.edu/ are a good start.
They are written by Psychologists, which makes them easier to read than those written by Computational Linguists :-) (use Google Scholar if you want to get your hands dirty with the formulas).

Could Google (or any other engine) use it?
There are technical and legal aspects to this.
Technical first:
Allowing search to use some LSA information about the keywords, to maybe consider some similar words, is simple. They just need a corpus (big G's 8,058,044,651 web pages should do fine for most purposes), a dictionary file, some standard SVD algorithms (like this), a few hours' time while it calculates, and some minor changes to their search mechanism.
But LSA allows for something much more sophisticated: just as every word can be represented as a semantic vector, so can every document be condensed into such a vector. This allows a judgement of how similar two documents are, way beyond just counting words. It even works across different languages (with a few tricks).
What's more, in LSA terms a "document" can be as small as a string of just a few words. So you can compare semantic similarity between documents and search strings (or other pages... or a few sentences copied & pasted from another site...). Think for a second how much you could do with that!
For this use of LSA, for a corpus the size of the WWW, you'd either have to have a REEEAAAAAL big machine, or a new way of doing SVD, or a system which does it for a small core of a few million documents and then "weaves in" all additional documents into the existing vector space. As I said, time and memory complexity are roughly proportional to NxM. Here M, the number of documents, being several billion (and N being at least many thousands), would be the critical factor.
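For what it's worth, the usual name for that "weaving in" trick is folding-in: a new document (or query) is projected into the existing concept space without redoing the SVD. Here's a rough sketch, assuming you already have the vocabulary list and the U and s factors from an earlier decomposition of a term-document count matrix:

Code:
import numpy as np

def fold_in(text, vocab, U, s, k):
    """Project a new document or query into an existing k-dimensional LSA space.

    Standard fold-in: d_hat = d^T * U_k * inv(S_k), where d is the raw term-count
    vector of the new text and vocab is the ordered term list of the original matrix.
    """
    d = np.zeros(len(vocab))
    for w in text.lower().split():
        if w in vocab:
            d[vocab.index(w)] += 1
    return (d @ U[:, :k]) / s[:k]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# e.g. compare a short query against document j's coordinates (column j of Vt's top k rows):
# q_hat = fold_in("a few sentences copied from another site", vocab, U, s, k)
# print(cosine(q_hat, Vt[:k, j]))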

Now, the legal constraints:
As far as I am aware, some aspects of LSA/LSI for information retrieval are patented (Patent Description) by the people who first worked with it (some of whom now work for companies using LSA).
So my guess is that, if Google (or anyone else) is interested in technology like this, they will either use some related approach which is not covered by the patent (I'm no lawyer, so no clue how easy that is), or get in touch with the patent holders.

Hope this helps,

Tobi


P.S. Hope this is no forum abuse... I will graduate soon and am looking for internships & jobs in this field. PM me if you can help
  Reply With Quote
Old 02-04-2005   #55
orion
 
 
Join Date: Jun 2004
Posts: 1,044
orion is a splendid one to behold

Please feel free to PM me with your resume. I'm looking for top AI staff. But in the future please post in the Help section, as this is a bit off-topic.


Orion
orion is offline   Reply With Quote
Old 02-04-2005   #56
millington
Member
 
Join Date: Jan 2005
Posts: 22
millington is on a distinguished road
Sudden reduction in Google referrals yesterday

I'm new to this so not sure if this is the right place to post, but here goes.

I have a website at www.construction-index.com which has steadily built up over three years to about 3,000 visitors per day via Google search engines (mainly .com and .co.uk). Then yesterday, February 3rd, the number of visitors via Google suddenly dropped overnight to about one third of its normal level (i.e. to only about 1,000 visitors a day). Some of my pages still come up in the first few results on the first page of Google; but many other pages, which used to come high on the first page, now appear on page three or four of the Google results.

Has anyone else had the same experience? Any suggestions as to what might have caused this? Any suggestions for remedial action please?
millington is offline   Reply With Quote
Old 02-04-2005   #57
xan
Member
 
Join Date: Feb 2005
Posts: 238
xan has a spectacular aura about
Thank you for the warm welcome!

#1 Search for your keyword phrase @ Google
#2 Take the text from the top 100 search results and put them into rows in a table (remove stopwords)
#3 Analyze the most frequently occurring 1, 2, 3 & 4-word phrases among the documents (discount duplicate entries from a single row/page).
#4 Take the keywords that show up in 5+ entries and conduct a C-Index (Keyword Co-Occurrence) calculation for each.
#5 The results will give you the highest C-Index words/phrases for your particular term.


Ok. I see where you're going with this, and it is a valid method. What you are talking about is using N-grams (which are strings of words in sequence) and calculating term frequency and then idf (inverse document frequency).

This will easily give you a hunch as to which terms are generally being used, but for a thorough analysis you would have to compile a corpus of sites relevant to yours.
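
If you want more than a hunch, the tf and idf part is only a few lines - a sketch, assuming the pages are already tokenized, lowercased, and stopword-stripped:

Code:
import math
from collections import Counter

def top_ngrams_by_tfidf(pages, n=2, top=20):
    """Rank n-grams by a simple tf-idf score across a small collection.

    pages -- list of token lists (already lowercased, stopwords removed)
    """
    def grams(tokens):
        return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    tf = Counter()                       # total occurrences across the collection
    df = Counter()                       # number of pages containing each n-gram
    for tokens in pages:
        g = grams(tokens)
        tf.update(g)
        df.update(set(g))
    N = len(pages)
    scored = {gram: tf[gram] * math.log(N / df[gram]) for gram in tf}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)[:top]

Note that plain idf zeroes out phrases that appear on every page, which for this exercise may be exactly the phrases you care about, so it's worth eyeballing the raw document frequencies as well.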

Using semantic fields is a pretty good method to discover which terms are related.

Do remember to always look at it from a technical point of view. Topic detection between sites is carried out in order to retrieve relevant sites, followed by computational linguistic methods to sort within that collection again, and then ranking methods are used to order these sites by relevance. That's a very basic model, but the ranking method will again use many computational linguistic methods.

So, looking at it from this point of view, it's important to make sure that your site is relevant and offers a large amount of information, as this encourages density early on. I think you all know these basic but very effective methods.

When I rank, I use similar methods, and I definitely get rid of all the noise, which will include websites that do not meet a certain threshold.

Sorry for going so off-topic here!! Basically, yes, it is a valid method.

(P.S.: also know which stopword list is best for you to use; you may even want to make your own - the WSJ list is used a fair bit)

Last edited by xan : 02-04-2005 at 04:08 PM.
xan is offline   Reply With Quote
Old 02-04-2005   #58
xan
Member
 
Join Date: Feb 2005
Posts: 238
xan has a spectacular aura about
Quote:
Originally Posted by seobook
your search science link is missing the "members" in its link

http://spaces.msn.com/members/search-science/
thank you seo book.
xan is offline   Reply With Quote
Old 02-04-2005   #59
orion
 
 
Join Date: Jun 2004
Posts: 1,044
orion is a splendid one to behold

Quote:
Originally Posted by randfish
#3 Analyze the most frequently occurring 1, 2, 3 & 4-word phrases among the documents (discount duplicate entries from a single row/page).
#4 Take the keywords that show up in 5+ entries and conduct a C-Index (Keyword Co-Occurrence) calculation for each.
Just be sure to assign the proper meaning to the computed c-index.

If computing co-occurrence for phrases, be sure to instruct the system to recognize a term sequence as a phrase. Keep in mind also that this will force an ordering element into the retrieved set of documents, thus excluding documents without the target sequence.

Once you have identified the terms you must do an on-topic analysis. I wish you could be at SES NY.

Orion
orion is offline   Reply With Quote
Old 02-04-2005   #60
xan
Member
 
Join Date: Feb 2005
Posts: 238
xan has a spectacular aura about
You're right, Orion, but you can get around this by using the tf for everything from single words up to n-grams and then doing an idf score on those.

You can then see where the precision decreases.
xan is offline   Reply With Quote