#1
Filthy Linking Rich
Just read Filthy Linking Rich by Mike Grehan.
It brings up the theory that popular sites will continue to get more popular and attract new links, while new sites will find it increasingly difficult to pick up any links or gain any sort of popularity. He goes on to say that this causes new sites that may hold a wealth of information to fall off a search engine's radar. Quote:
#2
Without reading more than the excerpt of Mike's article, I can tell you from experience that he's dead-on target. Linkpop also tends to favor type-in domains.
Engines using clickpop (like the old Snap) are also skewed by site age and who got in position first. Relevancy is a bitch, isn't it?
#3
rcjordan, I really encourage you to read ALL the article as I'm sure you will learn quite a lot.
Anyway, I think a more on-topic excerpt from this article would be: Quote:
#4
I quite agree - been observing this for some time - that's why I've stated elsewhere that modern SEO is born of the need to give newer sites the attention they deserve.
I also sometimes use a metaphor, of how Google's system of counting "votes" would effectively give Adolf Hitler a seat in the 2004 German government, on the grounds that he got quite a lot of votes a while ago, so they should still be counted now. Even though that's obviously not preferable, the trouble is, without linkpop you have on-page only and AltaVista all over again.
Linkpop evens the field a lot more - but something needs to be done not simply with how the linkage data is analysed, but also to help with analysing the actual content of the page itself. So, as I said to Nacho, linkpop is "mostly fair" - at least, with regards to the current practical options being presented. But linkpop has its limitations, and these are not yet being addressed.
That's why when John Scott was asking what we'd like to see in a new algo for search, I recommended randomising the rankings a lot more, to give other relevant content a fighting chance of exposure, instead of letting a handful of pages continually dominate any particular SERP.
Just my opinionated 2c.
EDIT: And now to read Mike's article - been waiting for this since he mentioned working on it a while back.
EDIT 2: Just read Mike's article - what I really like about his approach to marketing is that he doesn't take a partisan approach. He simply reveals that here is a huge set of information called the internet, and that assigning any relevancy to the actual data is a mammoth and incredibly complex job. Namely, this is because as humans we can make various conscious judgements regarding the relevancy of a page, according to our own subjective experience. Yet these acts of conscious judgement, that we take so much for granted, are beyond the means of even massive software development programs such as search engines. And the day that search engines can return entirely relevant results to individual users is perhaps the day that search engines pass a Turing Test.
I don't know about you, Mike - but that would scare me just a little.
Last edited by I, Brian : 10-08-2004 at 03:27 PM. Reason: added new comments, rather than new post
#5
>you to read ALL the article as I'm sure you will learn quite a lot.
Damn, Mike, the Animals -- really? But I must admit I'm confused by this article ... I thought your dad was a circus performer?
#6
The Grehan essay merely states the obvious. Some social network software and blogging fans have paid a fair amount of attention to the linking system between blogs. Clay Shirky has some information on power law distribution and some graphs that may be helpful.
Webmasters have even correlated Alexa rankings to web traffic and have come up with this graph. Alexa is skewed toward those who tend to use their toolbar, as argued here, but that doesn't change the basic power-law distribution. Why do you think Alexa's graphs are logarithmic? It's because if you want to spread out the distribution or resolution more evenly across a scale, and your data is power-law data, you have to use a log scale.
What the Grehan article fails to mention is why Google uses link popularity. It's because by using a metric that is independent of the content of the page, they can presort the docIDs in the inverted index. Then you only have to scrape off the top documents when you get the user's search terms. You don't have to look at the entire list of pages that contain those terms. The additional algorithms that need to be applied with reference to the user's terms now run over only an extremely tiny subset of the available documents for those terms. Since it was presorted by PageRank, it cost you nothing at search time to grab this subset. It's important to save this on-the-fly time, even at the cost of spending a few days calculating PageRank the way that Google used to. It's ranking on the cheap.
Google's genius was to make it sound like a brilliant step forward. Actually, it was merely a way to dramatically reduce their CPU overhead for answering searches.
Google made two additional shortcuts after their original sin of installing and then hyping PageRank. One was that they actually amplify the power-law bias by giving more weight to pages that are already "important." It's not merely the number of links, which is already horribly, hopelessly skewed by the power law, but Google comes along and pretends that this isn't a bug that needs correction, but rather a feature that needs amplification. The other sin was that at least until April 2003, Google crawled the web in PageRank order. Sites with higher PageRank got crawled earlier, deeper, and more reliably.
Google has been all hype from day one, and the web has suffered for it.
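For anyone who wants to see the presorting idea in miniature, here is a rough Python sketch. It is only an illustration of the general technique described above, not Google's actual implementation: the names `build_index`, `top_candidates` and `static_score` are made up for this example, and `static_score` is just a stand-in for any query-independent score such as PageRank.

```python
from collections import defaultdict


def build_index(docs, static_score):
    """Build an inverted index (term -> list of doc ids).

    Each posting list is presorted, best first, by a query-independent
    score. This is the expensive offline step.
    """
    postings = defaultdict(list)
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            postings[term].append(doc_id)
    for doc_ids in postings.values():
        doc_ids.sort(key=lambda d: static_score[d], reverse=True)
    return dict(postings)


def top_candidates(postings, terms, k):
    """Return up to k docs containing every term, best static score first.

    The walk over the first term's presorted list stops as soon as k
    matches are found, so the tail of that list is never visited.
    (Building sets for the other terms does read their full lists; a
    real engine would avoid that, this sketch does not.)
    """
    terms = [t.lower() for t in terms]
    if k <= 0 or not terms or any(t not in postings for t in terms):
        return []
    others = [set(postings[t]) for t in terms[1:]]
    hits = []
    for doc_id in postings[terms[0]]:
        if all(doc_id in s for s in others):
            hits.append(doc_id)
            if len(hits) == k:
                break  # early exit: the rest of the list is ignored
    return hits
```

Any query-dependent ranking would then only need to run over the handful of ids that `top_candidates` returns, which is the cost saving the post is talking about.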
#7
Google certainly could be doing a better job with search results, but I think that MSN and Yahoo! need to compete with them. The results at Google are currently 'better' only because Google crawls more of the web, more often than their competitors.
The problem with using link popularity is the Catch-22 of popular sites becoming more and more popular and linked to. The promise of clustering may solve some of this problem by providing more unique results per query than a straight search. If users had the option to narrow searches by topic, flavor, etc., far more sites would be given publicity than under the current status quo. We'll just have to wait and see - it's a big issue though and one that will probably get much worse before it gets any better...
#8
The whole rich-get-richer aspect of link popularity algos could easily be corrected just by adding one more variable to the equation. It is really quite simple.
#9
Quote:
Google still claims that "PageRank relies on the uniquely democratic nature of the web by using its vast link structure as an indicator of an individual page's value." The unwashed masses and mass-media pundits bought this completely. The folks on top know better, but they also knew to keep their mouths shut because Google was handing them ownership of the web by ranking them on top. Now the little guy on the web is almost extinct, and PageRank is but another footnote in the sordid history of human inequality.
#10
Everyman, just because they are not implementing a more egalitarian link-pop algo doesn't mean it isn't possible. Yesterday it just hit me in the head how a simple formula could correct the imbalance. I am not really ready to openly talk about the idea, but we can correspond via email about it if you are interested.
#11
Quote:
But then again, the purpose of good search results is not the same as democracy anyway. It's not about being fair, but being fast and accurate and "good". I don't care that it might be fair to mention 100 other similar businesses to what I am looking for if what I get is good enough. I am not looking for 100 but for one. The right one.
#12
Quote:
#13
Quote:
I don't think the "democratic" element of PageRank ever worked ... except as a public relations driver. However, this is by far not the only paradox with Google, but that's another topic.
#14
Indeed, and a democratic process usually occurs within a well-defined time frame. The issue of link pop does not. Hence my comment that the Google democracy would take Hitler's votes into account for deciding the next German government.
That isn't intended as an attack on Google, by the way - I'm not associating Google with fascism. It's simply a somewhat extreme demonstration of why counting older "votes" doesn't necessarily add any relevance - and could detract from it.
In practice, I figure that's why AltaVista ranks higher than Google for the search term "search engine", even though Google has dominated for years and AltaVista has been effectively dead in the 21st century.
Old "votes" can damage relevancy. And it's certainly not democratic.
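To make the "old votes" point concrete, here is a tiny Python sketch of one possible fix: weighting each link by its age. This is purely a hypothetical illustration, not something any search engine is known to do; the function name `decayed_votes` and the one-year `half_life_days` default are assumptions picked for the example.

```python
def decayed_votes(link_ages_days, half_life_days=365.0):
    """Sum link "votes", halving each vote's weight every half_life_days.

    link_ages_days is a list with one entry per inbound link: the age
    of that link in days. A brand new link counts as 1.0.
    """
    return sum(0.5 ** (age / half_life_days) for age in link_ages_days)
```

With a one-year half-life, ten links that are each five years old add up to less than a single fresh link, so a dormant site's old votes would stop outweighing a handful of recent ones.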
#15
Quote:
Quote:
I am sure more detail, and more of the 'why' behind what they do, can be found in the 3rd edition of his book. Let's give Mike some credit, he spends a heck of a lot of time researching, traveling to meet experts and sharing this information with us. I for one have learned a lot from Mr. Grehan.
#16
Is it just a happy coincidence that Google not allowing new sites to rank forces them to use AdWords to generate traffic, thereby making Google's bottom line look better?
When the PPC results are fresher than the organic results, thereby attracting clicks, that doesn't hurt the bottom line in the short term either. I guess that is the real reason "why".
#17
Quote:
Mike's article references a research paper that has stats suggesting this is the case. Quote:
Quote:
The experiment gathered all the pages from 154 web sites, then started following outbound links from those web sites to build a link graph of the web, from what I understand. Then they did the same seven months later. That let them know which pages got the most links -- thus which pages THEY believe are more popular and seen by more users.
Of course, counting only links doesn't take into account link context. Go on, get 1 million people to link to George W Bush's official biography with words OTHER THAN miserable failure in or near the link. Now that page isn't going to rank well for those words. The reason it does at the moment is because the links are combined with those terms.
Link popularity is numbers. Link context is the words in or near the link. Combine them together, you get link analysis, which is what search engines use to help rank web pages.
All the experiment shows is that some unknown pages with a lot of links got more. But these could be poor quality pages. They might be all spam pages, getting all spam links. More important, we don't know how they rank for anything. No one tried to see what terms these pages were showing up for in search engines.
The idea that no one sees the poor pages with no link gain? Untrue. You could only say that if you knew exactly what pages were in the bottom and looked at traffic patterns. Heck -- if they'd taken those results and even grabbed some basic Alexa data, we'd have a better idea if being "link poor" really had an impact.
They do later in the report try to get some quality into the experiment by also computing alleged PageRank scores for each page. Alleged? Yes, since they are using a past formula that Google published which may not be in use today. The formula, for example, may not be discounting various links that Google, using some spam filtering techniques, might not pick up. OK, those with more inbound links also had PageRank increases. But that still leaves out the fact that PageRank still doesn't equate with getting top rankings.
Amazon is a PR8, comes up top for books. Now search for cars. Not there -- and not there because no one links to it with the word cars in or near the links. Context can trump PageRank. PageRank is merely the catch name for the link popularity part of link analysis. Heck, Google's a PR10 site. If PageRank beat all, you'd see it always showing up for everything. That certainly does not happen.
So the study doesn't prove anything to me. Instead, I'm stuck with relying on anecdotal info. Plenty of people certainly feel like new sites have no chance, based on their own personal experiences. I can't discount that. It may be this is exactly the case they are finding. However, my impression is that new sites rank well all the time. I get that mainly from the fact I'm not flooded with hearing from people I talk with suggesting they have new sites that do poorly. The opposite seems to be true. People still seem to launch new content and do well on Google and other search engines all the time.
The whole SEMPO tahoe thing is an example of this. That site moved into the top 10 for SEMPO before many people really seemed to be linking to it. It's still there rock solid now, number three when I last looked. That's a brand new site. Where's the sandbox? Where's the no reward to new sites? How about Coke's C2 web site? Not that old, yet it ranks number three when I look. Shark Tale? How old can that site be, yet it's number one (as it should be). How about a search for Gmail? I see Gmail Swap at number three -- can't be that old. And the Gmail Drive extension site? That's maybe two or three weeks old, if I recall. And how about Everyman's Gmail is too creepy page? Again, not that old.
There's no doubt in my mind that older sites WITH GOOD CONTENT will have an advantage with link analysis. There's also every reason they should. If they've become great resources on the web, still provide good information, why wouldn't you want them to rank well?
There's also the fear, however, that results can be a self-fulfilling prophecy. What if you search and always get the same thing? Ironically, while that sounds bad, after last year's big Google update many site owners were upset precisely because long-standing rankings they held in Google were lost. I think the answer is that in part, it will depend on the topic. If the topic has shifted, if there are new resources, I'd hope the search reflects some diversity in the results that come up.
The new Snap.com service offers another approach. Rather than rank primarily on link analysis (which is the heaviest component I think many will agree at Google and others), you can choose to instead see another metric, like traffic popularity. And the pitch is down the line, you could rank by "satisfaction." I think having those options is nice. But for the typical searcher, I think they are also confusing. Ages ago, you'd do a search at MSN and get a little "See Top 10 Most Popular Results" link to view an alternative view of the search by Direct Hit, as measured by clickthrough analysis. The problem I always saw with this is that users are going to assume that by default, you're serving up the best pages. They don't, in my view, see a difference between popularity and relevancy (and as we've seen, popularity can be measured in different ways). At some point, you decide on a metric blend and go with it. That's not going to be perfect. More options for sorting are welcomed, too. It's just I doubt many will make use of them.
Lastly, what is true is that link analysis is far less useful than in the past. Today, we have people building up extensive and clever artificial link networks. We've got bloggers shifting the linkosphere in major ways. We've got linkbomb campaigns happening. Personalization of results offers one way we'll advance beyond this. Specialization of results, specialized databases for certain types of searches will also help. And new metrics will come in. Heck, maybe the search engines will remember one day that it can be helpful to have a few human editors review what you're putting out, as well.
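The popularity versus context distinction above can be shown with a toy Python sketch. It assumes a very simplified model in which each inbound link is just a (target page, anchor text) pair; the function names `link_popularity` and `link_context_score` are invented for this example, and real engines certainly weigh far more than exact anchor words.

```python
from collections import Counter


def link_popularity(links):
    """Raw inbound link count per target page: numbers only, no words."""
    return Counter(target for target, _anchor in links)


def link_context_score(links, query):
    """Count only inbound links whose anchor text contains every query word."""
    words = query.lower().split()
    scores = Counter()
    for target, anchor in links:
        anchor_words = anchor.lower().split()
        if all(w in anchor_words for w in words):
            scores[target] += 1
    return scores
```

In this toy model, a page with thousands of links whose anchors say "books" scores zero for "cars", while a page with a single "used cars" link scores one. That is the Amazon example in a nutshell: context can trump raw popularity.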
#18
>The whole SEMPO tahoe thing is an example of this.... Where's the sandbox?
Less than 50,000 results.
>How about Coke's C2 web site? Not that old, yet it ranks number three when I look.
For coke I stopped at 100 results, no show. For coke c2? Less than 50,000 results.
>Shark Tale? How old can that site be
Over a year according to whois, December 2003 says wayback.
>How about a search for Gmail? I see Gmail Swap at number three -- can't be that old.
Very recent at May 2003, check the backlinks though.
>And the Gmail Drive extension site? That's maybe two or three weeks old, if I recall.
I don't see that at all, not sure if you mean a new section of an old established site or not?
>And how about Everyman's Gmail is too creepy page? Again, not that old.
Hand boost
Trust me the sandbox is very, very real. There are ways round it and these very clearly show some of the factors at play.
#19
For the record, Google says there's no sandbox. That's the official line.
There are so many reports from others and weird things I've seen as well that I tend to believe the idea of something sandbox-like. I just think the impression some have that this applies to any new site, any new link may be far too broad. Quote:
viksoe.dk - GMail Drive shell extension is a brand new page, as far as I can tell. The date of the page itself is Oct. 4. I remember it mainly from seeing a bunch of blog posts earlier this month as a new way to turn Gmail into a virtual drive. It comes up on a search for gmail that brings up over 3 million matches.
Yes, I suspect the domain isn't brand new. I suppose the argument could be that if it was a brand new page, in a completely brand new domain, then it might be a different story. Perhaps. I think not, but I certainly can't say definitely.
How about a sound off? Anyone seen a brand new web site rank well for any terms? Competitive terms? Anyone not? Name the site and terms if you like or just chime in generally.
My original post is mainly to say that I've seen these types of general statements, yet I also feel like I see lots of exceptions to them. I think it's much more gray. I don't think that it's the rich get richer, though plenty of them WILL add to their wealth. But nouveau riche also come onto the scene for lots of other factors. And, I expect some of the rich go bankrupt over time, as well.
Last edited by dannysullivan : 10-17-2004 at 10:08 AM. Reason: fixed spelling
#20
>I did a search for c2, 12 million matches
I stand corrected. I can spot 9/10 sandboxed sites without looking at the site; that is the 1/10 that stumps me.