Join Date: Nov 2005
I came into the issue a little later than some. Like Jill, I don't work on too many sites that are newly registered. About Oct. of 2004 I was working on a site that truly baffled me; I hadn't seen anything like it before then. It showed the same exact characteristics that bwarne and Rand describe. Performance for nothing but the longest-of-the-long tail terms.
Back then, SEO Chat and a few others were talking about it, but not too many. I think many sites were in it, but not too many had come out of it yet. A lot of people felt vindicated when Danny summed up sentiment at the SES Winter 2004 keynote (as summarized by Barry):
In the last couple months, the pendulum has swung back pretty far. It's sort of popular now to say it just doesn't exist. When you get into these articles deeper, they say that "all you need to do to avoid it" is A, B, C, F, G, and 90% of Q. To me (along with what I've seen), that just confirms that it's real.
Last edited by erik : 12-23-2005 at 11:41 AM.
Join Date: Oct 2005
seobythesea has a page rank=0
I would like to see a better example of a site in the sandbox, because randfish has given us one that obviously has some problems.
The non-www version of seobythesea has a Page Rank=0, while the www version has a Page Rank=4. If I can see this, then Google's bots can see this as well, so there is bound to be confusion, particularly if the site was originally submitted without the www as a prefix.
Also, the large number of pages appearing in Google's supplemental index indicates that there was something else that was once very wrong with this site, and that an attempt to fix it was made some time ago.
So, it makes perfect sense to me that this site would sit in a "sandbox" while Google's algorithms sort it out. Eventually, it will suddenly appear in the Google SERPs, and its creator will naturally assume that there is a sandbox.
There are also other problems with this site that may be causing the delay, so if anybody has another example that they would like to share, please provide it...
Join Date: Jul 2004
Oversees: Search & Legal Issues
Join Date: Jun 2004
Location: Calgary, Alberta, Canada
I wrote a series of articles months ago that actually detailed what I felt the sandbox effect was, as well as how to get in/out of it, etc. I later deleted them and have since kept my results and tests fairly quiet for personal reasons.
As a result I can't remember the exact wording I used at the time, but one part of it went something like this:
The key to SEO (any type of SEO, not just sandbox avoidance) isn't links, or hilltops, or content or even trust. Trust is the closest - I just don't like the word "trust" used in conjunction with a search for a really bad site, for example. No, the holy grail, in my opinion, is confidence.
The more a search engine can be confident that the result it supplies to you is what you are looking for, the more likely you are going to be supplied with that result - i.e. the higher the site will rank.
Things like links, and content, and authority and all that stuff are just methods of attempting to ascertain how confident a search engine can be in presenting the site.
This may seem obvious or trite to someone not used to thinking things through very deeply, but put down the eggnog and indulge me for a moment...
Stop thinking about links and content. What else would inspire confidence in a website? What about the lack of duplicate data? What about a URL structure that lets the search engine know that it's definitely not indexing the same 15 pages over and over again? What about a server NOT going down all the time? What about links outwards to sites that are known to be useful to searchers for the content they've just searched on? What if the site approaches the search term from a different angle than most of the other sites (i.e., it's a museum or directory rather than a commercial site)?
What about how long people link to it? A site that people link to for 2 months and then stop is probably not a good site (and probably buying or trading for them, or doing some sort of serial linking campaign). A site that has static 4-year-old links from trusted authority sources is probably a good site.
All of these things can affect the confidence levels a site has as a result for a particular query, or for a position on a results page for a particular query.
Of course, these are usually not yes/no answers - if you only rate a 46% confidence level for a keyword, that kind of sucks (I'm making these numbers up for illustration ONLY), but if the other choices are all 22% or lower, then you will be firmly placed in a top position, even though frankly it's not that great of a site. Just because a site is number one doesn't mean it's a good site, it's just considered the best of a bad lot.
I want you to look at something - find a site that is in a sandbox, and look at a keyword that it ranks for. Now look for the closest Supplementary Result. See a connection? Now think about what Supplementary Results are, and what that connection means. Look really, really close.
Sandboxed sites usually appear immediately above supplementary results. If there are no displayed supplementary results for a search (because there are so many other ones that the search engine can show instead), your site probably won't show up.
The Supplementary Results are a separate database of "last gasp, only show if nothing else works" results. They have a confidence score (else they would not show up at all), but it's extremely low. These include pages that either go down a lot, or that have been recently not found but used to be good, etc. In short, they are on topic, but there is almost no confidence in them.
I've noticed that "sandboxed" sites typically are sites whose confidence score is very low, but better than the ones in the supplementary results database (I suspect that they are the lowest or bottom results in the normal database).
That's a fairly accurate method to tell if something's been sandboxed. Find its relation to the Supplementary Results for that search term. It's not the only method, but it's quick and easy.
The sandbox has nothing to do with trust or age, or ccTLD - it's all about confidence, IMO. If you want to declare all sites that have very low confidence ratings as "sandboxed", then fine. For me, they are just sites that the search engine isn't confident about (yet).
It's perfectly possible (even common) for a site to be highly relevant, but not be assigned a high confidence level due to other factors.
IMO, the sandbox effect is related mostly to the length of time a domain has had particular links pointing to it, which is actually very different from site age itself. An old site with no links to it will be "sandboxed" counting from the first day new links are discovered. Likewise, an established site that resets its historical data through a redirect, merge, or change in ownership/direction will often suffer the same effect.
Since link age is only one criterion, a site that can show itself to be trustworthy because of other factors (i.e., really, really good links, etc.) would override the negative aspect of the young links.
It appears you need links for about 6 months before Google begins to be confident that they are permanent links and gives you full credit for them. In short, you need at least 6 months of historical data. Since it usually takes 1-3 months for a new site to be fully spidered, you will note that the most common "sandbox" times are 6 + (1-3), or 7-9 months. It could be as soon as 6 months and one day, or as late as 12 months, but I most often see 7-9 as the common range for standard (non-aggressive but competent) sites.
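The arithmetic above can be sketched in a few lines of Python. The 6-month link-history figure and the 1-3 month spidering range are only the estimates given in this post, not anything Google has published, and the function name `sandbox_window` is made up for illustration.

```python
# Rough sketch of the "6 + (1-3)" estimate from the post above.
# All numbers are the poster's observations, not confirmed Google behavior.

def sandbox_window(link_history_months=6, spidering_months=(1, 3)):
    """Return (earliest, latest) typical months before links get full credit."""
    fastest_crawl, slowest_crawl = spidering_months
    return (link_history_months + fastest_crawl,
            link_history_months + slowest_crawl)


earliest, latest = sandbox_window()
print(f"Typical range: {earliest}-{latest} months")  # Typical range: 7-9 months
```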
A brand new site launched by a very trustworthy company, or a site that has garnered lots of natural links, may easily be deemed as a site a search engine can present as a result with confidence, regardless of the youth of its links. Young links are only one aspect of the whole thing; that's why (IMO) there are so many exceptions to the so-called "sandbox".
You can also avoid the effect if the site is assigned some of the historical data of another via a merge of some sort.
My suggestion for SEO in 2006 - make your site one that a search engine could show with complete confidence to a searcher for your term. Make sure its technology is sound, its links trustworthy and its content useful. If that sounds like what the search engines have been preaching all along, it's because it is - they are just finding different ways of measuring it.
Of course, I'm sure some people's response to all this will be along the lines of the old joke: "The secret to success is sincerity - once you can fake that, you've got it made!"
Last edited by mcanerin : 12-23-2005 at 03:20 PM.
Oversees: Search Technology & Relevancy
Join Date: Jun 2004
"Age" of site in Baeza's papers does not refer to when the site was created. They use "age" in a loose sense to indicate modifications to pages; in particular, it refers to the age of links and to an age-based PageRank.
The following is from Section 5 of
Web Structure, Age and Page Quality
5. AN AGE BASED PAGERANK
"Suppose that page P1 has an actualization date of t1, and similarly t2 and t3 for P2 and P3 such that t1 < t2 < t3. Let's assume that P1 and P3 reference P2. Then, we can make the following two observations:
1. The link (P3,P2) has a higher value than (P1, P2) because at time t1 when the first link was made the content of P1 may have been different although usually the content and the links of a page improves with time. It is true that the link (P3, P2) could have been created before t3 but the fact that was not changed at t3 validates the quality of that link.
2. For a smaller t2-t1, the reference (P1, P2) is fresher, so the link should increase its value. On the other hand the value of the link (P3, P2) should not depend on t3-t2 unless the content of P2 changes."
The authors actually mentioned the problem of using site creation as age as opposed to link age and suggest the following solution.
"A problem with the assumptions above is that we do not really know when a link was changed and that they use information from the servers hosting the pages, which is not always reliable. These assumptions could be strengthened by using the estimated rate of change of each page.
Let w(t, s) be the weight of a link from a page with modification time t to a page with modification time s, such that w(t, s) = 1 if t>= s or w(t, s) = f(s-t) otherwise, with f a fast decreasing function. Let Wj be the weight of all the out-links of page j, then we can modify Pagerank using:
PR_i = q + (1 - q) * SUM_j w(t_j, t_i) * PR_{m_j} / W_{m_j}
where tj is the modification time of page j. One drawback of this idea is that changing a page may decrease its Pagerank."
The idea here, according to Baeza et al., is that each time a page is changed but its links stay, the links on that page are implicitly reaffirmed.
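To make the quoted formula more concrete, here is a minimal Python sketch of it. Several things are my assumptions and not from the paper: the paper only says f is "a fast decreasing function", so the exponential decay and the `decay` parameter are stand-ins; the page names A, B and C, the timestamps and the iteration count are invented for illustration.

```python
import math


def link_weight(t_src, t_dst, decay=1.0):
    """w(t, s): 1 if the linking page was modified at or after the target,
    otherwise f(s - t). The exponential f is an assumption."""
    if t_src >= t_dst:
        return 1.0
    return math.exp(-decay * (t_dst - t_src))


def age_weighted_pagerank(links, mod_time, q=0.15, iterations=50):
    """links maps a page to the pages it links to; mod_time maps a page
    to its last-modified time."""
    pages = list(mod_time)
    # W_j: total weight of all the out-links of page j.
    total_out = {
        j: sum(link_weight(mod_time[j], mod_time[i]) for i in links.get(j, []))
        for j in pages
    }
    rank = {p: 1.0 for p in pages}
    for _ in range(iterations):
        updated = {}
        for i in pages:
            incoming = 0.0
            for j in pages:
                if i in links.get(j, []) and total_out[j] > 0:
                    w = link_weight(mod_time[j], mod_time[i])
                    incoming += w * rank[j] / total_out[j]
            updated[i] = q + (1 - q) * incoming
        rank = updated
    return rank


# Hypothetical example: A (modified at t=2) links to B (t=1) and C (t=5).
links = {"A": ["B", "C"]}
mod_time = {"A": 2, "B": 1, "C": 5}
ranks = age_weighted_pagerank(links, mod_time)
print(ranks)
```

In this toy graph B receives most of A's rank, because A was modified after B and so that link counts in full, while the link to the more recently modified C is discounted. Note that because each weight is divided by W_j, the weighting only redistributes rank among a page's own out-links; a page with a single out-link passes the same rank either way.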
Studying the phenomenon with a small subgraph such as TodoCL was a good choice, since the researchers were able to work in a less noisy environment. However, once the phenomenon became known, some dismissed it as a feature of small subgraphs or a consequence of "the rich getting richer".
The fact is that the phenomenon studied by Baeza was different from the one reported by Barabasi, et al (the "rich-get-richer"), which is a more global and older phenomenon, observed across several length scales, and that predated Google.
By 2002-2003 both phenomena were well known and widespread and could no longer be ignored, proving fallacious the notion of PageRank as a democratic scoring system that levels the field.
While only Google can confirm, it is possible that around those days they decided to impose a very wise "wait and see" approach in an attempt to level the field, which is now referred to as sandboxing. Thus the notion of placing some sites in a "waiting" or "maturing" status, which some refer to as sandboxing, is only valid on the grounds of cause-effect arguments.
Again, this research was from 2000-2001 and published in 2002 and further expanded. Was Google aware of these results? Let's look at two places in Baeza's paper: the conclusion and a little footnote after the Reference section (the link may not work since it is very old):
"In this paper we have shown several relations between the macro structure of the Web, page and site age, and quality of pages and sites. Based on these results we have presented a modified Pagerank that takes in account the age of the pages. Google might be already doing something similar according to a BBC article 3* pointed by a reviewer, but they do not say how. We are currently trying other functions, and we are also applying the same ideas to hubs and authorities."
"3* ****://news.bbc.co.uk/hi/english/sci/tech/newsid_1868000/186395.stm(in a private communication with Google staff they said that journalist had a lot of imagination)."
End of the quote.
Back in August, Rand kindly sent me his San Jose SES presentation on link analysis for a quick review, and I suggested that he modify it to include the rate of change of pages, including the rate of change of links. The reason is that when dealing with temporal data you need to look at change ratios, not absolute counts.
The second article from Baeza, et al, Web Dynamics, Structure, and Page Quality, expands on both the age of pages and rates of changes. One needs to look at date of creation, updates, deletions, and age of page; the latter is different from date of creation: "We focus on webpage age, that is, the time elapsed after the last modification (recency)," they stated.
In particular they looked at:
The data is obtained by repeated access to a large set of pages during a period of time.
Notice that in all cases the results will be only estimation because they are obtained by polling for events (changes), not by the resource notifying events.
For each page i and each visit i there is:
The access timestamp of the page, visit_i.
The last-modified timestamp (given by most web servers; about 80%-90% of the requests in practice), modified_i.
The text of the page itself, which can be compared to an older copy to detect changes, especially if modified_i is unknown.
There are other data that can be estimated sometimes, especially if the revisiting period is short:
The creation timestamp of the page, created_i.
The delete timestamp of the page, when the page is no longer available, deleted_i.
There are different time-related metrics for a web page; the most common are:
Age: visit_i - modified_i
Lifespan: deleted_i - created_i
Number of modifications during the lifespan: changes_i
Change interval: lifespan_i / changes_i
For the entire web or for a large collection, useful metrics are:
Distribution of change intervals.
Time it takes for 50% of the web to change.
Average lifespan of pages.
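The per-page metrics listed above are easy to express in code. The following is only a sketch: the class name `PageObservation`, the choice of days as the time unit and the sample numbers are my own, not from the paper, and as noted above real values would be estimates obtained by polling.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional


@dataclass
class PageObservation:
    """Timestamps for one page, in days since some arbitrary start (assumed unit)."""
    visit: float                     # access timestamp of the visit
    modified: float                  # last-modified timestamp
    created: Optional[float] = None  # often unknown, may be estimated
    deleted: Optional[float] = None  # set once the page is no longer available
    changes: int = 0                 # modifications seen during the lifespan

    def age(self) -> float:
        # Age: visit - modified
        return self.visit - self.modified

    def lifespan(self) -> Optional[float]:
        # Lifespan: deleted - created, only defined when both are known
        if self.created is None or self.deleted is None:
            return None
        return self.deleted - self.created

    def change_interval(self) -> Optional[float]:
        # Change interval: lifespan / changes
        span = self.lifespan()
        if span is None or self.changes == 0:
            return None
        return span / self.changes


def average_lifespan(pages: List[PageObservation]) -> Optional[float]:
    """Collection-level metric: mean lifespan over pages where it is known."""
    spans = [p.lifespan() for p in pages if p.lifespan() is not None]
    return mean(spans) if spans else None


# Made-up example page: created on day 0, deleted on day 240, changed twice.
page = PageObservation(visit=100, modified=70, created=0, deleted=240, changes=2)
print(page.age(), page.lifespan(), page.change_interval())
```

Pages that are still live have no delete timestamp, so their lifespan and change interval come back as `None` here instead of being guessed.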
"One of the most important metrics is the change interval; Figure 1.2 was obtained in a study from 2000 . An estimation of the average change interval is about 4 months."
The gist of this research was that with temporal Web data, weight occurrence and measurement are not necessarily synchronized events. Mismatches can accumulate across scales, to a point that can no longer be ignored. The time it takes for 50% of a graph to disappear does not help either.
Last edited by orion : 12-23-2005 at 08:10 PM.
Join Date: Oct 2004
Excellent post, Ian. I've no idea if you've hit the nail on the head or not, but it's definitely food for serious thought.
Join Date: Aug 2004
Location: Warrenton, Virginia
Since it was specifically brought up as an example, I figure I should address whether www.seobythesea.com is in the sandbox.
I don't think so. There's more of a "cobbler's children have no shoes" effect going on there.
The blog was originally created as a one-shot on June 22, for an event hosted in August. It was picked up and linked to within a day or two on Search Engine Roundtable, the SEW blog, Threadwatch, Search Engine Journal, Cre8pc Blog, SEO Book, Cre8asite Forums, High Rankings Forum, Search Engine Watch Forum, SEOmoz, BPWrap, Gray Hat News, and more. With the RSS feeds on those blogs, a search in Google for "SEO by the Sea" was showing more than 13,000 results in less than a week.
That's what it was intended to do.
I've made a few posts there since then, but not really regularly until the last couple of weeks. I didn't do much beyond adding some posts. It suffers from many of the problems of a WordPress blog, but I did spend a little time today tweaking it to fix some of those issues.
I changed the vhost.conf file to resolve the canonical URL issue, added a robots.txt file to address duplicate content under different URLs, tweaked titles so that post titles come before the site name, fixed or removed some bad links (even nofollowed ones), restricted blog posts to single categories instead of multiples, and I even made my first submission of the site anywhere, at DMOZ. It still has some work needed.
Yahoo! and MSN are a lot more forgiving than Google, but both picked up on the fact that the second blog post was titled the same thing as what was in the title field of the software. The page title shows both blog post title and title field, so the phrase was repeated twice. Both of those search engines list that second page before the domain index page. Guess they give page title a lot of weight.
Sandbox? I've read the papers that Orion cited above, and they are definitely worth reading, but I think that it's possible many sites claimed to be in the sandbox have other issues. I know mine does. Good to have a holiday break to make some shoes for my own children.
Last edited by bragadocchio : 12-24-2005 at 03:06 AM.
Join Date: Oct 2004
A search engine can measure their degree of confidence that a site is good, but it doesn't mean or imply any trust in the site.
Last edited by PhilC : 12-24-2005 at 09:31 AM.
Join Date: Sep 2004
Location: Seattle, WA
Bill, my point isn't that everything is done perfectly on SEO by the Sea, it's that Yahoo! and MSN (and Google prior to March 2004) never did anything like this. If 100 sites pointed to a site with anchor text saying something unique, and that was the title of the site, Google always used to rank that site for that phrase.
A few SEO errors or non-canonical issues couldn't hurt that. Even dual competing names wouldn't affect it - there's a very different filter going on now at GG and that's what I call sandbox.
If there's something else you call "sandbox", please let me know. I've always referred to this effect (of sites that obviously should be ranking for something not ranking for it) as "sandbox."
AND - I bet dollars to donuts you could fix nothing about the site, Bill, and 3-6 months from now, during an update of some kind at Google, your site, along with a few dozen others, would all "suddenly" be ranking competitively for your respective phrases and terms. I've seen this many, many times and it's very consistent.
When SEOmoz "hopped" out of the box it went from ranking in the 60s and 70s for its own name to #1 and ranking top 3 for dozens of other relatively competitive phrases. I didn't "DO" anything to the site and not surprisingly, I got emails that same weekend and saw threads started by other folks who had also jumped out.
Same phenomenon happened with avatarfinancial.com - the first site I observed in the sandbox. Also occurring with Etsy.com - try searches like http://www.google.com/search?q=buy+handmade+online - Etsy should be ranking #1 - their links slaughter anyone else's in the top 20, but nada...
Last edited by randfish : 12-24-2005 at 01:13 PM.
Join Date: Dec 2005
From dictionary.com (Webster’s Dictionary):
In any case, I think you are correct about the "sandbox" being fundamentally about confidence (or trust). As for all of the distinctions being pointed out about what is sandboxing and what is something else, most of it is just semantics that has no effect in the real world as to how one goes about optimizing a web site. I don't really care if Rand's example is a case of what some here call the sandbox and others call something else. What I know for certain (i.e., I am highly confident in the statement to follow and I have a high level of trust in its accuracy) is that the site would have ranked high on page 1 in Google within weeks had it been launched prior to March 2004, and after that time it does not.
This phenomenon is what I call the sandbox. Might there be a number of different filters, algo components, etc. at play in this phenomenon? Certainly. Is it of value to distinguish the various components and identify methods of addressing each in our optimization tactics? Of course. But simply saying "the sandbox doesn't exist" or "other issues are the cause of this..." adds no value IMHO.
SEO & SEM Norway
Join Date: Nov 2004
Location: Oslo, Norway
It is easy to say that great content will be linked to, but the truth is that in a non-English market like Norway there is not much of a culture of linking to others.
This means huge content and old sites have it much easier ranking for a lot of terms, while quality content sites are not going to the top of rankings, something that is quite a flaw in the way Google ranks pages.
Whitehat on...Whitehat off...Whitehat on...Whitehat off...
Join Date: Jun 2004
Sandboxing began as a process which simply delayed the impact of link anchors - once upon a time, dropping tens or hundreds of thousands of links for a semi-competitive term pointing to most sites would show ranking benefits on Google very quickly.
I think it's fair to say since then Google has developed the concept further, so what is now properly meant as the Google Sandbox refers to a set of "tools" within the algo to combat a range of potential spam elements.
However, while I disagree with Mike about his refusal to accept its existence, I think Mike's reasons for his objections are very sound:
Last edited by I, Brian : 12-26-2005 at 05:58 AM. Reason: formatting
Oversees: Search Technology & Relevancy
Join Date: Jun 2004
Most of the research papers I have read on this aging effect that feels, smells and looks like a "waiting period" or "box" have something in common: they discuss the effect of the Structure and Evolution of the Web and new non batch modes for crawling that structure, particularly with incremental web crawlers. Nothing like ending 2005 with more research readings.
Last edited by orion : 12-26-2005 at 06:14 PM. Reason: removing typos
Join Date: Feb 2005
Location: San Diego, CA
Something that Orion referenced struck me as possibly being an important factor behind sites I'm seeing leave the box early. It relates to the perceived value that links might possess when they remain constant on a page that has undergone other updates.
This may explain why some of the newer sites I've analyzed which have many natural, in-content blog links (blogs have replies added and are therefore updated) seem to be doing very well. However, that factor alone cannot be heavily weighted in this scenario, considering that a blog post normally receives updates only for a few days after it is originally posted, while a link directory page will likely receive regular updates as links are added every now and again (and we know how much link directory links are helping these days).
There may be something there though.
Last edited by NuevoJefe : 12-28-2005 at 03:52 PM.
Before deciding whether or not to heed your sage advice on this ridiculous Sandbox notion (an obvious excuse employed by the unwashed masses incapable of grasping even the most essential concepts of good marketing), I would like to know something about your qualifications:
How many completely new Web sites utilizing newly registered domain names have you created in the last two years?
Should the answer be, "0," I would politely suggest that you limit yourself to topics with which you might have some expertise.
Here is the best definition I've read:
"The observed phenomenon of a site whose rankings in the Google SERPs are vastly, negatively disparate from its rank in other search engines (including Yahoo!, MSN & Teoma) and in Google's own allin: results for the same queries."
As far as my own Sandbox experience is concerned I will provide a typical example:
One of my newer sites ranked #13 on Yahoo and #3 and #4 on MSN for a moderately competitive search term, yet was nowhere to be found on Google (#180+). Now I know that their algos differ, but they don't differ that much. Without apparent rhyme or reason, it suddenly emerged from a nearly five month tour of duty in the box to assume position #11 on Google. No spammy tactics, less than fifty hand-picked inbound links and lots of original content. It is presently ranked #3 of 2.44 million result pages.
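The definition and example above amount to comparing a site's Google position with its position elsewhere for the same query. A very rough Python sketch of that comparison follows. The function names and the `min_gap` threshold of 50 are arbitrary choices of mine for illustration, the Google position of 180 stands in for the "#180+" above, and the allin: part of the definition is left out.

```python
from statistics import median


def rank_gap(google_rank, other_engine_ranks):
    """Google position minus the median position on the other engines."""
    return google_rank - median(other_engine_ranks)


def looks_sandboxed(google_rank, other_engine_ranks, min_gap=50):
    # min_gap is an illustrative threshold only, not a known value.
    return rank_gap(google_rank, other_engine_ranks) >= min_gap


# Numbers from the example above: #13 on Yahoo, #3 on MSN, #180+ on Google.
print(rank_gap(180, [13, 3]))         # 172.0
print(looks_sandboxed(180, [13, 3]))  # True
# After the site emerged at #11 on Google:
print(looks_sandboxed(11, [13, 3]))   # False
```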
Mike--how would you explain such a meteoric rise in Google rankings?
Was I (a simple-minded sandbox believer) suddenly able to grasp the elusive intricacies of "good marketing?"
From WebProNews 11/16/05, Q&A With Google's Matt Cutts:
"Does the sandbox exist?"
"Here comes the audience participation part: Show of hands? Most say yes. The fact is that there are some things in our indexing infrastructure that could be perceived as a 'sandbox' effect."
"Additionally, at SES London in "Meet the crawlers", a small business raised the problem to Google of new sites being held back from ranking. There was a huge murmur in the room. The Google engineer responded that Google will act as it sees fit to control the SERPs, and effectively acknowledged that they are involved in some process to this effect."
"Some intrepid bloggers came away from the 2005 SES conference in San Jose with confirmation that yes, Google does place some new sites into a sort of temporary holding classification. Rand Fishkin of SEOmoz.org reports on a couple of conversations he had with some SEO gurus, including Google’s Matt Cutts that the sandbox does indeed exist, and it presents a difficult challenge for zealous search engine optimizers:
Greg [Boser of WebGuerilla] & Dave [Naylor] in particular had some choice words about the subject and I commented too. We all shared the opinion that ranking new sites at Google was a pain since the inception of "sandbox" and Matt noted (this is a near word-for-word quote) - "OK, so it's really working. Even on you (guys)."
Fishkin later spoke with some folks at Meet the Google Engineers, who also confirmed the existence of a sandbox, but who also noted that sites go through a filter which determines whether or not they find their way into it. Threadwatch.org member DougS also recalls listening to a Google engineer at SES, saying that the engineer did “openly acknowledge that they place new sites, regardless of their merit, or lack thereof, in a sort of probationary category.”"
Question for Matt Cutts:
"Does gravity exist?"
"Most say yes. The fact is that there are certain observable physical phenomena that could be perceived as a 'gravity effect.'"
Editor, SearchEngineLand.com (Info, Great Columns & Daily Recap Of Search News!)
Join Date: May 2004
Location: Search Engine Land
As it happens, I was at a friend's house yesterday who has a completely new site, only a month and a half old. He was wondering why another site was outranking him. I'm probably going to go into detail about what I found, but fair to say, there was a sandbox effect for everything I could see. It ranked for some terms but wouldn't rank for other ones that it absolutely, positively should have -- given the other terms it was doing well for.
It will be clearer when I get into more details. Waiting for DaveN to get back so we can do some more playing with it. But it all came down to a single word that as far as I can tell, when used caused a different set of ranking criteria to kick in -- and age of site simply had to be a factor in this, esp. given the crud and junk that was outranking it.
Which brings me to this point. My friend's not an SEO, just a guy with a small side business he was starting. He's sitting there not understanding what's going on and likely would start changing various things on the site to fix the rankings. But if it's a sandbox effect, none of that helps necessarily. I really had to think, "what would prevent this person from wasting time, if they didn't know about this." Going to the Google site wouldn't have said something like, "you're outta luck on some terms for 6 to 9 months."
Now I fully and absolutely admit there are things I'm sure will bump him out -- the right trusted links, esp. But if I sat you down and showed you the queries, even you, Mike, I think you'd be reaching for sandbox as the reason he doesn't rank at Google (but does at Yahoo and MSN).
Join Date: Aug 2005
I Almost Hope there *IS* a Sandbox
First off I should say that I am *not* a professional SEO, so my opinions may be way off base. I am a consumer, and I am the owner of an e-commerce site.
So from the standpoint of a consumer, I'm glad that Google waits a while before giving trust to new websites; it makes me feel more secure that when I am looking at a site that is listed in Google that it has either been around for a while, or there is a lot of buzz around the site. It gives me more confidence in the site.
As the owner of an e-commerce site, it was frustrating at first. We went through the whole, "hey wait, our site is designed well, looks great, performs awesome, meets all of the criteria, why are sites that have been around longer that look worse that have less of a selection showing up and we aren't?" phase.
But at the end of the day, the owner in me is glad that we didn't do better earlier. We have experienced a "ramping up" effect with traffic from Google that has given us the opportunity to tweak our internal operations. We have been able to work out kinks in our business, hire additional people in places we would not have expected to, change our look and feel in direct response to those people that did find us, and generally build a better end user experience.
Now that our company is just over a year old, we have a solid website built, our infrastructure is sound, and we are ready to accommodate a larger amount of traffic and business.
Although the site was designed with SEO/SEM in mind, we are now putting more focus on this aspect of the business, and are seeing a steady increase of traffic from Google, and we believe that what people find now is a much richer experience than if they had come to us a year before.