Search Engine Watch Forums > Search Engines & Directories > Google > Google Web Search


#1
06-02-2005
dannysullivan
Editor, SearchEngineLand.com (Info, Great Columns & Daily Recap Of Search News!)
Join Date: May 2004
Location: Search Engine Land
Posts: 2,091
Google Sitemaps Now Accepting Web Page Feeds

Google has opened a new Google Sitemaps program allowing site owners to feed pages for inclusion in Google's web index. Participation is free, but inclusion isn't guaranteed. Google hopes the new system will help it better gather pages than traditional crawling alone allows. Feeds also let site owners indicate how often pages change or should be revisited.

On the SEW Blog, New "Google Sitemaps" Web Page Feed Program has a Q&A on the new program with Shiva Shivakumar, engineering director and the technical lead for Google Sitemaps.

Still have more questions or comments? The Google Sitemaps team will be taking questions and responding in this thread.
#2
06-02-2005
rustybrick
Join Date: Jun 2004
Location: New York, USA
Posts: 2,803
You're kidding me!

I think it might be important to just clarify that this does not help with ranking a web page. It just helps get your page indexed. Correct?

What are the other benefits?
#3
06-02-2005
rustybrick
Join Date: Jun 2004
Location: New York, USA
Posts: 2,803
Sorry, ranking is clarified in the FAQs.
#4
06-02-2005
toprank
Lee Odden
Join Date: Jul 2004
Location: Minneapolis
Posts: 79
This has to be good news for sites with dynamic URLs that are not always crawled properly.

I suppose this applies to blogs as well as any other type of web site, correct?

Last edited by toprank : 06-02-2005 at 11:55 PM. Reason: add
#5
06-03-2005
Dominic
Member
Join Date: Jul 2004
Location: Australia
Posts: 19
Our sites are crawled fine at the moment, but I can see how this will save Google money on a larger scale. So it would be nice to get some benefit for the work involved in setting it up.

If I do use this, I think I will open a new Gmail/Google account for each of my sites. Otherwise I'm putting my hand up as the owner of all our sites.
#6
06-03-2005
projectphp
What The World, Needs Now, Is Love, Sweet Love
Join Date: Jun 2004
Location: Sydney, Australia
Posts: 452
WOW!!! That is truly some nice work. It can only help with the whole indexing deal, IMHO. What a great initiative. Bravo!!!!
#7
06-03-2005
shor
aka Lucas Ng. Aussie online marketer.
Join Date: Aug 2004
Posts: 163
In other words, this is a natural progression (or evolution) of the robots.txt file. Before, we could only specify which pages should not be indexed or followed; now we can tell G exactly which pages we want crawled.

In the future I believe sitemap.xml will be integrated into the standard web server setup. That is a shame really, since it takes the fun out of creating a clean and easily spidered site when competitors with messy, uncrawlable sites get indexed thanks to their sitemap.xml.
#8
06-03-2005
mcanerin
Join Date: Jun 2004
Location: Calgary, Alberta, Canada
Posts: 1,569
It doesn't appear to be clear if this is an either/or thing.

Scenario: I make a sitemap.xml for a website, but later on there is a revamping of the site and the file is no longer valid, and someone forgets to update it.

I'm assuming that G will continue to spider the site naturally, and use the sitemap.xml as an additional set of links, but that's actually not very clear from the FAQ, that I can see. What I do see are a lot of mentions on how to make sure that the sitemap.xml is always up to date, including a cron program that would do it.

This means that it's possible that G will not bother trying to spider anything not in the sitemap.xml? I hope that's not the case.

My questions are as follows:

1. What is the default behavior for the spidering of the website - sitemap.xml if it exists and normal spidering as a fallback? If so, what happens if the sitemap.xml is wrong for some reason? The FAQ implies that this will result in incomplete indexing, which further implies that natural spidering does NOT happen, it just leaves.

2. What happens if there is a conflict between the robots.txt (or robots meta) and the sitemap.xml? Who wins? The refusal or the specific invitation?

3. The sitemap.xml is clearly aimed at only one domain. What happens if someone has multiple parked domains? Will natural spidering happen on the others, resulting in possible duplication issues?

4. Related to the "one domain" issue, it's common for people to park a country specific domain on a site in order to let a search engine know that this is (for example) a .ca site. If such a site uses this, what will the result be as far as geolocation is concerned?

That's all I can think of this late at night - sorry if the answer was in front of me and I missed it.

Ian
__________________
International SEO
#9
06-03-2005
Nacho
Join Date: Jun 2004
Location: La Jolla, CA
Posts: 1,385
Kiss that 1% bye bye

I wonder how long it will take for Y! to follow?
#10
06-03-2005
dyn4mik3
Michael Nguyen
Join Date: Feb 2005
Location: Riverside,CA
Posts: 52
dyn4mik3 is on a distinguished road
For anyone running a WordPress site: I whipped up a PHP script to generate a Google Sitemap from all of their posts. I'm not sure whether it will work on all WordPress sites, since I only have access to mine.

If anyone can improve on this please do!
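
A sitemap of this kind is essentially a list of URLs wrapped in XML. The following is a minimal Python sketch of the idea, not the PHP script mentioned above. The post URLs are made up, and the namespace is the later sitemaps.org 0.9 one, which is an assumption and may differ from what the program used at launch.

```python
import xml.etree.ElementTree as ET

# Assumption: the sitemaps.org 0.9 namespace; the 2005 program may have used another.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(posts):
    """Return sitemap XML (as a string) for an iterable of post dicts.

    Each dict needs a "loc" key; "lastmod" and "changefreq" are optional.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for post in posts:
        url = ET.SubElement(urlset, "url")
        for tag in ("loc", "lastmod", "changefreq"):
            if tag in post:
                # ElementTree escapes characters such as "&" when serializing.
                ET.SubElement(url, tag).text = post[tag]
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical posts; a real script would read these from the blog's database.
posts = [
    {"loc": "http://example.com/2005/06/hello/",
     "lastmod": "2005-06-02", "changefreq": "monthly"},
    {"loc": "http://example.com/2005/06/second/"},
]
sitemap_xml = build_sitemap(posts)
print(sitemap_xml)
```

Building the tree with an XML library rather than string concatenation keeps URLs with query strings valid without manual escaping.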
#11
06-03-2005
projectphp
What The World, Needs Now, Is Love, Sweet Love
Join Date: Jun 2004
Location: Sydney, Australia
Posts: 452
I dunno, Nacho. There is still a lot of value in a trusted feed, not least of which is content control. The cost may make it prohibitive, but the control is certainly an attractive feature...

Last edited by projectphp : 06-03-2005 at 03:27 AM.
#12
06-03-2005
GoogleGuy
Unofficial Representative
Join Date: Jul 2004
Location: Mountain View, CA
Posts: 80
Glad you like it, projectphp! I'm excited that this provides a new way for webmasters to help search engines crawl their site. Google is introducing the technique, plus providing open-source code to generate the sitemaps, but there's no reason that other search engines can't use these sitemaps to improve their crawl coverage as well.

rustybrick, it's true that this won't boost or help a page with ranking. It's mainly intended to make it easier for webmasters to list the pages that they think are helpful for being crawled.
#13
06-03-2005
GoogleGuy
Unofficial Representative
Join Date: Jul 2004
Location: Mountain View, CA
Posts: 80
dyn4mik3, that was fast! Nice job. I love how your post at http://www.socialpatterns.com/search...ith-wordpress/ includes a link to your XML file. When I hear the word "XML" my eyes usually glaze over, but your example sitemap file makes it really clear how simple and fast it is to create one of these files.

You made me go figure out the reputation thingie just so I could recommend your post.
#14
06-03-2005
GoogleGuy
Unofficial Representative
Join Date: Jul 2004
Location: Mountain View, CA
Posts: 80
shor, robots.txt has been around since 1996; it's a good file format, but I do think it's time to give site owners easier ways to communicate with search engines.
#15
06-03-2005
GoogleGuy
Unofficial Representative
Join Date: Jul 2004
Location: Mountain View, CA
Posts: 80
Good questions, mcanerin. In general, I would expect the sitemap.xml to augment the normal discovery that we have in our standard web crawl. I'll ping Shiva about the more specific questions and try to get someone (me or someone else) to chime in with answers tomorrow.
#16
06-03-2005
semanticist
Member
Join Date: Mar 2005
Posts: 11
GG,

Can hardly blame you for coming over to talk about this; that WMW thread was getting pretty painful to watch.

Anyway, this whole XML thing is a great idea. Why spend all the resources on crawling technology when webmasters can (and probably would prefer to) list the important URLs anyway? Content management systems are doing a great job of burying some really good content, and as a result, crawling technology will always be playing catch-up. This goes a long way toward fixing that.

And the self-rating system for page prioritization is a great idea too.
#17
06-03-2005
SitemapsAdvisor
Official Google Sitemaps Team Member
Join Date: Jun 2005
Posts: 2
Good questions, mcanerin.

Quote:
Originally Posted by mcanerin
It doesn't appear to be clear if this is an either/or thing.

Scenario: I make a sitemap.xml for a website, but later on there is a revamping of the site and the file is no longer valid, and someone forgets to update it.

I'm assuming that G will continue to spider the site naturally, and use the sitemap.xml as an additional set of links, but that's actually not very clear from the FAQ, that I can see. What I do see are a lot of mentions on how to make sure that the sitemap.xml is always up to date, including a cron program that would do it.

1. What is the default behavior for the spidering of the website - sitemap.xml if it exists and normal spidering as a fallback? If so, what happens if the sitemap.xml is wrong for some reason? The FAQ implies that this will result in incomplete indexing, which further implies that natural spidering does NOT happen, it just leaves.
Exactly. This program is a complement to, not a replacement for, the regular crawl. The benefit of Sitemaps is twofold:
-- For links we already know about through our regular spidering, we plan to use the metadata you supply (e.g., lastmod date, changefreq, etc.) to improve how we crawl your site.
-- For links we don't know about, we plan to use the additional links you supply to increase our crawl coverage.
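
The two cases above can be pictured as a simple merge. This is only an illustrative Python sketch of the described behaviour, not how any search engine actually implements it; `plan_crawl` and the sample URLs are hypothetical.

```python
def plan_crawl(crawled_urls, sitemap_entries):
    """Combine URLs found by regular crawling with sitemap entries.

    crawled_urls: set of URLs the crawler already knows about.
    sitemap_entries: dict mapping URL -> metadata dict (e.g. lastmod, changefreq).
    Returns a dict mapping every URL to its metadata (empty if none supplied).
    """
    # The regular crawl continues regardless of what the sitemap says.
    plan = {url: {} for url in crawled_urls}
    for url, meta in sitemap_entries.items():
        # Known URL: attach the supplied hints. Unknown URL: add it to the plan.
        plan.setdefault(url, {}).update(meta)
    return plan


# Hypothetical data for illustration.
crawled = {"http://example.com/", "http://example.com/about"}
sitemap_entries = {
    "http://example.com/about": {"lastmod": "2005-06-01"},
    "http://example.com/archive?page=7": {"changefreq": "weekly"},
}
plan = plan_crawl(crawled, sitemap_entries)
for url in sorted(plan):
    print(url, plan[url])
```

Note that a stale or incomplete sitemap removes nothing here: URLs found by the regular crawl stay in the plan whether or not the sitemap lists them.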

Quote:
Originally Posted by mcanerin

2. What happens if there is a conflict between the robots.txt (or robots meta) and the sitemap.xml? Who wins? The refusal or the specific invitation?
Robots.txt is the gatekeeper, so we respect that: the refusal wins over the invitation.
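
Site owners who want to check their own sitemap against their robots.txt before submitting could do something like the following. It is a rough Python sketch using the standard library's `urllib.robotparser`; the robots.txt content, the URLs, and the helper name `allowed_sitemap_urls` are made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def allowed_sitemap_urls(robots_txt, sitemap_urls, user_agent="*"):
    """Return only the sitemap URLs that robots.txt allows user_agent to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in sitemap_urls if parser.can_fetch(user_agent, url)]


# Hypothetical robots.txt and sitemap URLs.
robots_txt = """User-agent: *
Disallow: /private/
"""
urls = [
    "http://example.com/index.html",
    "http://example.com/private/report.html",
]
allowed = allowed_sitemap_urls(robots_txt, urls)
print(allowed)
```

Anything filtered out by this check is a URL that the sitemap invites a crawler to but robots.txt refuses.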

Quote:
Originally Posted by mcanerin
3. The sitemap.xml is clearly aimed at only one domain. What happens if someone has multiple parked domains? Will natural spidering happen on the others, resulting in possible duplication issues?
We need a simple authentication mechanism so we can trust that the sitemaps submitted are for a specific path, host, or domain. So we went with path-based authentication. More about the restrictions here:

https://www.google.com/webmasters/si...itemapLocation
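
As a rough picture of what a path-based restriction can look like, here is a small Python sketch. It assumes the rule is that a sitemap may only list URLs on the same scheme and host, at or below the sitemap's own directory; the linked page has the actual restrictions, and `in_sitemap_scope` is a hypothetical helper.

```python
from urllib.parse import urlsplit


def in_sitemap_scope(sitemap_url, page_url):
    """Check whether page_url falls under the directory that holds the sitemap.

    Assumed rule: same scheme and host, and the page path starts with the
    sitemap's directory path.
    """
    sitemap, page = urlsplit(sitemap_url), urlsplit(page_url)
    # "/catalog/sitemap.xml" -> "/catalog/", "/sitemap.xml" -> "/"
    base_dir = sitemap.path.rsplit("/", 1)[0] + "/"
    same_site = (sitemap.scheme, sitemap.netloc) == (page.scheme, page.netloc)
    return same_site and page.path.startswith(base_dir)


sitemap = "http://example.com/catalog/sitemap.xml"
print(in_sitemap_scope(sitemap, "http://example.com/catalog/show?item=1"))
print(in_sitemap_scope(sitemap, "http://example.com/images/logo.png"))
print(in_sitemap_scope(sitemap, "http://example.de/catalog/show?item=1"))
```

Under this assumption, a sitemap placed at the root of a host covers the whole host, which is why the root is the usual place to put it.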

With regard to duplication issues at crawl time: this is an internal architectural issue that different search engine crawlers may choose to handle differently. For example, a crawler may do some batching so that it crawls a URL only once within some time period.


Quote:
Originally Posted by mcanerin
4. Related to the "one domain" issue, it's common for people to park a country specific domain on a site in order to let a search engine know that this is (for example) a .ca site. If such a site uses this, what will the result be as far as geolocation is concerned?
You can submit sitemaps for each of your hosts. For example, http://example.com/sitemap.xml vs. http://example.de/sitemap.xml vs. http://example.fr/sitemap.xml. Many such sites have different content (e.g., are in different languages). So I think crawling each of them and appropriately geolocating them will make sense. In the case of mirror servers, submitting one sitemap will be sufficient.
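
For a site that spans several country domains, splitting a URL list into one group per host is the first step toward one sitemap per host. A minimal Python sketch, with made-up URLs and a hypothetical helper name:

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_host(urls):
    """Group URLs by host so each host can get its own sitemap file."""
    groups = defaultdict(list)
    for url in urls:
        groups[urlsplit(url).netloc].append(url)
    return dict(groups)


# Hypothetical URLs on three country-specific hosts.
urls = [
    "http://example.com/products",
    "http://example.de/produkte",
    "http://example.fr/produits",
    "http://example.de/kontakt",
]
per_host = group_by_host(urls)
for host, host_urls in per_host.items():
    print(host, host_urls)
```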
#18
06-03-2005
dyn4mik3
Michael Nguyen
Join Date: Feb 2005
Location: Riverside,CA
Posts: 52
dyn4mik3 is on a distinguished road
I have a concern:

How does Google feel about being "pinged" on every site update? There is no mention of whether there is a penalty if we resubmit our sitemap too often.

It would be real easy to write some code that "pings" Google with a new sitemap every single time a new page is created. I know that the crawler comes by in intervals, but will there be a restriction as to how often we can resubmit?
#19
06-03-2005
dyn4mik3
Michael Nguyen
Join Date: Feb 2005
Location: Riverside,CA
Posts: 52
dyn4mik3 is on a distinguished road
Quote:
Originally Posted by GoogleGuy
dyn4mik3, that was fast! Nice job. I love how your post at http://www.socialpatterns.com/search...ith-wordpress/ includes a link to your XML file. When I hear the words "XML" my eyes usually glaze over, but when you look at your example sitemap file, it makes it really clear how simple/fast it is to create one of these files.

You made me go to figure out the reputation thingie just so I could recommend your post.
Thanks - yeah xml isn't that bad at all.

Yay, I think I'm moving down a distinguished road.
#20
06-03-2005
DaveN
Join Date: Jun 2004
Location: North Yorkshire
Posts: 442
Hey GG, I will look closely at this... but for clients I think it's great. We have been using RSS and converting it to statics for a while on some sites, but this should make things a bit easier. Now if you could give a boost to sites that use the XML, I think it will really, really take off.

Will get our "Inhouse Customer" CMS system to auto-create the XML next week.

DaveN