#1
Google Sitemaps Now Accepting Web Page Feeds
Google has opened a new Google Sitemaps program allowing site owners to feed pages for inclusion in Google's web index. Participation is free, but inclusion isn't guaranteed. Google hopes the new system will help it better gather pages than traditional crawling alone allows. Feeds also let site owners indicate how often pages change or should be revisited.
On the SEW Blog, New "Google Sitemaps" Web Page Feed Program has a Q&A on the new program with Shiva Shivakumar, engineering director and the technical lead for Google Sitemaps. Still have more questions or comments? The Google Sitemaps team will be taking questions and responding in this thread.
#2
You're kidding me!
I think it might be important to just clarify that this does not help with ranking a web page. It just helps get your page indexed. Correct? What are the other benefits?
#3
Sorry, ranking is clarified in the FAQs.
#4
This has to be good news for sites with dynamic urls that are not always crawled properly.
I suppose this applies to blogs as well as any other type of web site, correct?
Last edited by toprank : 06-02-2005 at 11:55 PM. Reason: add
#5
Our sites are crawled fine at the moment, but I can see how this will save Google money on a larger scale. So it would be nice to get some benefit for the work involved in setting it up.
If I do use this, I think I will open a new Gmail / Google account for each one of my sites. Otherwise I'm putting my hand up as the owner of all our sites.
#6
WOW!!! That is truly some nice work. It can only help with the whole indexing deal, IMHO. What a great initiative. Bravo!!!!
#7
In other words, this is a natural progression (or evolution) of the robots.txt file. Before, we could only set types of pages not to be indexed or followed, while now we can tell G exactly which pages we want crawled.
In the future I believe the sitemap.xml will be integrated into the standard webserver setup. Which is a shame really, since it takes the fun out of creating a clean and easily spidered site when our competitors who have messy, uncrawlable sites are indexed due to their sitemap.xml.
#8
It doesn't appear to be clear if this is an either/or thing.
Scenario: I make a sitemap.xml for a website, but later on there is a revamping of the site and the file is no longer valid, and someone forgets to update it.
I'm assuming that G will continue to spider the site naturally, and use the sitemap.xml as an additional set of links, but that's actually not very clear from the FAQ, that I can see. What I do see are a lot of mentions on how to make sure that the sitemap.xml is always up to date, including a cron program that would do it. Does this mean that it's possible that G will not bother trying to spider anything not in the sitemap.xml? I hope that's not the case.
My questions are as follows:
1. What is the default behavior for the spidering of the website - sitemap.xml if it exists and normal spidering as a fallback? If so, what happens if the sitemap.xml is wrong for some reason? The FAQ implies that this will result in incomplete indexing, which further implies that natural spidering does NOT happen, it just leaves.
2. What happens if there is a conflict between the robots.txt (or robots meta) and the sitemap.xml? Who wins? The refusal or the specific invitation?
3. The sitemap.xml is clearly aimed at only one domain. What happens if someone has multiple parked domains? Will natural spidering happen on the others, resulting in possible duplication issues?
4. Related to the "one domain" issue, it's common for people to park a country specific domain on a site in order to let a search engine know that this is (for example) a .ca site. If such a site uses this, what will the result be as far as geolocation is concerned?
That's all I can think of this late at night - sorry if the answer was in front of me and I missed it.
Ian
__________________
International SEO
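On the robots.txt question: the thread does not settle which file wins, so one cautious option for a site owner is to avoid the conflict altogether by checking the sitemap's URLs against robots.txt before publishing. The sketch below uses Python's standard `urllib.robotparser`; the function name, the sample robots.txt, and the example.com URLs are all made up for illustration, and this says nothing about how any crawler actually resolves the conflict.

```python
from typing import List
from urllib.robotparser import RobotFileParser


def find_conflicts(robots_txt: str, sitemap_urls: List[str],
                   user_agent: str = "*") -> List[str]:
    """Return the sitemap URLs that robots.txt disallows for user_agent."""
    parser = RobotFileParser()
    # parse() takes the robots.txt content as an iterable of lines.
    parser.parse(robots_txt.splitlines())
    return [url for url in sitemap_urls
            if not parser.can_fetch(user_agent, url)]


# Hypothetical sample data, not taken from the thread.
ROBOTS = """\
User-agent: *
Disallow: /private/
"""
URLS = [
    "http://www.example.com/",
    "http://www.example.com/private/report.html",
]

if __name__ == "__main__":
    for url in find_conflicts(ROBOTS, URLS):
        print("listed in sitemap but blocked by robots.txt:", url)
```

Anything this reports is a URL the site is both inviting and refusing at the same time, which is probably worth fixing on the site's side whatever the answer to "who wins" turns out to be.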
#9
#10
For anyone that is running a WordPress site, I whipped up a PHP script to generate a Google Sitemap from all their posts. I'm not sure if it will work on all WordPress sites since I only have access to mine.
If anyone can improve on this please do!
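For readers who have not seen one of these files, here is a rough sketch of the same idea in Python (this is not the PHP script mentioned above). It turns a list of posts into a sitemap with `loc`, `lastmod`, `changefreq` and `priority` entries. The post data and example.com URLs are invented, and the namespace is the one from the later sitemaps.org protocol, which is an assumption; the schema Google published at the time of this thread may use a different one.

```python
import xml.etree.ElementTree as ET

# Assumption: namespace from the later sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Optional per-URL fields, emitted in this order when present.
OPTIONAL_FIELDS = ("lastmod", "changefreq", "priority")


def build_sitemap(posts):
    """Build sitemap XML text from a list of dicts with a 'loc' key."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for post in posts:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = post["loc"]
        for field in OPTIONAL_FIELDS:
            if field in post:
                ET.SubElement(url, field).text = str(post[field])
    # ElementTree escapes characters such as '&' in URLs for us.
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical posts, for illustration only.
    posts = [
        {"loc": "http://www.example.com/?p=1&preview=0",
         "lastmod": "2005-06-03", "changefreq": "weekly", "priority": "0.8"},
        {"loc": "http://www.example.com/about/"},
    ]
    print(build_sitemap(posts))
```

In a real blog the `posts` list would come from the database, and the output would be written to a file such as sitemap.xml on the web server.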
#11
I dunno Nacho. Still a lot of value in a trusted feed. Not least of which is content control. The cost may make it prohibitive, but the control is certainly an attractive feature...
Last edited by projectphp : 06-03-2005 at 03:27 AM.
#12
Glad you like it, projectphp! I'm excited that this provides a new way for webmasters to help search engines crawl their site. Google is introducing the technique, plus providing open-source code to generate the sitemaps, but there's no reason that other search engines can't use these sitemaps to improve their crawl coverage as well.
rustybrick, it's true that this won't boost or help a page with ranking. It's mainly intended to make it easier for webmasters to list the pages that they think are helpful for being crawled.
#13
dyn4mik3, that was fast! Nice job.
I love how your post at http://www.socialpatterns.com/search...ith-wordpress/ includes a link to your XML file. When I hear the word "XML" my eyes usually glaze over, but when you look at your example sitemap file, it makes it really clear how simple/fast it is to create one of these files. You made me go figure out the reputation thingie just so I could recommend your post.
#14
shor, robots.txt has been around since 1996; it's a good file format, but I do think it's time to give easier ways for site owners to communicate with search engines.
#15
Good questions, mcanerin. In general, I would expect the sitemap.xml to augment the normal discovery that we have in our standard web crawl. I'll ping Shiva about the more specific questions and try to get someone (me or someone else) to chime in with answers tomorrow.
#16
GG,
Can hardly blame you for coming over to talk about this; that WMW thread was getting pretty painful to watch. Anyway, this whole XML thing is a great idea. Why spend all the resources on crawling technology when webmasters can (and probably would prefer to) list the important URLs anyway? Content management systems are doing a great job of burying some really good content, and as a result, crawling technology will always be playing catch-up. This goes a long way toward fixing that. And the self-rating system for page prioritization is a great idea too.
#17
Good questions, mcanerin.
Quote:
-- For links we already know about through our regular spidering, we plan to use the metadata you supply (e.g., lastmod date, changefreq, etc.) to improve how we crawl your site.
-- For the links we don't know about, we plan to use the additional links you supply to increase our crawl coverage.
Quote:
Quote:
https://www.google.com/webmasters/si...itemapLocation
Wrt duplication issues at crawl time -- this is an internal architectural issue that different search engine crawlers may choose to handle differently. For example, a crawler may do some batching, so that it crawls a URL only once within some time period.
Quote:
#18
I have a concern:
How does Google feel about being "pinged" on every site update? There is no mention if there is a penalty if we resubmit our sitemap too often. It would be real easy to write some code that "pings" Google with a new sitemap every single time a new page is created. I know that the crawler comes by in intervals, but will there be a restriction as to how often we can resubmit?
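Until there is an official answer, one way to stay on the safe side is to batch changes and resubmit at most once per interval instead of on every new page. Here is a minimal sketch of that idea in Python. The class, the one-hour interval and the ping endpoint are all hypothetical placeholders (the thread does not give the real resubmission URL or any limit), and the code only builds the URL rather than sending anything.

```python
import time
from urllib.parse import urlencode

# Hypothetical placeholder endpoint, NOT the real resubmission URL.
PING_ENDPOINT = "http://search-engine.example/ping"


class PingThrottle:
    """Allow at most one sitemap resubmission per min_interval seconds."""

    def __init__(self, min_interval_seconds, clock=time.monotonic):
        self.min_interval = min_interval_seconds
        self.clock = clock          # injectable so it can be tested
        self.last_ping = None       # time of the last allowed ping
        self.dirty = False          # True when pages changed since then

    def page_changed(self):
        """Record that the sitemap has something new to report."""
        self.dirty = True

    def next_ping_url(self, sitemap_url):
        """Return the ping URL if a resubmission is due, else None."""
        if not self.dirty:
            return None
        now = self.clock()
        if self.last_ping is not None and now - self.last_ping < self.min_interval:
            return None             # too soon; keep the change pending
        self.last_ping = now
        self.dirty = False
        return PING_ENDPOINT + "?" + urlencode({"sitemap": sitemap_url})


if __name__ == "__main__":
    throttle = PingThrottle(min_interval_seconds=3600)  # assumed interval
    throttle.page_changed()
    print(throttle.next_ping_url("http://www.example.com/sitemap.xml"))
```

Changes made during the quiet period are not lost; they stay pending and go out with the next allowed ping.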
#19
Quote:
Yay, I think I'm moving down a distinguished road.
#20
Hey GG, I will look closely at this
... but for clients I think it's great. We have been using RSS and converting it to statics for a while on some sites, but this should make things a bit easier.... Now if you could give a boost to sites that use the XML, I think it will really, really take off.
Will get our "Inhouse Customer" CMS system to auto-create the XML next week.
DaveN