Google Sitemaps Now Accepting Web Page Feeds
dannysullivan
06-02-2005, 08:58 PM
Google has opened a new Google Sitemaps (https://www.google.com/webmasters/sitemaps/) program allowing site owners to feed pages for inclusion in Google's web index. Participation is free, but inclusion isn't guaranteed. Google hopes the new system will help it better gather pages than traditional crawling alone allows. Feeds also let site owners indicate how often pages change or should be revisited.
On the SEW Blog, New "Google Sitemaps" Web Page Feed Program (http://blog.searchenginewatch.com/blog/050602-195224) has a Q&A on the new program with Shiva Shivakumar, engineering director and the technical lead for Google Sitemaps.
Still have more questions or comments? The Google Sitemaps team will be taking questions and responding in this thread.
rustybrick
06-02-2005, 09:02 PM
You're kidding me!
I think it might be important to just clarify that this does not help with ranking a web page. It just helps get your page indexed. Correct?
What are the other benefits?
rustybrick
06-02-2005, 09:04 PM
Sorry, ranking is clarified in the FAQs (https://www.google.com/webmasters/sitemaps/docs/en/faq.html#rank).
toprank
06-02-2005, 11:53 PM
This has to be good news for sites with dynamic urls that are not always crawled properly.
I suppose this applies to blogs as well as any other type of web site, correct?
Dominic
06-03-2005, 12:25 AM
Our sites are crawled fine at the moment, but I can see how this will save Google money on a larger scale. So it would be nice to get some benefit for the work involved in setting it up.
If I do use this I think I will open a new gmail / google account for each one of my sites. Otherwise I'm putting my hand up as the owner of all our sites.
projectphp
06-03-2005, 12:31 AM
WOW!!! That is truly some nice work. It can only help with the whole indexing deal, IMHO. What a great initiative. Bravo!!!!
In other words, this is a natural progression (or evolution) of the robots.txt file. Before, we could only set types of pages not to be indexed or followed, while now we can tell G exactly which pages we want crawled.
In the future I believe the sitemap.xml will be integrated into the standard webserver setup. Which is a shame really, since it takes the fun out of creating a clean and easily spidered site when our competitors who have messy, uncrawlable sites are indexed due to their sitemap.xml.
mcanerin
06-03-2005, 02:16 AM
It doesn't appear to be clear if this is an either/or thing.
Scenario: I make a sitemap.xml for a website, but later on there is a revamping of the site and the file is no longer valid, and someone forgets to update it.
I'm assuming that G will continue to spider the site naturally, and use the sitemap.xml as an additional set of links, but that's actually not very clear from the FAQ, that I can see. What I do see are a lot of mentions on how to make sure that the sitemap.xml is always up to date, including a cron program that would do it.
This means that it's possible that G will not bother trying to spider anything not in the sitemap.xml? I hope that's not the case.
My questions are as follows:
1. What is the default behavior for the spidering of the website - sitemap.xml if it exists and normal spidering as a fallback? If so, what happens if the sitemap.xml is wrong for some reason? The FAQ implies that this will result in incomplete indexing, which further implies that natural spidering does NOT happen, it just leaves.
2. What happens if there is a conflict between the robots.txt (or robots meta) and the sitemap.xml? Who wins? The refusal or the specific invitation?
3. The sitemap.xml is clearly aimed at only one domain. What happens if someone has multiple parked domains? Will natural spidering happen on the others, resulting in possible duplication issues?
4. Related to the "one domain" issue, it's common for people to park a country specific domain on a site in order to let a search engine know that this is (for example) a .ca site. If such a site uses this, what will the result be as far as geolocation is concerned?
That's all I can think of this late at night - sorry if the answer was in front of me and I missed it.
Ian
Nacho
06-03-2005, 02:49 AM
Kiss that 1% (http://searchmarketing.yahoo.com/srchsb/sse.php?mkt=us) bye bye :p
I wonder how long it will take for Y! to follow?
dyn4mik3
06-03-2005, 03:08 AM
For anyone that is running a Wordpress site, I whipped up a php script to generate a Google Sitemap from all their posts (http://www.socialpatterns.com/search-engine-optimization/google-sitemaps-with-wordpress/). I'm not sure if it will work on all Wordpress sites since I only have access to mine.
If anyone can improve on this please do!
projectphp
06-03-2005, 03:13 AM
I dunno Nacho. Still a lot of value in a trusted feed. Not least of which is content control. The cost may make it prohibitive, but the control is certainly an attractive feature...
GoogleGuy
06-03-2005, 04:12 AM
Glad you like it, projectphp! I'm excited that this provides a new way for webmasters to help search engines crawl their site. Google is introducing the technique, plus providing open-source code to generate the sitemaps, but there's no reason that other search engines can't use these sitemaps to improve their crawl coverage as well.
rustybrick, it's true that this won't boost or help a page with ranking. It's mainly intended to make it easier for webmasters to list the pages that they think are helpful for being crawled.
GoogleGuy
06-03-2005, 04:16 AM
dyn4mik3, that was fast! Nice job. :) I love how your post at http://www.socialpatterns.com/search-engine-optimization/google-sitemaps-with-wordpress/
includes a link to your XML file. When I hear the words "XML" my eyes usually glaze over, but when you look at your example sitemap file, it makes it really clear how simple/fast it is to create one of these files.
You made me go to figure out the reputation thingie just so I could recommend your post. :)
GoogleGuy
06-03-2005, 04:18 AM
shor, robots.txt has been around since 1996; it's a good file format, but I do think it's time to give easier ways for site owners to communicate with search engines..
GoogleGuy
06-03-2005, 04:23 AM
Good questions, mcanerin. In general, I would expect the sitemap.xml to augment the normal discovery that we have in our standard web crawl. I'll ping Shiva about the more specific questions and try to get someone (me or someone else) to chime in with answers tomorrow.
semanticist
06-03-2005, 04:36 AM
GG,
Can hardly blame you for coming over to talk about this; that WMW thread was getting pretty painful to watch.
Anyway, this whole XML thing is a great idea. Why spend all the resources on crawling technology when webmasters can (and probably would prefer to) list the important URLs anyway? Content management systems are doing a great job of burying some really good content, and as a result, crawling technology will always be playing catch-up. This goes a long way toward fixing that.
And the self-rating system for page prioritization is a great idea too.
SitemapsAdvisor
06-03-2005, 05:06 AM
Good questions, mcanerin.
It doesn't appear to be clear if this is an either/or thing.
Scenario: I make a sitemap.xml for a website, but later on there is a revamping of the site and the file is no longer valid, and someone forgets to update it.
I'm assuming that G will continue to spider the site naturally, and use the sitemap.xml as an additional set of links, but that's actually not very clear from the FAQ, that I can see. What I do see are a lot of mentions on how to make sure that the sitemap.xml is always up to date, including a cron program that would do it.
1. What is the default behavior for the spidering of the website - sitemap.xml if it exists and normal spidering as a fallback? If so, what happens if the sitemap.xml is wrong for some reason? The FAQ implies that this will result in incomplete indexing, which further implies that natural spidering does NOT happen, it just leaves.
Exactly. This program is a complement to, not a replacement of, the regular crawl. The benefit of Sitemaps is twofold:
-- For links we already know about through our regular spidering, we plan to use the metadata you supply (e.g., lastmod date, changefreq, etc.) to improve how we crawl your site.
-- For the links we don't know about, we plan to use the additional links you supply to increase our crawl coverage.
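To illustrate the metadata mentioned above, here is a minimal Python sketch that writes a sitemap with loc, lastmod, changefreq and priority entries. It is not the official generator; the function name build_sitemap is made up for this example, and the namespace URI is an assumption based on the 0.84 schema, so check it against the protocol page before relying on it.

```python
import xml.etree.ElementTree as ET

# Assumption: namespace of the Google Sitemaps 0.84 schema; verify against the protocol docs.
SITEMAP_NS = "http://www.google.com/schemas/sitemap/0.84"


def build_sitemap(entries):
    """Return sitemap XML as a string.

    Each entry is a dict with a required 'loc' key and optional
    'lastmod', 'changefreq' and 'priority' keys.
    """
    urlset = ET.Element("urlset", {"xmlns": SITEMAP_NS})
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        # Emit the child tags in a fixed order, skipping any that are missing.
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            if tag in entry:
                ET.SubElement(url, tag).text = str(entry[tag])
    return ET.tostring(urlset, encoding="unicode")


sitemap_xml = build_sitemap([
    {"loc": "http://example.com/", "lastmod": "2005-06-03", "changefreq": "daily"},
    {"loc": "http://example.com/about.html", "priority": 0.5},
])
```

Only loc is required in this sketch; the other tags are hints, which matches the description above of metadata being used to improve the crawl rather than to force it.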
2. What happens if there is a conflict between the robots.txt (or robots meta) and the sitemap.xml? Who wins? The refusal or the specific invitation?
Robots.txt is the gatekeeper, so we respect that.
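Since robots.txt wins, a site owner may want to drop disallowed URLs before they ever reach the sitemap. The sketch below uses Python's standard urllib.robotparser for that; filter_by_robots is a hypothetical helper, and this says nothing about how Google itself applies the rule.

```python
from urllib.robotparser import RobotFileParser


def filter_by_robots(urls, robots_txt, user_agent="Googlebot"):
    """Keep only the URLs that the given robots.txt text allows for user_agent."""
    parser = RobotFileParser()
    # parse() takes the robots.txt content as a list of lines.
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


robots_txt = "User-agent: *\nDisallow: /private/\n"
candidates = [
    "http://example.com/index.html",
    "http://example.com/private/report.html",
]
allowed = filter_by_robots(candidates, robots_txt)
```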
3. The sitemap.xml is clearly aimed at only one domain. What happens if someone has multiple parked domains? Will natural spidering happen on the others, resulting in possible duplication issues?
We need a simple authentication mechanism so we can trust the sitemaps submitted are for a specific path or host or domain. So we went with a path-based authentication. More here about the restrictions:
https://www.google.com/webmasters/sitemaps/docs/en/protocol.html#sitemapLocation
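As a rough illustration of the path-based restriction, the Python sketch below checks whether a page URL sits on the same host and at or below the directory that holds the sitemap. The function name allowed_in_sitemap is made up for this example, and the rule as coded here is a simplified reading of the restrictions on the linked page, not a copy of Google's check.

```python
import posixpath
from urllib.parse import urlsplit


def allowed_in_sitemap(sitemap_url, page_url):
    """Return True if page_url is on the sitemap's host and under its directory."""
    sitemap, page = urlsplit(sitemap_url), urlsplit(page_url)
    # Scheme and host must match (host compared case-insensitively).
    if (sitemap.scheme, sitemap.netloc.lower()) != (page.scheme, page.netloc.lower()):
        return False
    base_dir = posixpath.dirname(sitemap.path)
    if not base_dir.endswith("/"):
        base_dir += "/"
    # Treat an empty path (http://example.com) as the root page.
    return (page.path or "/").startswith(base_dir)
```

Under this reading, a sitemap at http://example.com/catalog/sitemap.xml can list http://example.com/catalog/show?item=1 but not http://example.com/images/a.html, while a sitemap at the root can list both.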
Wrt duplication issues at crawl time -- this is an internal architectural issue that different search engine crawlers may choose to handle differently. For example a crawler may do some batching, so they crawl a URL only once within some time period.
4. Related to the "one domain" issue, it's common for people to park a country specific domain on a site in order to let a search engine know that this is (for example) a .ca site. If such a site uses this, what will the result be as far as geolocation is concerned?
You can submit sitemaps for each of your hosts. For example, http://example.com/sitemap.xml vs example.de/sitemap.xml vs example.fr/sitemap.xml. Many such sites have different content (e.g., are in different languages). So I think crawling each of them and appropriately geolocating them will make sense. In the case of mirror servers, submitting one sitemap will be sufficient.
dyn4mik3
06-03-2005, 05:07 AM
I have a concern:
How does Google feel about being "pinged" on every site update? There is no mention if there is a penalty if we resubmit our sitemap too often.
It would be real easy to write some code that "pings" Google with a new sitemap every single time a new page is created. I know that the crawler comes by in intervals, but will there be a restriction as to how often we can resubmit?
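One way to avoid pinging on every single page creation is to throttle on the site's side. The Python sketch below only builds the ping URL and does no network I/O. PingThrottle and the one-hour interval are hypothetical choices for illustration (nothing in this thread states an official limit), and the ping endpoint is an assumption that should be checked against the Sitemaps docs.

```python
import time
from urllib.parse import urlencode

# Assumption: resubmission endpoint; confirm against the Google Sitemaps documentation.
PING_ENDPOINT = "http://www.google.com/webmasters/sitemaps/ping"
# Hypothetical default; no official minimum interval is given in this thread.
MIN_INTERVAL_SECONDS = 3600


class PingThrottle:
    """Hand out a ping URL at most once per min_interval seconds."""

    def __init__(self, min_interval=MIN_INTERVAL_SECONDS, clock=time.time):
        self.min_interval = min_interval
        self.clock = clock  # injectable so the logic can be tested without waiting
        self.last_ping = None

    def ping_url_if_due(self, sitemap_url):
        """Return the URL to request, or None if the last ping was too recent."""
        now = self.clock()
        if self.last_ping is not None and now - self.last_ping < self.min_interval:
            return None
        self.last_ping = now
        return PING_ENDPOINT + "?" + urlencode({"sitemap": sitemap_url})
```

A forum that changes every few seconds could call ping_url_if_due after each new post and only make the HTTP request when it returns a URL.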
dyn4mik3
06-03-2005, 05:22 AM
dyn4mik3, that was fast! Nice job. :) I love how your post at http://www.socialpatterns.com/search-engine-optimization/google-sitemaps-with-wordpress/
includes a link to your XML file. When I hear the words "XML" my eyes usually glaze over, but when you look at your example sitemap file, it makes it really clear how simple/fast it is to create one of these files.
You made me go to figure out the reputation thingie just so I could recommend your post. :)
Thanks - yeah xml isn't that bad at all. :)
Yay, I think I'm moving down a distinguished road.
DaveN
06-03-2005, 06:21 AM
Hey GG, I will look closely at this ;)... but for clients I think it's great. We have been using RSS and converting it to statics for a while on some sites, but this should make things a bit easier.... Now if you could give sites a boost that use the XML, I think it will really, really take off :)
Will get our "Inhouse Customer" CMS system to auto-create the XML next week.
DaveN
Marketing Guy
06-03-2005, 06:33 AM
Great new addition, particularly for new sites! :)
Any plans to add some guides to creating sitemaps for common dynamic content packages, such as forum, blog, etc. software (like dyn4mik3 did for Wordpress)? Given that Googlebot has some issues with session IDs, for example, a phpBB sitemap would surely be a huge boost to forum owners. :) I'm sure the respective communities for commonly available software will be quick in knocking something together, but it would be nice if G could as well! ;)
MG
Jorge
06-03-2005, 07:36 AM
It sounds great, but I don't see how it is not going to be a major change for some, good for some, bad for others. I myself have a medium sized site with a couple hundred pages in the G index, but if I can get the other 1500 I should see some changes in SERPs.
critter
06-03-2005, 07:54 AM
Does the sitemap have to be an .XLS (XML) file?
Cheers
Critter
dannysullivan
06-03-2005, 07:54 AM
I want a non-Python tool that people can use to generate stuff. I read about the tool, thought "cool," then went to download and fire it up. I quickly messaged the developers that they needed to look at it instead.
How about something on Google itself that would let me plug in my domain, then get back a list of all the URLs you already have spidered and what the current frequency is like, that I could download, say, into Excel. I'd love to see things like:
+ URL
+ Page Title
+ Google's current frequency of crawl
+ Google's impression of how often it gets modified
Then I could sort and say, "Woah, that page changes a lot more than that," then use the recommended revisit column to push the recommended frequency of revisits upward.
That's the challenge right now. We've got thousands of pages here, and knowing where to begin in setting priorities is a big challenge. My gut feeling was to do nothing and trust Google is probably guessing right. But that type of info above would make it easier to know if it is not.
The tool could also show pages Google knows about but hasn't indexed. Then we could say, yes, please get those too -- x, y and z pages are really important.
Alternatively, I'd love to see Google or someone create a spidering tool that really does grab all your URLs that are spiderable, so you could easily add more if you don't have access to python or some type of programmer. I'm thinking about the smaller site owners who still have a lot of content but who can't program. And the blogging tool whipped up above sounds great.
Really interesting to see what others may do, as well. If I had to start, I'd be looking at log data and taking my most popular pages and ensuring they are being refreshed regularly and marked for top priority. Even better if my tool could make a file like this for me, for export.
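The log-data idea above could be sketched in Python like this: take hit counts per URL and scale them into the 0.0 to 1.0 priority range. The function name priorities_from_hits, the linear scaling and the 0.1 floor are all arbitrary choices for illustration, not anything Google recommends.

```python
def priorities_from_hits(hits, floor=0.1):
    """Map {url: hit_count} to {url: priority}, scaled linearly against the busiest page.

    The most popular page gets 1.0; nothing drops below `floor`.
    """
    if not hits:
        return {}
    top = max(hits.values())
    if top <= 0:
        # No traffic data worth ranking on: fall back to a neutral priority.
        return {url: 0.5 for url in hits}
    return {url: round(max(floor, count / top), 1) for url, count in hits.items()}


log_hits = {
    "http://example.com/": 1000,
    "http://example.com/news.html": 500,
    "http://example.com/old-page.html": 10,
}
priorities = priorities_from_hits(log_hits)
```

The resulting values could be written into each entry's priority tag, and the same counts could just as easily drive the changefreq hint.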
I have a concern:
How does Google feel about being "pinged" on every site update? There is no mention if there is a penalty if we resubmit our sitemap too often.
It would be real easy to write some code that "pings" Google with a new sitemap every single time a new page is created. I know that the crawler comes by in intervals, but will there be a restriction as to how often we can resubmit?
QUESTION 1 ^^
Important question: I just wrote a sitemap for my forum, but there will be an update every few seconds, so what does Google think if I ping them for an update of my XML file every few minutes or seconds... what do you prefer?
Can't you automatically ask my sitemap if it's updated before you let your spider squeeze my site?
QUESTION 2 ^^
What are you doing against doorway networks with thousands of sites?
It's much easier now to add such spammer sites to your crawler timetable... or much easier to let you know :( - I don't think you have great ideas against it... - have you ever noticed Google AdSense-stuffed content networks that only exist to spider other networks? Or these Wikipedia clones... they now only need to add their complete URL+keyword database to you... and earn money with your AdSense program...
sootledir
06-03-2005, 09:08 AM
Great points, Danny.
It would be nice to be able to paste a URL in the search box and see details about the page. Things such as date first visited, spider frequency, etc. Nothing too controversial, but something helpful.
rustybrick
06-03-2005, 10:04 AM
will get our "Inhouse Customer" CMS system to auto create the XML next week.
As will I, for all our custom CMSs.
But they are all pretty SEF, so not sure how much it will help. Crawl Frequency...New pages...
You and I have other solutions for this, outside of Google Sitemaps.
krisval
06-03-2005, 10:47 AM
GG or SitemapsAdvisor,
How should we treat subdomains? Should we create a different site map for each subdomain or a single sitemap on the main domain?
ex. Should searchenginewatch.com have a different sitemap than forums.searchenginewatch.com
Thanks! And very much thanks for all of the Info.
andrewgoodman
06-03-2005, 12:11 PM
Hi GG,
Very interesting and frankly shocking initiative. This should be a huge stride towards increasing indexability on demand, for any site owner that wants to take the trouble.
Please accept a silly question from a technically-challenged PPC junkie, but I have skimmed through some of the FAQ's and see the logic behind having multiple site maps (some of which may be referring to URL's in subdirectories), and only allowing URL's under those subdirectories to be included in that sitemap file.
More generally though, for smaller sites of 1,000 pages or less, is there any benefit to using multiple sitemaps or is it OK to dump a bunch of URL's into a single sitemap referring to the whole site?
Kudos on the innovation, Google.
HopeSeekr of xMule
06-03-2005, 12:33 PM
Hello. I thought you might want to know about my horrific experiences with Google Sitemap. It's BRAND NEW and I was -- apparently -- the very first person to download their python sitemap generator from SourceForge.
Within five minutes it crashed my development server, a 3200 MHz Pentium 4 with 2GB of RAM running Debian Linux. Just imagine if this had been the production server...the costs for over-utilizing the webserver :o
For the details, see http://www.incendiary.ws/node/94 Please syndicate my content if you want :-)
GoogleGuy
06-03-2005, 01:00 PM
In case SitemapsAdvisor is asleep, I'll try to take a couple. Marketing Guy, I think if anyone produces code for something like Wordpress or phpBB, that would be a great opportunity for us to collect several of those examples and point to them in a Google Blog post. I'll pass that suggestion on to folks. Ultimately, most people won't care if Google writes code for vbulletin or Miva (or whatever) or someone else does; as long as there's code that they can use. :) I wouldn't be surprised if various content management software makers provide this as a nice extra feature, for example. Or in theory a web hosting company could offer this as a value-add service. I'm just riffing though..
GoogleGuy
06-03-2005, 01:06 PM
HopeSeekr of xMule, I'll pass on the feedback. I believe many people have run this internally without problems though. Do you have an unusual setup or quota on your webserver? The format is pretty well-defined, so you don't have to rely on our code to generate a sitemap if you'd rather do it yourself; I think that's what dyn4mik3 did. Or you could start by making one by hand with just a couple of pages on it, if you want to take Sitemaps for a test drive without running any code. But I'll ask what the expected swap footprint of the Python code can be.
Hey GG,
I am sorry about my questions a few posts before;
I know it's really difficult for you guys to answer such questions... but maybe you've got an answer for the first one... ;)
atm: Google Sitemaps is down:
Server Error
The server encountered a temporary error and could not complete your request.
Please try again in 30 seconds.
GoogleGuy
06-03-2005, 01:16 PM
krisval and andrewgoodman, I'll take those two questions. :)
It's true that you could do a different sitemap for each subdomain, or for different subdirectories. But if the site has less than 1000 pages, I would lean toward one sitemap at the root of the tree. I think the design is pretty general. That allows someone on an ISP or freehost to list their pages even if they don't have access to the root of a directory (a potential weakness of robots.txt is that it has to be at the root level).
But just because that flexibility exists doesn't mean that you have to use it, and in fact I would recommend keeping it simple for the first few days or first few times that you experiment with Sitemaps. So I'd recommend using one Sitemap for the time being, unless you just love to be on the bleeding edge. :)
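For anyone who does want one sitemap per subdomain later on, the split is easy to automate. This Python sketch groups a flat URL list by host so each group can go into its own file; group_by_host is a hypothetical helper name, and whether to split at all is the judgment call described above.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_host(urls):
    """Return {host: [urls]} so each host can get its own sitemap file."""
    groups = defaultdict(list)
    for url in urls:
        # Hostnames are case-insensitive, so normalise before grouping.
        groups[urlsplit(url).netloc.lower()].append(url)
    return dict(groups)


grouped = group_by_host([
    "http://searchenginewatch.com/index.html",
    "http://forums.searchenginewatch.com/showthread.php?t=1",
    "http://searchenginewatch.com/about.html",
])
```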
rustybrick
06-03-2005, 01:21 PM
GoogleGuy,
Are you worried about crafty webmasters using this to somehow, "unethically", spam the index?
If so, which methods do you think they will try? ;)
Seriously, I was shocked as well. But I think it's a great sign of acceptance of the "two worlds". In a way, you are kind of giving non-SEOs an easy way into the index. Just one more tool to avoid the SEO consultant. Let me clarify: an SEO is still needed to "optimize" the page, but no longer to make the pages "SE friendly". In a sense....
GoogleGuy
06-03-2005, 01:22 PM
That's funny, Joey! As soon as you said that, I told myself--"it must have gotten Slashdotted." Sure enough, it's the top story on http://slashdot.org/ right now. The info about Sitemaps may be running on a "normal" webserver instead of our custom setup. I'd wait for the stampede-o-technical-folks to subside; maybe read the Slashdot thread or give it a few minutes up to a half hour for the geek-slagging to stop. :)
I alerted the Sitemaps team, but a Slashdotting is hard enough for a regular webserver, let alone one that's doing https. I'd give it a few minutes. :)
rustybrick
06-03-2005, 01:37 PM
On the updated Advanced Google Guidelines (http://www.google.com/webmasters/faq.html) page, you explain "Why is my site labeled "Supplemental"?"
Supplemental sites are part of Google's auxiliary index. We're able to place fewer restraints on sites that we crawl for this supplemental index than we do on sites that are crawled for our main index. For example, the number of parameters in a URL might exclude a site from being crawled for inclusion in our main index; however, it could still be crawled and added to our supplemental index.
The index in which a site is included is completely automated; there's no way for you to select or change the index in which your site appears. Please be assured that the index in which a site is included does not affect its PageRank.
(1) Thank you for updating these pages
(2) Can submitting your site via Google Sitemaps help get a site out of the supplemental index?
Receptional
06-03-2005, 01:39 PM
I am uncertain as to the motives for Google to have chosen such a complicated looking solution to achieve what they say they want to achieve. If they want a datestamp for pages, they simply ask us to datestamp pages with a metatag - no need for all this XML stuff using python surely?
This whole thing seems to have had a massive amount of development involved (though maybe not... I just got an error 502) - which looks technology-led rather than market-driven, and I am sure the things that it says it plans to achieve could have been achieved with much less effort.
So I can only conclude that there are other objectives for Google here as well - which are not so public.
:rolleyes:
GoogleGuy
06-03-2005, 01:43 PM
rustybrick,
1) You're quite welcome! :) There's some inertia on modifying our webmaster pages when the pages are translated into lots of other languages, but it was time to reorganize the webmaster pages. I wouldn't be surprised if we stick with English for now and see if there's any extra info we want to add before it gets translated widely.
2) Potentially. I wouldn't view it as an automatic thing though.
GoogleGuy
06-03-2005, 01:52 PM
I think we're pretty upfront on this one, Receptional. Personally, I had never met an XML file I did like (ha ha!). "Why all these DTDs and labels and tags?" I would say when someone presented me with an XML file--you know, just walking down the hall and someone thrusts an XML file in front of you? I hate when that happens. "What's wrong with a tab-separated text file?" I would say when the Foo group wanted to export their Bar data. But: take a look at dyn4mik3's sitemap if you get a chance. It's actually pretty clear, understandable, and shouldn't be hard to generate or process.
GoogleGuy
06-03-2005, 01:55 PM
Danny, I would love for us to do something like what you mentioned eventually. This is a good first step though. Hey, I almost caught up with questions! I'm gonna go see what the techies have to say on Slashdot.
dplazas06
06-03-2005, 01:56 PM
I've been trying for about an hour or so now, and I'm still getting the 502 error... :(
How much longer should I wait to try and access this section?
I got all excited reading about it and can't wait to see it.
D
oilman
06-03-2005, 02:06 PM
Hey GG - at first pass this seems a nice hand out to webmasters so for now I'll say thank you ;)
The sitemap generator is a great idea for folks. I'm wondering if we could use it as a good indicator of whether or not Googlebot can crawl our sites properly. If the generator gets weird results, would that be a good indication that indexing problems may exist that are causing the bots to stumble?
GoogleGuy
06-03-2005, 02:20 PM
oilman, the code that makes the sitemaps is pretty independent of the crawl logic, but if the Sitemap generator creates strange Sitemaps, I'd feel free to assume that you may have a weird site.
dplazas06, I just got through to the page--I'd try again. Looks like the Slashdotting is subsiding now. It was a pretty big one--I think a lot of people are going to be interested in using this. :)
agreen1125
06-03-2005, 02:59 PM
Pardon my stupid question... does Google only accept sitemaps written in XML? Or can I also submit a sitemap in HTML format?
thanks a bunch :confused:
A good move from Google imho. Maybe this can be the start of moving further in working with webmasters for mutual benefit, win win stuff like this is a good place to start.
Sitemap is an obvious copy of ROR files (a patent-pending technology): http://www.rorweb.com .
Relevancy
06-03-2005, 03:37 PM
Will this sitemap xml file find other directories and files that are not connected to the site through linking? What if people have unthemed content or old files that are on their server? Can someone explain that to me.. this is getting complex.
or am I just missing something about this?
sootledir
06-03-2005, 04:51 PM
Relevancy,
You specify the files that go into the sitemap, so any file you tell it will be included.
dannysullivan
06-03-2005, 04:51 PM
I can only conclude that there are other objectives for Google here as well - which are not so public
I've seen a few other references to things like this, to the effect of Google must be up to something.
Sure, they could be. But why do this? Mainly because webmasters have been asking for it. We used to have this. Infoseek used to let you send a list of URLs for inclusion until that went away in '98 or '99.
Since then, we had the occasional idea of having feed programs come up, not just URLs but actual pages. Makes more sense in a lot of ways to feed the content rather than the search engines guessing.
We're talking URL feeds here, rather than actual page content being fed as say Yahoo's paid inclusion program allows. And there's no guarantee for inclusion, of course. But it's a huge first step, and big kudos to Google for giving it a go.
No one is being forced to use the program, but plenty may want to, and I'm glad they have the option. I'm all for options, and it's great as a site owner to see more of them coming our way on the editorial/organic side of things.
I'm especially pleased because I've made a few digs in the past about site owners feeling like second-class citizens, when we watch feeds being taken for blog content or, say, Froogle taking feeds, but the core web page system hasn't given authors many goodies. So I love having this type of attention spent on us.
PhilC
06-03-2005, 05:18 PM
A question, GG (or the other guy):
I'm in the middle of reading through Google's sitemap pages, which suggest that sites that have pages behind forms (hidden from spiders) would benefit from the system. At first glance, that sounds good, but Google relies heavily on links (and link text), both for rankings and even for getting pages indexed at all. I.e. an orphan page won't be indexed even if it manages to get found somehow - maybe from the submission form. Many people, including myself, use alternative methods of getting those 'hidden' pages crawled.
I'm assuming that the Sitemaps system doesn't modify the other 'normal' systems, so pages behind forms that have no links pointing to them still won't be indexed, even if their URLs are placed on the sitemap. Am I right in that assumption?
The reason I ask is because people could be led into not going the route of an alternative, crawlable path for those pages, and rely on Sitemaps instead, which means that their pages may get crawled, but they won't get indexed. Or, at best, they may languish in the Supplemental index, which is the last thing that anyone wants.
PhilC
06-03-2005, 06:04 PM
More questions:
I've been on the web since 1997, and as a long-time programmer, I've done backend stuff from the start, but I've never needed a server to have Python installed - until now. A great many Apache servers don't have it (yes/no?) but they all have PHP and Perl, so why Python? Or is it just that I haven't come across Python? Are there any plans to produce sitemap generator versions in other serverside languages?
A simple text file is the alternative to XML, and one that almost everyone can understand very easily. It allows for such parameters as priority, last updated, etc., but a text file won't inform Google when the sitemap file has changed, whereas the sitemap generator does. Is there another way that text file users can inform Google when there's been a change? What happens if Google isn't informed when there's been a change? Does the sitemap just sit there never to be crawled again, or is there a schedule for crawling them whether or not a change has been notified?
I haven't used the system yet, but is it ok to use the "Add a sitemap" page to submit the same sitemap every time it is updated? If that's ok, it would be an answer to the second paragraph.
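For the text-file route, here is a minimal Python sketch. It assumes the text format is simply one URL per line, which should be confirmed against the protocol docs, and it adds a content fingerprint so a site can tell when the file has actually changed and is worth resubmitting. Both function names are made up for this example.

```python
import hashlib


def text_sitemap(urls):
    """Return a plain-text sitemap: one URL per line, blanks and duplicates dropped.

    Assumption: the text format carries only URLs, one per line.
    """
    seen = set()
    lines = []
    for url in urls:
        url = url.strip()
        if url and url not in seen:
            seen.add(url)
            lines.append(url)
    return "".join(line + "\n" for line in lines)


def fingerprint(sitemap_text):
    """Hash the sitemap content so two versions can be compared cheaply."""
    return hashlib.sha256(sitemap_text.encode("utf-8")).hexdigest()


old_text = text_sitemap(["http://example.com/", "http://example.com/a.html"])
new_text = text_sitemap(["http://example.com/", "http://example.com/a.html", "http://example.com/b.html"])
needs_resubmit = fingerprint(old_text) != fingerprint(new_text)
```

Storing the last fingerprint next to the file would let a cron job decide whether a resubmission is needed at all.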
PhilC
06-03-2005, 06:23 PM
Sorry - I'm hogging this thread now - but it's not my fault if I'm the only one here :(
Suggestion:
It would be really useful if this system were also used to get pages quickly dropped from the index. A minus sign in front of a URL is all that would be needed. People are often wanting old URLs to be dropped for various reasons, but usually have to wait months until Googlebot finally realises that the page ain't there any more.
DaveN
06-03-2005, 07:02 PM
Phil, Python: do you mean most commercial servers?...
Python is dead easy to install, and I agree, odd choice... GG, was this an engineer's side project, the 20% thing?...
PS I do love it :)
DaveN
GoogleGuy
06-03-2005, 07:16 PM
PhilC, good suggestion on the minus sign. Google is actually a pretty big Python place; that's my hunch on why we provided an example tool in Python, and quite a few people use Python these days. I wouldn't be at all surprised if people build up tools pretty quickly in different languages (C++, Perl), and probably for different web server types as well.
To cover your other question, this is something that will augment our existing ways of finding pages via links. I would definitely recommend that people continue to get relevant links and build normal sitemaps as well. Think of this as a nice way to give hints to search engines, but people shouldn't stop doing the normal work that they would do.
PhilC
06-03-2005, 07:20 PM
I don't know, Dave. It's just that I haven't personally come across Python on a server.
I agree - it's a knockout system. How often do we see someone asking how they can get Googlebot to crawl their site, or to crawl deeper because it only goes so far and then stops. Daily? This system should be the answer to them all.
GoogleGuy
06-03-2005, 07:42 PM
Once you get it installed, Python is a really nice language to use. I was a PHP fan when I joined Google, but for some reason Python really resonates with me and a lot of engineers here; it's got nice, clean use of whitespace. Perl never really attracted me that much. I wrote some Perl programs, then when I came back to them in a month or two, it just looked like a punctuation monster had barfed all over the place. Then again, maybe that was just my really poor way of writing Perl. :)
An early version of the Google crawler was actually written in Python a long time ago, I believe.
PhilC
06-03-2005, 07:53 PM
I wrote extensively in Perl in my early web times, but I never took to it, so you have an ally there, GG. PHP is much nicer, but still a long way from being a nice programming language. I've never looked at Python because I've never come across it except for its name. I'm sure that tools will come out quickly enough, as you said. It's just odd that the first one would be produced in a 'minority' language.
DaveN
06-03-2005, 07:57 PM
Googleguy, that crawler at peak speeds could only crawl 25 web pages per second... guess you guys wrote a new one ;) ...
with a 10 meg file limit on the sitemap xml.. will this cause webmaster server melt downs... or are we going to see a new crawler for these..
DaveN
littleman
06-03-2005, 08:46 PM
I see two main reasons why they are doing this...
1 to keep the pressure off of their indexer, they could cut back to clean URLs and then tell people that if they want their "deep" links indexed it is up to them to submit them.
2 to take the wind out of the wings of services like searchdex.
littleman
06-03-2005, 08:54 PM
Off topic:
GoogleGuy, you may like Lua (http://www.lua.org/).
Relevancy
06-03-2005, 10:37 PM
GG, if the point of this is to help webmasters get their pages found and indexed and to take the load off of the Google indexers, then couldn't this flood Google with tons of dynamically generated crap pages that are only created for the search engines? Making the index bigger, but diluting it with junk?
Even if you have some sort of filtering in place, you will still have massive amounts of SE generated crap pages that look real enough, thereby killing smaller sites that have dedicated hand crafted true content pages. Small sites will not be able to compete with the big junk sites. Even big junk can accidentally be seen as a respectable thing.. look at MSN :)
Am I the only one that doesn't think "more pages the better"? Quality sites can be small. Don't kill them.
projectphp
06-04-2005, 06:00 AM
I am really excited about this. IMHO, this is a win (Google), Win (website owners) win (consumer) idea. I can't imagine a better step forward, and this is beyond anything I thought an SE would do back when Danny posted the Ideas For The Indexing Summit (http://forums.searchenginewatch.com/showthread.php?t=4139) way back when.
Bravo, and a round of applause all 'round.
To put on my mother-in-law, never good enough for my kindred hat, the fact it is a uni-lateral policy (excuse the Reaganite terminology), and not something the W3C and other players are onboard with is, IMHO, a bit unfortunate. Like the link nofollow attribute, Google is running solo, and that is not ideal. With the nofollow initiative, we still lack clear definitions of what it means, and I wish there was a one page sheet of all the stuff a search engine does and does not do.
There is still work to be done improving the communication between SEs and sites, and what can be communicated to make both sides' lives easier.
I think everyone with a vested interest, from the engines to the W3C and webmasters, needs to get together and create some standards.
Robots.txt, as an example, is still in the same state as, what, like 1995? We now have a few proprietary commands and metatags, and no real consistency.
But I love this new SiteMaps initiative, and would just like to see a more multi-lateral approach. But hey, I was against the war, so my multi-lateral leanings are firmly established :)
PhilC
06-04-2005, 07:05 AM
Standards are created by popular use, but I do know what you mean. The two largest engines handle the rel attribute, so it immediately became not only "the standard", but also "a standard" - slight difference in meanings. This Sitemap system isn't yet the standard, but it may quickly become it. It will only need Yahoo! to make use of it, and, hopefully, MSN. I much prefer big players to get on and initiate things like this, and the rel attribute, rather than wait around while small organisations, like W3C, drag their feet. Besides, this one is not within the compass of what W3C deals with.
What we don't want, though, is for the big players to do the same things in different ways. Yahoo! were excellent when Google introduced the rel attribute, because they quickly supported it, even though they would have preferred to do it another way. If they are going to do anything similar to Sitemaps, it would be very useful if they simply used this new file format, perhaps with additional parameters if they want them, which Google's system should already be programmed to ignore without them making the files unusable, although standardising this isn't as important as standardising the rel attribute.
PhilC
06-04-2005, 12:41 PM
There are still a few questions that I'd like answers to please, GG or SitemapsAdvisor:-
(1) Once Google has been informed that a Sitemap page exists at a particular URL, through the Add a Sitemap page or running the Python script, will the file be spidered routinely (with some sort of regular frequency) or will it only be spidered after each time that Google is notified of a change? In other words, can a registered/added file be simply updated and left for the spider to get it when it wants to, or must Google always be notified before the spider will come out again?
(2) Not much is said about filename extensions. I imagine that many people will use the text file option, due to the lack of Python on some servers. Is it necessary to use the .txt extension for text files, or will any extension do, and the parser can work out what the file contains?
(3) How can text file people inform Google of a change to the file(s)? Is it ok to use the Add a Sitemap page to re-add the same file(s) and bring the spider out that way?
figment88
06-04-2005, 03:18 PM
IMHO sitemaps are a nice move toward indexing with permission.
A bigger problem with the status quo than not getting enough pages indexed is search engines crawling websites without permission.
The absence of an explicit disallow statement in a robots.txt should not give search engines complete access to a website. Instead they should look for explicit allow statements.
BTW on a side note, GG I completely agree with you that often simple delimited files are far preferable to xml. I keep making this argument to folks at places like amazon and chefmoz to no avail :)
Web Design Pros
06-04-2005, 05:27 PM
The Google Sitemaps initiative is a very good thing.
We've all seen many sites that were never completely indexed.
With Google Sitemaps the problem is solved!
I'm sure it is a most difficult task to write a spider like GoogleBot, that must index every form and quality of html on the entire internet.
By shifting the emphasis off the robot programmers and onto the willing webmasters, GoogleBot can just spider the links and index the content it finds there without wasting time and other resources in unintentional/intentional spider traps. It's brilliant! :D
BTW it seems to take an hour from submission time for the stats page to go from Pending to OK.
Google Guy, I wonder what OK means. Does it mean that they validated the content or that it has been queued for spidering?
We'd love to see a PHP version. It's easily the most popular server side scripting tool. Perl would be OK too.
We know that independent tools could also be created, maybe in java or .net just as long as the end result is a sitemap.xml.gz file.
Web Design Pros
06-04-2005, 05:34 PM
I'm just anxious to see the difference this makes to the quality of a site's index. This could do away with URL rewriting if the other major SEs adopted it as a standard.
SitemapsAdvisor
06-05-2005, 04:34 AM
Hey folks:
agreen1125: https://www.google.com/webmasters/sitemaps/docs/en/faq.html#s8 and #s9 talk about alternate formats. However, using the Sitemaps format gives search engines more data to work with. You may also want to follow http://groups-beta.google.com/group/google-sitemaps, since a lot of folks are helping each other out with specific instances.
PhilC: Good questions :)
-- We anticipate a lot of the crawling policies will be done differently by different search engines/crawlers. We currently occasionally recrawl the sitemaps. The /ping mechanism discussed in the protocol (https://www.google.com/webmasters/sitemaps/docs/en/faq.html#s4) is a simple way to update us of changes. This is true for the XML based and txt based sitemaps. Again search engines may not get to your sitemap right away, if they are overloaded or think your pings are spurious.
-- We left out the names for the files deliberately, so webmasters can do what is most convenient for them. E.g., use a .php script or a dynamic cgi as the place to pick up a sitemap. The sitemap_gen.py defaults to sitemap.xml and sitemap_index.xml, unless you specify something else.
Web Design Pros:
--Currently Ok means we have crawled your sitemap, validated the XML, parsed out URLs and metadata and have sent them to a queue for a regular crawl. But usually it is good to look at the webmasters/sitemaps frontend, so you can see what the latest errors are. We are also learning about what mistakes people make on their sitemaps and things that are unclear -- this is new to us too :)
sootledir
06-05-2005, 08:01 AM
I'm very impressed.
All told, it took about one hour to install Python, set filters to exclude files and to generate the sitemap. Googlebot seems to have gotten active after that and came and crawled many of the pages.
This can be run easily enough with a CRON job and a ping. Very impressive, all in all.
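For anyone who wants to script the ping from a cron job, here is a minimal sketch of building the request URL. The endpoint address and the `sitemap` query parameter are assumptions based on the protocol docs linked above, so check them before relying on this; `build_ping_url` is just an illustrative name.

```python
from urllib.parse import quote

# Assumed ping endpoint; confirm it against the Sitemaps protocol docs.
PING_ENDPOINT = "http://www.google.com/webmasters/sitemaps/ping"


def build_ping_url(sitemap_url, endpoint=PING_ENDPOINT):
    """Return the ping URL with the sitemap address URL-encoded as a query value."""
    # safe="" makes quote() encode ':' and '/' too, so the whole address
    # travels as a single query-string value.
    return endpoint + "?sitemap=" + quote(sitemap_url, safe="")


if __name__ == "__main__":
    # Requesting the printed URL (in a browser, or with urllib.request.urlopen)
    # is what actually sends the ping.
    print(build_ping_url("http://www.example.com/sitemap.xml.gz"))
```

The function only builds the URL and does no network access, so it can be tried out safely before being wired into a scheduled job.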
PhilC
06-05-2005, 11:34 AM
Thanks, SitemapsAdvisor. I hadn't read the "ping" part. So the submitting can even be done by requesting the URL in a browser - excellent.
Another quick question, I think I've read something about it but I don't remember the details, and it's not covered in the FAQs. Is it absolutely necessary to compress the file(s) or will an uncompressed file do as long as it is within the 10 meg limit - e.g. "mysitemap.txt" or "mysitemap.xml"? Also, if compression is necessary, will a zip file do or must it be a .gz file?
Oddly enough, I can't even see anything in the FAQs about the 10 meg limit.
DaveN
06-05-2005, 03:09 PM
PhilC . it's in the docs ... under "Providing Multiple Sitemap Files"
https://www.google.com/webmasters/sitemaps/docs/en/protocol.html#sitemapFileRequirementsml
DaveN
PhilC
06-05-2005, 06:56 PM
I can't see the answer to my questions there, Dave. The only reference is that a Sitemap file must be no more than 10 meg when uncompressed. It doesn't say whether or not the files *must* be compressed and, if so, which compressions are ok to use.
To be honest, the pages there are not particularly clear unless you are going to use Google's own generator. Everything needs to be encoded this way or that, which will confuse a lot of people. It would be nice to know if a simple uncompressed text file, with the .txt extension, will do, because that's what a lot of people would opt for. And zipping it rather than gzipping would be popular if it were clear that zip files are ok.
To be honest, the pages there are not particularly clear unless you are going to use Google's own generator. Everything needs to be encoded this way or that, which will confuse a lot of people. It would be nice to know if a simple uncompressed text file, with the .txt extension, will do, because that's what a lot of people would opt for. And zipping it rather than gzipping would be popular if it were clear that zip files are ok.
Hi PhilC, Google says: (https://www.google.com/webmasters/sitemaps/docs/en/protocol.html#faq_compression)
Q: Can I zip my Sitemaps or do they have to be gzipped?
Please use gzip to compress your Sitemaps.
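As a rough illustration of the gzip step, the sketch below compresses an existing Sitemap file using only Python's standard library. The function name `gzip_file` and the file names are hypothetical; the only thing taken from the thread is that the result should be a gzip file such as sitemap.xml.gz.

```python
import gzip
import shutil


def gzip_file(src_path, dest_path=None):
    """Write a gzip-compressed copy of src_path and return the new file's path."""
    if dest_path is None:
        dest_path = src_path + ".gz"  # e.g. sitemap.xml -> sitemap.xml.gz
    # Stream the file through gzip so large sitemaps are not read into memory at once.
    with open(src_path, "rb") as src, gzip.open(dest_path, "wb") as dest:
        shutil.copyfileobj(src, dest)
    return dest_path


if __name__ == "__main__":
    # Assumes a sitemap.xml file already exists in the current directory.
    print(gzip_file("sitemap.xml"))
```

The original file is left in place, so the uncompressed copy can still be checked against the 10 meg limit mentioned in the docs.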
PhilC
06-05-2005, 08:56 PM
Aha! Thank you. I missed that bit hidden at the bottom :(
Even so, the pages still don't say whether or not uncompressed files are ok. On the whole, people won't be able to use gzip unless they use somebody's script to do it for them. My gut feeling is that, until general purpose scripts are available that everyone can use without any technical knowledge at all, other than FTPing it to the site, the majority of people would prefer to simply make a text file and submit it. Zipped is ok for those people, but gzipped isn't.
I don't think that a Python script is general purpose enough, because I think that too many servers don't have Python installed. That's why I would really like to know about plain old text files, and what Joe Bloggs can do on his own.
Web Design Pros
06-05-2005, 10:30 PM
We can help you generate a sitemap.xml.gz file for your site if you need the help. Just PM me to get started.
PhilC
06-05-2005, 10:40 PM
Thanks WDG, but that's not a problem. I'm looking at website owners in general, and wanting to know what they can and can't actually do. It would be disappointing if the system couldn't be used by everybody without having to pay someone to do it for them.
DaveN
06-06-2005, 04:54 AM
wow Google were quick..
http://www.google.com/webmasters/guidelines.html
DaveN
Receptional
06-06-2005, 05:14 AM
I think we're pretty upfront on this one, Receptional.
OK then GG. That's alright then.:cool: I still think this is a lot of development work for something that could have been achieved using Googlebot and a tag or two on a bog standard HTML page.
Paranoid, from Preston.
SebastianX
06-06-2005, 05:03 PM
I still think this is a lot of development work for something that could have been achieved using Googlebot and a tag or two on a bog standard HTML page.
Actually, there is not a lot of scripting necessary. I can't imagine how it could be done with a tag on a bog-standard page. I've written down my thoughts on GoogleSitemaps, trying to explain the simplicity of this service:
http://www.smart-it-consulting.com/article.htm?node=133&largePage=TRUE
I hope this tutorial can be helpful, it comes with free code and narrows down the apparent complexity on a first view of Google's documentation.
sootledir
06-06-2005, 07:25 PM
Philc, the file DOES NOT have to be compressed.
littleman
06-06-2005, 07:42 PM
What happens if one submits a subset of a domain? Will the URLs which are not submitted be removed?
PhilC
06-06-2005, 08:43 PM
Thanks sootledir. How do you know? Have you found the info somewhere or have you tried it?
PhilC
06-06-2005, 08:45 PM
No, littleman, the omitted URLs will not be affected. This system is to tell Google, and hopefully other engines in the future, about URLs, but it doesn't tell them that these are the only files that need crawling. The normal spidering goes on as normal. This thing is just an extra.
sootledir
06-06-2005, 09:51 PM
I tried it. It accepted a plain sitemap.xml file. I saw no need to compress a file with 200 urls in it.
PhilC
06-07-2005, 07:39 AM
Excellent. That's what I wanted to know - thanks!
DaveN
06-07-2005, 07:59 AM
do you think adding the sitemap.xml will add a trust value to the site :)... it can't be spam because of A,B and C ..
Just a thought still testing .. but it adds another tick in the right box
DaveN
DaveN
06-07-2005, 08:02 AM
on the compression front I think that's more of web masters benefit than googles, bandwidth and all that ..
DaveN
PhilC
06-07-2005, 08:08 AM
It probably is Dave, but it needs to be known because gzip isn't available to most people.
DaveN
06-07-2005, 08:17 AM
it's standard on most linux boxes, and so is python... well on my linux boxes it is anyway .. ok too much info on my servers, GoogleGuy ignore what i said i'm NT4 all the way :)
DaveN
PhilC
06-07-2005, 08:28 AM
Yes, but most people on Linux servers can't use it. Remember that most website owners know nothing about servers - some of them have no idea what an FTP client is - they do it via the browser, but Sitemaps is for website owners in general, and not just for server savvy people. I'm probably far more technical than most people here - right down to the circuits. I've been programming for more than 20 years, and I've been doing extensive serverside programming for 8 years, mostly on Apache servers, but I've no idea how to use gzip. If I need to find out how to do it, then imagine how it must be for website owners in general. Even configuring a script file is beyond most website owners. So my questions have been for website owners in general, and not for people who know how to do these things or who can easily sort it out.
People need to know the simplest way of using the Sitemaps system, and that's what I've been trying to ascertain.
Solideo
06-07-2005, 01:00 PM
Wondering if someone can clarify what types of sites are permitted (or not permitted) to use Sitemap... Since a G account is required, and the account TOS stipulate that services rendered under the account are to be used for "personal, non-commercial use only", am I correct in assuming that ecommerce sites are prohibited from using Sitemap?
To further clarify, the TOS go on to say that "You may not use the Google Services to sell a product or service, or to increase traffic to your Web site for commercial reasons, such as advertising sales". The maybe-stupid question then becomes, how does this apply to sites funded by affiliate programs? Or... sites running Adsense?
Thanks
PhilC
06-07-2005, 01:03 PM
Any type of site is permitted, and a Google account isn't necessary if you use the ping option to submit Sitemap files - it can be done from your browser.
dynamedia
06-07-2005, 02:53 PM
Wondering if someone can clarify what types of sites are permitted (or not permitted) to use Sitemap... Since a G account is required, and the account TOS stipulate that services rendered under the account are to be used for "personal, non-commercial use only", am I correct in assuming that ecommerce sites are prohibited from using Sitemap?
I'd definitely second that there's a need for some clarification from Google on that point. I'm responsible for a couple of sites that pretty much rely on Google traffic - to the point that a mistake could mean loss of jobs (mine especially!). I'd very much like to submit a sitemap, but until this point has been made clearer, I'm having to err on the side of caution.
PhilC
06-07-2005, 02:57 PM
Well, GoogleGuy is on holiday for a week, and SitemapsAdvisor seems to have gone missing - maybe they disappeared together :D
I wrote a little tutorial on my blog on how to create a Sitemap for a large site using Xenu Link Sleuth and Excel. One of my sites has almost 10,000 pages and I wanted to get a Sitemap created over the weekend to test. I am holding off on using development resources until we see the effects.
http://www.ethangiffin.com/?p=29
Ethan
p.s. - I too think this a great idea
agreen1125
06-07-2005, 03:53 PM
the other files needed? Is there any other file/s that needs to be created other than sitemap.xml? I'm a newbie in xml and based on what i've read so far you have to have a tag definition file.
Any body kind enough to shed some light? Right now I only have sitemap.xml on my site.
thanks.
Andy
SebastianX
06-07-2005, 04:03 PM
You need only the sitemap.xml file, unless you list more than 50,000 URLs or the file size exceeds 10 megs, whichever occurs first. In that case you have to slice it, providing multiple sitemap-n.xml files and a sitemap-index.xml file pointing to all sitemaps. You can use any file name, even a script like sitemap.asp pulling the stuff from the database on request by the bot. Google uses the xml-file/script you submit. I've explained it in more detail here:
http://www.smart-it-consulting.com/article.htm?node=133&largePage=TRUE#part4
HTH
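The slicing described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the 50,000 figure and the 0.84 namespace come from this thread, while the function names and the sitemap-1.xml naming scheme are hypothetical. It does not check the 10 meg size limit.

```python
from xml.sax.saxutils import escape

MAX_URLS_PER_SITEMAP = 50000  # per-file URL limit quoted in this thread
SITEMAP_NS = "http://www.google.com/schemas/sitemap/0.84"


def split_urls(urls, size=MAX_URLS_PER_SITEMAP):
    """Slice a list of URLs into consecutive chunks of at most `size` items."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_index(sitemap_locations):
    """Return sitemap index XML with one <sitemap> entry per sitemap file URL."""
    lines = ['<?xml version="1.0" encoding="UTF-8"?>',
             '<sitemapindex xmlns="%s">' % SITEMAP_NS]
    for loc in sitemap_locations:
        # escape() turns characters such as '&' into XML entities.
        lines.append("  <sitemap><loc>%s</loc></sitemap>" % escape(loc))
    lines.append("</sitemapindex>")
    return "\n".join(lines)


if __name__ == "__main__":
    urls = ["http://www.example.com/page%d.html" % n for n in range(120000)]
    chunks = split_urls(urls)
    # Hypothetical naming scheme: sitemap-1.xml, sitemap-2.xml, ...
    locations = ["http://www.example.com/sitemap-%d.xml" % (i + 1)
                 for i in range(len(chunks))]
    print(build_index(locations))
```

Each chunk would then be written out as its own urlset file, and only the index file needs to be submitted.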
martinuboo
06-07-2005, 04:09 PM
I wrote a little tutorial on my blog on how to create a Sitemap for a large site using Xenu Link Sleuth and Excel. One of my sites has almost 10,000 pages and I wanted to get a Sitemap created over the weekend to test. <snip>
http://www.ethangiffin.com/?p=29
Ethan
Thanks Ethan! :) That seems like a very straight forward, simple way to create the static xml file. I'm holding off on the Python tool, since I have already heard of a few horror stories on server meltdown. :eek:
http://www.threadwatch.org/node/2760
martin
Web Design Pros
06-08-2005, 12:56 PM
Please help us test a sitemap.xml.gz file Generator tool currently under development.
It has most of the functionality you would need to create the file for your site:
It has the following features.
index your site
obey keyword filters
generate a sitemap.xml file
generate a sitemap.xml.gz file
ftp a sitemap.xml.gz file to your server
Ping the googlebot to come read your sitemap
Visit the link below for details:
Beta test the sitemap.xml.gz Generator for use with Google Sitemaps (http://www.web-design-pros.ca/forum/viewtopic.php?p=24#24)
We're using a forum topic to capture the feed back.
Please register to participate.
Web Design Pros
06-08-2005, 09:44 PM
We're starting to see that the links included in the sitemap.xml.gz file that we submitted 4 days ago are starting to appear in the Google index!
If you'd like to build a sitemap for your site without installing Python on your server, see our post about Beta testing a sitemap.xml.gz generator tool (http://forums.searchenginewatch.com/showthread.php?p=49941#post49941)
PhilC
06-08-2005, 09:48 PM
The sitemap itself is in the index? I wouldn't have expected that to happen, and I can't see any point in it being indexed. Surely they have made a mistake. Or do you mean that some of the URLs it contains are appearing in the index?
<added>
I can't delete this post or I would. I see that you altered your post to make it more accurate :)
Web Design Pros
06-08-2005, 09:52 PM
The sitemap itself is in the index? I wouldn't have expected that to happen, and I can't see any point in it being indexed. Surely they have made a mistake.
The links in the sitemap.xml.gz file are now in the index.
They still need to be meta tag indexed.
That must happen in the second pass.
agreen1125
06-09-2005, 02:30 PM
i create multiple copies of sitemap.xml on each directory? or just include all of my sub's to one sitemap.xml on the root?
thanks
Web Design Pros
06-09-2005, 05:22 PM
i create multiple copies of sitemap.xml on each directory? or just include all of my sub's to one sitemap.xml on the root?
thanks
The decision depends on the size of the site.
The root is fine if it is for a site with under 50,000,000 pages.
If you have more than that then you need to divide by directory.
chinook
06-09-2005, 05:56 PM
We have released an ASP.Net script that will parse through an IIS site and also parse through log files if they are available. The log file parsing really helps with dynamically generated urls ( for instance shopping carts). The results are outputted as an XML file that conforms to the google sitemap specification.
sitemap.chinookwebs.com (http://sitemap.chinookwebs.com)
PhilC
06-09-2005, 05:56 PM
The root is fine if it for a site with under 50,000,000 pages.
That should be 50,000 (thousand) pages, not 50,000,000 (million) ;)
If you do split the pages into several sitemaps, you need to create an index to them.
Web Design Pros
06-10-2005, 12:28 AM
Please help us test a sitemap.xml.gz file Generator tool currently under development.
It has most of the functionality you would need to create the file for your site:
...
Visit the link below for details:
Beta test the sitemap.xml.gz Generator for use with Google Sitemaps (http://www.web-design-pros.ca/forum/viewtopic.php?p=24#24)
We're using a forum topic to capture the feed back.
Please register to participate.
There is a new version written in Java that should be able to run on every OS including:
Linux
Unix
Windows
MacOS
luisbetancourt
06-10-2005, 12:32 PM
If my site has the following structure:
http://subdomain1.domain.com/
http://subdomain2.domain.com/
.
.
http://subdomainN.domain.com
Do I need to have a site map for each subdomain? :confused:
agreen1125
06-10-2005, 02:54 PM
a stupid newbie's question :)
how can I tell if G spider or any spider in fact visits my page? :confused:
Thanks..
luisbetancourt
06-10-2005, 03:06 PM
Check your access log....
Look for lines with the words "Googlebot", "Yahoo Slurp", etc... :)
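To make the access log check concrete, here is a small sketch that counts log lines mentioning known crawler names. It assumes the user-agent string is written into each log line (as in the common "combined" log format); the marker list and the function name `crawler_hits` are illustrative, and "msnbot" is an added assumption that was not mentioned above.

```python
# Substrings to look for in each log line; matching is case-insensitive.
# "Slurp" covers Yahoo's crawler; "msnbot" is an assumed extra marker.
CRAWLER_MARKERS = ("Googlebot", "Slurp", "msnbot")


def crawler_hits(lines, markers=CRAWLER_MARKERS):
    """Count how many log lines mention each crawler marker."""
    counts = {marker: 0 for marker in markers}
    for line in lines:
        lowered = line.lower()
        for marker in markers:
            if marker.lower() in lowered:
                counts[marker] += 1
    return counts


if __name__ == "__main__":
    # "access.log" is a placeholder; use the path your host gives you.
    with open("access.log", encoding="utf-8", errors="replace") as log:
        print(crawler_hits(log))
```

A user-agent string can be faked, so a count like this shows who claims to be a crawler rather than proving it.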
benwalsh
06-11-2005, 11:17 PM
This looks good to me! A way of telling googlebot which files to crawl.
I have been having problems with my forums not being indexed as well as they might, last year all was well, then i upgraded my script (Ubb Classic) and moved to a new server.
my urls changed and thousands of threads were removed from the google cache, to be replaced by only hundreds. I think my problem could lie with the fact that the forums can be accessed from a variety of urls e.g. php and cgi
For some time i have considered telling robots.txt to ignore the .cgi and just use the .php, but was always loath to tell googlebot to stay away at all, and do not like the word disallow that we are forced to use, now i see a way of putting up positive instructions.
what i have done today is to produce a script that outputs a .xml file containing the php urls of today's active topics.
i wish to make this a just list of urls, and see no need to even date each one, as all will be todays, i am looking for the simplest file i can get
say i have a file example.com/todays-topics.xml
that looks like:
<?xml version="1.0" encoding="iso-8859-1" ?>
<sitemapindex xmlns="http://www.google.com/schemas/sitemap/0.84">
<sitemap>
<loc>http://www.example.com/ubb/ultimatebb.php/topic/5/13853.html</loc>
</sitemap>
<sitemap>
<loc>http://www.example.com/ubb/ultimatebb.php/topic/4/437.html</loc>
</sitemap>
</sitemapindex>
Is this right and what if any other tags must i have?
thnx and wow to think gg is reading this, looks to me if we both save resources if googlebot hits on only the right files on our site, i am quite happy to make a daily list and ping google when it is ready, i will even do this religiously at 00:00 GMT here's hoping i am right in thinking that a combination of this and robots.txt can solve my dilemma.
SebastianX
06-12-2005, 04:18 AM
Will not work. The sitemap-index file is not meant to list your pages, it must be used when you have more than 50,000 URLs to point to your sitemap files, each containing less than 50,000 URLs.
If you really don't want to date your pages (populating last modification) and if you really are not interested to tell Googlebot details about change frequency and crawling priorities from your POV, you can produce an urlset containing just the location of your pages. All other attributes are optional. Explained here:
https://www.google.com/webmasters/sitemaps/docs/en/protocol.html
http://www.smart-it-consulting.com/article.htm?node=133&largePage=TRUE#part3
HTH
Sebastian
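A minimal sketch of the "urlset with just the locations" idea, using Python's standard library. The namespace is the 0.84 one quoted in this thread; the function name `build_urlset` is hypothetical, and the optional tags (last modification, change frequency, priority) are deliberately left out.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.google.com/schemas/sitemap/0.84"


def build_urlset(urls):
    """Return a urlset document containing only a <loc> for each URL."""
    root = ET.Element("urlset", xmlns=SITEMAP_NS)
    for address in urls:
        url_element = ET.SubElement(root, "url")
        # ElementTree escapes characters such as '&' when serializing text.
        ET.SubElement(url_element, "loc").text = address
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    print(build_urlset([
        "http://www.example.com/",
        "http://www.example.com/ubb/ultimatebb.php/topic/5/13853.html",
    ]))
```

Save the returned text as UTF-8 so that the file matches its own encoding declaration.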
benwalsh
06-12-2005, 06:11 AM
understood that i do not have a site map here. What do i have? i have a list of updated files that i would like to submit to google. can i use this to submit them?
what is the basic file structure, i am happy to include date or any other tag that is essential. i am just looking for the basic file structure that will allow me to submit updated or new files.
PhilC
06-12-2005, 08:07 AM
The Sitemaps system won't cause Googlebot to *only* crawl the URLs that you provide. It's an addition to the normal crawling.
SebastianX
06-12-2005, 09:25 AM
>i am just looking for the basic file structure that will allow me to submit updated or new files.
Here you go:
https://www.google.com/webmasters/sitemaps/docs/en/protocol.html
benwalsh
06-12-2005, 10:20 AM
understood i will need to use robots.txt to exclude,
for now though i wish to establish the best format for my xml file, is it:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
<url>
<loc>http://www.example.com/</loc>
</url>
<url>
<loc>http://www.example.com/ubb/ultimatebb.php/topic/1/2681.html</loc>
</url>
<url>
<loc>http://www.example.com/ubb/ultimatebb.php/topic/2/5671.html</loc>
</url>
<url>
<loc>http://www.example.com/ubb/ultimatebb.php/topic/9/2130.html</loc>
</url>
<url>
<loc>etc</loc>
</url>
</urlset>
i am looking at the date and priority tags, but would like to leave them out if it is ok?
SebastianX
06-12-2005, 01:05 PM
Looks good. You can validate your sitemap XML here:
http://www.smart-it-consulting.com/internet/google/submit-validate-sitemap/
benwalsh
06-13-2005, 02:31 AM
Success! i think? i have submitted to google, it went from pending to OK, in about an hour.
The present output from example.net/site-map.xml is:
<?xml version="1.0" encoding="UTF-8" ?>
<urlset xmlns="http://www.google.com/schemas/sitemap/0.84" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.google.com/schemas/sitemap/0.84 http://www.google.com/schemas/sitemap/0.84/sitemap.xsd">
<url>
<loc>http://www.example.net/ubb/ultimatebb.php/topic/5/13862.html</loc>
<lastmod>2005-06-13</lastmod>
</url>
<url>
<loc>etc</loc>
<lastmod>2005-06-13</lastmod>
</url>
</urlset>
i intend to put my latest topics here, am i really doing ok?
what is the ideal number of links, or should i say url's, in these files?
should i make a list from my site maps, and list all pages in this format?
mickisdaddy
06-13-2005, 11:50 AM
I don't know if anyone has tried submitting a sitemap that points to multiple sitemaps, but I did and here is what happened.
I submitted the sitemap.xml that contains the location of 4 other sitemaps. Each of the four contains probably 75,000 pages total, but I plan on it growing so I did separate sitemaps.
Googlebot downloaded the submitted sitemap within a few minutes after submitting it. The status page showed okay after an hour. Googlebot then spidered the original sitemap once the next day (yesterday) then again this morning. Then this morning Googlebot spidered the other four sitemaps from different IP addresses right about the same time.
I have not looked very close to see if some of the pages from the four sitemaps have been spidered, but it has only been a couple hours since the four sitemaps have been spidered.
I had to create my own script to create the sitemaps because of how the pages are made dynamically and because a large portion of the site is not even in G's index, hence not visited much. It took me a little, but I got it down and the XML validated fine.
Hopefully this will help my site, especially since I redesigned my site and changed the structure a little bit and because only about a quarter of my site had been indexed before.
PhilC
06-13-2005, 12:01 PM
Google say that you shouldn't have more than 50,000 URLs in each site map - not 75,000.
Did your index sitemap state that it is an index to the actual sitemaps? If it's not done right, the other sitemaps will be seen as just pages/files, and not as sitemaps.
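One quick way to check what a file declares itself to be is to look at its root element: an index uses `sitemapindex`, while an ordinary Sitemap uses `urlset`, as in the samples earlier in the thread. The sketch below does that and also counts the `<loc>` entries so the total can be compared with the 50,000 limit. The function name `describe_sitemap` is hypothetical, and this is not a substitute for a real schema validation.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an ElementTree '{namespace}' prefix from a tag name."""
    return tag.rsplit("}", 1)[-1]


def describe_sitemap(xml_bytes):
    """Return (kind, loc_count), where kind is 'index', 'urlset' or 'unknown'."""
    root = ET.fromstring(xml_bytes)
    kinds = {"sitemapindex": "index", "urlset": "urlset"}
    kind = kinds.get(local_name(root.tag), "unknown")
    # Count every <loc> element, whichever namespace it is declared in.
    loc_count = sum(1 for el in root.iter() if local_name(el.tag) == "loc")
    return kind, loc_count


if __name__ == "__main__":
    # Assumes a sitemap.xml file already exists in the current directory.
    with open("sitemap.xml", "rb") as handle:
        print(describe_sitemap(handle.read()))
```

If a file meant as an index reports 'urlset', the other sitemaps it lists would be read as ordinary page URLs.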
mickisdaddy
06-13-2005, 12:41 PM
I have 75,000 total between the four sitemaps
PhilC
06-13-2005, 01:38 PM
Ok. It was the wording that made it mean something different:-
Each of the four contains probably 75,000 pages total
benwalsh
06-14-2005, 08:17 PM
the sitemap.xml that contains the location of 4 other sitemaps
mickisdaddy, can you give me a sample or link to the xml file you are outputting to google
JohnHammer
06-14-2005, 08:49 PM
Here's a new, free Google Sitemap creator:
http://www.sitemapxml.com/
A major shortcoming of the other free tools out there is they don't find dynamically generated pages very well. That would be such as blogs, forums etc.
This one does. Only downside is the free version has a 200 page limit, but it also gives you an html version of your sitemap in addition to the Google XML compliant version.
Worth a try.
JH
Andy1969
06-16-2005, 07:12 AM
Any word of Yahoo and MSN accepting sitemaps.xml files?
Thanks
JohnHammer
06-16-2005, 11:55 AM
I have not heard of any, but in this field / market, I believe they are not far behind on this one.
PhilC
06-16-2005, 12:00 PM
If that's the case, it is to be hoped that all the major engines will adopt the same standard, even if it means Google modifying their system. Yahoo! adopted Google's "rel=nofollow" initiative, even though they preferred to do it differently. What the world really doesn't want is a "not invented here" attitude to be adopted, and it could easily happen.
Andy1969
06-16-2005, 12:04 PM
Doesn't seem like Danny is up to date yet ;)
http://forums.searchenginewatch.com/sitemap.xml :rolleyes:
dannysullivan
06-16-2005, 02:23 PM
http://blog.searchenginewatch.com/blog/sitemaps.xml :)
It's easier for me to do things directly on the blog. The forum and the main site, I have to get the developers involved. I did ask them to look into it back when it launched, but they get busy. Plus, we get indexed pretty well, so they probably aren't treating it as a priority.
rustybrick
06-16-2005, 06:17 PM
Has anyone seen any benefit from implementing Google sitemaps on sites that were already fully indexed?
PhilC
06-16-2005, 06:20 PM
It would be interesting if anyone said yes because there isn't supposed to be any benefit.
rustybrick
06-16-2005, 06:22 PM
Right, but regarding a "trust" factor of a site and how that might play in this new update. :rolleyes:
PhilC
06-16-2005, 06:28 PM
There's no "trust" factor that I'm aware of. When you get down to it, the only difference between the new system and normal sitemaps is that the new system swallows enormous files with an enormous number of URLs on them, and the normal sitemaps are limited to around 100k filesize. Other than that, there isn't really anything different.
rustybrick
06-16-2005, 06:30 PM
Those are my thoughts, but I have not tested to prove that there is or is not a trust factor.
JohnHammer
06-16-2005, 06:33 PM
If there is a benefit for fully indexed sites, it might be for new pages. New pages might get spidered more frequently. While Google indicates a frequency spec in the xml, it's unknown yet, as far as I know, whether there's much value in that (in other words, does google really pay attention to that field).
From what I see and know, for whatever reason, many dynamically created pages, i.e. from blogs, forums etc. don't get the spider coverage / indexing depth that one might think. Plus, if you have a really large site with lots of new forum topics, threads etc., how could you possibly know that all pages are indexed?
Seems to me to be well worth the time to do a sitemap just in case...
JH
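As a rough illustration of the frequency field mentioned above, here is a minimal Python sketch that writes url entries with optional lastmod and changefreq hints. The namespace string and the example.com pages are assumptions, and nothing here says how much weight Google actually gives these fields.

```python
import xml.etree.ElementTree as ET

# Assumption: namespace of the Google Sitemap protocol as published in 2005.
SITEMAP_NS = "http://www.google.com/schemas/sitemap/0.84"


def build_urlset(pages):
    """pages: iterable of (loc, lastmod, changefreq) tuples; returns an XML string."""
    root = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod, changefreq in pages:
        url = ET.SubElement(root, "url")
        ET.SubElement(url, "loc").text = loc
        if lastmod:  # optional date of last change, e.g. "2005-06-16"
            ET.SubElement(url, "lastmod").text = lastmod
        if changefreq:  # optional hint such as "hourly", "daily" or "monthly"
            ET.SubElement(url, "changefreq").text = changefreq
    return ET.tostring(root, encoding="unicode")


# Hypothetical pages: a busy forum topic and a page that rarely changes.
example_xml = build_urlset([
    ("http://www.example.com/forum/new-topic", "2005-06-16", "hourly"),
    ("http://www.example.com/about", None, "monthly"),
])
```

New forum threads would get a recent lastmod and a short changefreq, which is the case where a fully indexed site might plausibly see a difference.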
PhilC
06-16-2005, 06:49 PM
I would be very interested to know if there is a benefit for dynamic pages that spiders can't normally see - those that are hidden behind forms (user selections, etc.). As far as we know, they will be treated as orphan pages and not indexed, or maybe they will make it into the Supplemental index and show up in the serps once in a while. I've seen nothing to suggest that they will be given any credit in lieu of IBLs, although the URLs on the Sitemaps page itself could be considered as links to them.
Of course, they won't be picked up by any Sitemaps generator, but website owners could add them by hand, and I'd be very interested to know their fate in this new system. They could be the type of pages that would actually benefit from it.
benwalsh
06-16-2005, 07:17 PM
I have over 100,000 forum topics and only a few hundred are presently found on Google.
I have sitemaps to submit today that list all topics, and am hoping to see results soon. This should make a good test?
This Google (http://www.google.com/search?q=+site:http://www.tubal-reversal.net/ubb/ultimatebb.php/topic/&hl=en&lr=&as_qdr=all&filter=0) search produces 199 results. I am hoping that by submitting sitemaps (example (http://www.tubal-reversal.net/site-map-forum1.xml)) that list all topics, Google will add them to its index.
naphets66
06-16-2005, 08:43 PM
Just submitted a sitemap for one of my sites to Google. The site is dynamic with mod_rewrite. Knowing the structure, I created the map from my database in like 15 seconds. The site has around 8000 pages.
I noticed a lot of you mentioning the sitemap filename as sitemap.xml or sitemap.xml.gz. Mine is named sitemap.gz like it shows in google sitemaps help. I submitted it like this also. Is the filename required to be a certain format or does it matter? Google help seems kind of vague. I just wanted to know the varied formats used by people in the forum and their results.
Thanks
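For a site where the URL structure is known, building the map from the database and compressing it can be sketched roughly as below. This is only an illustration: the item IDs stand in for a database query, the example.com URL pattern and the namespace string are assumptions, and the function names are made up.

```python
import gzip
import xml.etree.ElementTree as ET

# Assumption: namespace of the Google Sitemap protocol as published in 2005.
SITEMAP_NS = "http://www.google.com/schemas/sitemap/0.84"
XML_DECLARATION = b'<?xml version="1.0" encoding="UTF-8"?>\n'


def sitemap_bytes(urls):
    """Build an uncompressed sitemap; ElementTree escapes characters like '&'."""
    root = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc in urls:
        ET.SubElement(ET.SubElement(root, "url"), "loc").text = loc
    return XML_DECLARATION + ET.tostring(root, encoding="utf-8")


def gzip_sitemap(urls):
    """Return the gzip-compressed sitemap, ready to be written to disk."""
    return gzip.compress(sitemap_bytes(urls))


# Hypothetical rows: in a real script these IDs would come from a database query.
item_ids = range(1, 8001)
urls = ["http://www.example.com/item/%d" % i for i in item_ids]
raw = sitemap_bytes(urls)
compressed = gzip_sitemap(urls)
```

The compressed bytes can be saved under whatever name the site uses, for example with open("sitemap.gz", "wb").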
dazzlindonna
06-16-2005, 10:00 PM
Is the filename required to be a certain format or does it matter?
The file name doesn't matter.
Andy1969
06-17-2005, 05:32 AM
Only messing Danny ;)
On a different note, though: could this sitemap.xml be the end for mod_rewrite software, do you think?
benwalsh
06-17-2005, 05:56 AM
Is sitemap.gz a typo by Google? I read elsewhere that sitemap.xml and sitemap.xml.gz are the correct extensions.
I would use .xml for files under 150 KB and compress to .xml.gz for larger files.
BTW, might someone critique http://www.tubal-reversal.net/site-map-forum1.xml ?
PhilC
06-17-2005, 08:03 AM
On a different note though this sitemap.xml could be the end for mod_rewrite software do you think?
No. mod_rewrite does a lot more stuff than what we often use it for, but it doesn't even spell the end for our main use of it. Sitemaps is a system of providing Google with a list of a site's URLs, and that's all. It isn't a system of getting those URLs into the index. Google say that the URLs are not guaranteed to be crawled or indexed, so, if there was a reason to use mod_rewrite before, it hasn't changed with Sitemaps.
Google will spider Sitemaps files if you tell them about it, but they don't always spider ordinary sitemap pages. Apart from that, and the fact that Sitemaps files can be huge, everything is as it was before - nothing has changed.
ThouShaltSeo
06-17-2005, 06:38 PM
I have to pay someone to do this for me. Should I wait and see how it settles, or do it now and pay again later to have it modified ;)?
what do you think?
Web Design Pros
06-17-2005, 07:19 PM
There is a new version written in Java that should be able to run on every OS including:
Linux
Unix
Windows
MacOS
Hi all,
We've made some more improvements to our
Google Sitemap Generator Tool (http://www.web-design-pros.ca/forum/viewforum.php?f=12)
Please register to use and provide us your feedback:
The new version has some user interface improvements.
It also creates google sitemaps for sites with just over 10,000 pages.
softplus
06-19-2005, 09:11 AM
Isn't it amazing how fast new products / tools show up after someone like Google offers a new service? Let's hope they keep adding things, this is fun :)
I also made a small windows-based Sitemap-Generator that crawls websites and generates the sitemap.xml files. In addition to the normal crawling, it can import log files (or URL lists) and can crawl Google site:-queries (to get whatever Google has already). It has many possibilities to strip parameters or parts of URLs (automatically get rid of Session-IDs, etc.). You can manage as many sites as you want (easily click between sites and versions of sites to try small changes), so far there is also no URL-limit.
Let me know what you think ;).
John
PS almost forgot the link: http://johannesmueller.com/gs/
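To give an idea of what stripping Session-IDs from crawled URLs can involve, here is a small Python sketch. It is not how the tool above does it. The parameter names in SESSION_PARAMS and the example.com URLs are assumptions chosen for illustration.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical session parameter names; adjust to whatever the site really uses.
SESSION_PARAMS = {"sid", "phpsessid", "sessionid"}


def strip_session_params(url, blocked=SESSION_PARAMS):
    """Drop session-style query parameters (and any fragment) from a URL."""
    parts = urlsplit(url)
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key.lower() not in blocked]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def clean_url_list(urls):
    """Strip session parameters, then remove duplicates while keeping order."""
    return list(dict.fromkeys(strip_session_params(u) for u in urls))


crawled = [
    "http://www.example.com/viewtopic.php?t=42&sid=abc123",
    "http://www.example.com/viewtopic.php?t=42&sid=def456",
    "http://www.example.com/index.php?PHPSESSID=xyz",
]
cleaned = clean_url_list(crawled)
```

Without this step, the same page would appear in the sitemap once per session ID that the crawler happened to pick up.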
Web Design Pros
06-21-2005, 10:27 PM
We've updated our Google sitemap tool again.
Now it creates google sitemaps for sites with over 50,000 pages.
You can also save and reopen your projects.
We've also sped it up a bit.
Runs on all operating systems.
Gives you the opportunity to edit your url list before creating the sitemap.xml file.
GoogleGuy,
Do you have any comments for the members of this forum on the striking similarities between Google Sitemaps and ROR (Resources of a Resource, launched in late 2004 - http://www.rorweb.com)?
Thank you,
Dom
PhilC
08-08-2005, 12:18 PM
Sorry to bump this after so long, but I have a question about submitting a sitemap...
I'd originally thought that a sitemap URL could be pinged to Google using the URL they provide (www.google.com/webmasters/sitemaps/ping?sitemap=sitemap_url), but I've just been checking on Google's site and I see no way of doing it. The only thing I can find about submitting a sitemap is that it must be done by having an account with Google. After that, pinging is a method of resubmitting. Was I originally mistaken, or has it changed since then?
Has anyone tried pinging as the initial submission, and does it work?
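For reference, building the ping address is just a matter of URL-encoding the sitemap location and putting it in the sitemap parameter. A minimal Python sketch follows. The http scheme and the example.com sitemap address are assumptions, and the sketch only constructs the URL. It does not send the request or show whether a ping works as an initial submission.

```python
from urllib.parse import quote

# Endpoint as quoted in the post above; the http:// scheme is an assumption.
PING_ENDPOINT = "http://www.google.com/webmasters/sitemaps/ping"


def build_ping_url(sitemap_url):
    """Return the ping URL with the sitemap location fully URL-encoded."""
    return PING_ENDPOINT + "?sitemap=" + quote(sitemap_url, safe="")


# Hypothetical sitemap location.
ping_url = build_ping_url("http://www.example.com/sitemap.xml")
```

Fetching ping_url with any HTTP client is then the whole "ping".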
SebastianX
08-08-2005, 12:40 PM
Change announced here:
http://groups-beta.google.com/group/google-sitemaps/browse_thread/thread/52fd8e610482518b/c3cf0a480d693d65?lnk=st&q=&rnum=2#c3cf0a480d693d65
They don't say you cannot initially submit via ping, so it still may work.
Google Employee Jul 7, 3:20 am
The Google Sitemaps team is pleased that so many of you have created
tools that generate Sitemaps based on the Google Sitemap protocol. As
we go through this beta period, we are continually looking for ways to
improve the process. One area we have been looking at is the submission
process. The first time you submit a Sitemap to Google (whether you
created it using the Sitemap Generator, a third-party tool, or manually
created it), please do so through your Google Sitemaps Account. This
lets us provide you with useful tracking and statistical information.
The My Sitemaps page lets you know if there are problems with your
Sitemap or with any of the URLs listed in it.
When you make changes to your Sitemap, you can resubmit it using your
Google Sitemaps Account or you can resubmit it using an HTTP request
(ping).
If you have created a tool based on the Google Sitemap protocol, please
tell your users to initially submit their Sitemaps using a Google
Sitemaps Account. We are modifying our Sitemap Generator to provide
this instruction to users as well.
We have also updated our FAQ and Sitemap Generator instructions to make
this clear. We have also submission information here:
http://www.google.com/webmasters/sitemaps/docs/en/submit.html
If you have any questions about the Sitemap submission process, please
post them here. We frequently update our FAQ to provide answers to any
questions you might have.
Thanks!
PhilC
08-08-2005, 12:49 PM
Many thanks!
So I wasn't mistaken originally. That casts a shadow over the whole thing, to my way of thinking. I don't know what sort of shadow, but I'm not at all keen on an absolute requirement to register. It seems to devalue an idea that started out as an open system, like the robots.txt protocol.
But, as you said, pinging may still work without registering.
softplus
08-10-2005, 06:27 PM
Phil, YOU don't have to register, any old joe can do that for you. I think the idea behind the registering is not finding out who's behind a site (they can get to that anyway), but to have someone to send an error report to should something not work with your file. At the moment - when a site is registered, you can check the status, see when it was last checked and make sure it has the "OK" you want to see. If it doesn't, it'll show you +/- what you need to fix first.
If you don't want to register, pool together with a bunch of others, and open a "group account" for all your sites. Heck, throw in a bunch of sites that don't belong to any of you for that matter - it doesn't matter if a site is registered more than once. Perhaps someone will open up an anonymous sitemap registration tool :-).
Best regards,
John
PhilC
08-10-2005, 06:33 PM
Hi John. I understand what you are saying. It's just that the absolute need to register now has the "feel" of Google sucking people in. Google is just a search engine but they do a number of things that attempt to interfere with the whole of the web. It's starting to have a feel of "bad" about it.
softplus
08-10-2005, 07:00 PM
Phil, I know where you're coming from -- just lately I had that bad feeling that I was being watched -- Google logging all my searches, of course.
Everyone, imagine for a moment that it's not Google but Microsoft that was doing these things. How would the users / the media respond? "Personalisation" = marketingeze for "Targeting ads to best squeeze the rest out of your wallet" (perhaps a good thing for Adsense-users). As much as I love Google, it's slowly reaching the point where it's a bit too much. Time to open up a bunch of accounts for each service separately. They're really "controlling" a lot of the web with everything you can get into with your Google account. Oh well, that's another topic ;) .
So what about it, an anonymous Google Sitemap-injector? Should be easy enough to do.
John
PhilC
08-10-2005, 07:20 PM
I haven't tried it yet, but I'll be pinging a sitemap to see if it still works.
intensity
12-30-2006, 01:26 PM
This is the easiest way of getting your blogs indexed. Simply provide a feed with all your posts. You don't have to trust Google with all their apps, only with the ones that benefit you more than they benefit Google.
Techyolk
01-03-2007, 10:58 AM
You're kidding me!
I think it might be important to just clarify that this does not help with ranking a web page. It just helps get your page indexed. Correct?
What are the other benefits?
You are right, Mr. Rustybrick. I also found that Google Sitemaps helps with page indexing but not with ranking.
This is good for getting the site indexed.
thenetspiders
01-13-2011, 04:18 AM
It may be helpful for website owners.
sannyhenry
01-22-2011, 06:03 AM
I tried it but did not get good results. Has anyone had success with it?