#1
google site map xml file validation issues with URLs
Overview of Site Indexing issues
Hello there! I would appreciate any help you could give me on the following issues with a client site: www.kansassampler.com

Background: dynamic site, over 8,000 URLs. Unfortunately, until a few weeks ago, their system uploaded a new set of URLs each day. In other words, for the majority of the site's product and category/subcategory pages, the URL query string would change every day. Why this was done, I have no clue. As of last week, they have resolved the issue and fixed the URLs so they don't change.

Example of the problem they were/are having: when you go into the Google index and find one of their pages, one of two things will happen when you click on the listing: (1) you get an error page saying the 'id #1234 doesn't exist', or (2) you are taken to a page that has nothing to do with the cached page.

For example, run the query site:www.kansassampler.com and view the listing that has the title "Beverages" (on the first page, listing #9 as of today, 8/10/2007). If you view the URL you can see right away there is an issue: the URL is for some category having to do with gifts, books and CDs (http://www.kansassampler.com/shopdisplayproducts.asp?id=205&cat=Gifts+%2FBooks+ %26+CDs). If you click on the 'cached' link for this listing, Google indicates the following: "This is Google's cache of http://www.kansassampler.com/shopdis...ooks+ %26+CDs as retrieved on Jul 25, 2007 01:01:04 GMT." But the cached content looks correct, showing various beverages. Other URLs seem hit or miss. Some look okay, others don't.

The client is asking how to get rid of the bad URLs in the index. I told them the following option: (1) generate a 404 error page for any URLs that don't work any more, or a 301 redirect. Their problem with that is that they don't have any record of what the URLs were before they fixed them!

Right now, if you check the headers when you click on the following old/dead URL (www.kansassampler.com/shopexd.asp?id=6044), you will see that they are doing a 302!

How we have tried to fix the problem so far:
#1: fixed the URLs so they don't change!
#2: created an XML site map file and submitted it to the Google and Yahoo webmaster accounts (having an issue with category XML file validation; see below please)
#3: will wait to see if all are working, then start on SEO of the pages (which are a mess, to be sure)

Google site map XML validation issues with category URLs

I was able to quickly create an XML site map for the product URLs using Google's Python tool. Over 6,000 product URLs in one XML file. The product URL structure was fairly simple: http://www.kansassampler.com/shopexd.asp?id=142219.

However, the category URLs have multiple query parameters and even some bizarre ASCII characters, which I believe keep causing the Python script, and any XML validation I do on the Python script's configuration XML file, to throw an error, something to the effect of:

Error: Expected ; after entity name, but got = in unnamed entity at line 61 char 78.

I have attached the Python script config file to this posting. For those who are interested, line 61 contains the one and only category URL I keep testing:

<url href="http://www.kansassampler.com/shopdisplaycategories.asp?id=2&cat=KU+Collection" />

When I remove the '+Collection' part of the URL, I still get the same error:

<url href="http://www.kansassampler.com/shopdisplaycategories.asp?id=2&cat=KU" />

When I really truncate the URL to simply ?id=2, then everything works. I can't believe I can't generate valid XML with a complex URL???

What I would love your help with:
Thanks in advance for your help!!!
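The parser error quoted in this post can be reproduced outside the sitemap tool. The following is a minimal sketch, assuming Python's standard `xml.etree.ElementTree` parser rejects the line for the same reason as the parser the script uses (the exact error wording will differ). The helper names `is_well_formed` and `parsed_href` are hypothetical and only serve the illustration.

```python
import xml.etree.ElementTree as ET

# The config line from the post, with a raw "&" inside the attribute value.
RAW = ('<url href="http://www.kansassampler.com/'
       'shopdisplaycategories.asp?id=2&cat=KU+Collection" />')

# The same line with "&" written as the XML entity "&amp;".
ESCAPED = RAW.replace("&", "&amp;")


def is_well_formed(fragment: str) -> bool:
    """Return True if the fragment parses as XML, False otherwise."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


def parsed_href(fragment: str) -> str:
    """Return the href attribute as the parser sees it (entities decoded)."""
    return ET.fromstring(fragment).attrib["href"]


if __name__ == "__main__":
    print("raw line parses:    ", is_well_formed(RAW))
    print("escaped line parses:", is_well_formed(ESCAPED))
    print("href after parsing: ", parsed_href(ESCAPED))
```

A raw `&` starts an entity reference in XML, so the parser expects a name followed by `;` and fails when it meets `=` instead. That would also explain why the URL truncated to `?id=2` works: it no longer contains an `&`.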
#2
http://www.kansassampler.com/shopdis...ooks+ %26+CDs
The URL above is a live page, so that is the first problem. Check in Google Webmaster Tools to see what pages link to that URL, both from the client's site (internal) and from external links. I'm picking it up as an internal link from your client's site.

301 redirects need to be used and not 302 redirects.

As far as the validation issue, replace the "&" symbol with "&amp;" in the XML code and see if that works. To test, you can do this with a "find/replace" in the hard copy text version. Also, XML should be UTF-8. Getting rid of those %2s wouldn't be a bad idea either, i.e.:

<url>
<loc>http://www.kansassampler.com/shopdisplayproducts.asp?id=207&amp;cat=Coffee+and+ Tea</loc>
<priority>0.5</priority>
</url>
<url>
<loc>http://www.kansassampler.com/shopdisplayproducts.asp?id=208&amp;cat=Cookies</loc>
<priority>0.5</priority>
</url>

Your "errors" seem to come from: http://www.kansassampler.com/shopdis...asp?search=yes

Please keep us posted! XML can be a pain in the rear!

Last edited by beu : 08-10-2007 at 07:29 PM.
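The find/replace suggested in this post can also be done while generating the file. Below is a minimal sketch, assuming the sitemaps.org 0.9 namespace and a fixed priority of 0.5 (neither is stated in the thread); `build_sitemap` is a hypothetical helper, not part of Google's Python tool.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

# Assumption: the sitemaps.org 0.9 schema; the thread does not name a schema.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls, priority="0.5"):
    """Return a sitemap document as UTF-8 encoded bytes."""
    lines = ['<?xml version="1.0" encoding="UTF-8"?>',
             f'<urlset xmlns="{SITEMAP_NS}">']
    for url in urls:
        lines.append("  <url>")
        # escape() turns "&", "<" and ">" into "&amp;", "&lt;" and "&gt;".
        lines.append(f"    <loc>{escape(url)}</loc>")
        lines.append(f"    <priority>{priority}</priority>")
        lines.append("  </url>")
    lines.append("</urlset>")
    return "\n".join(lines).encode("utf-8")


def read_locs(document: bytes):
    """Parse a sitemap and return the URLs as a consumer would see them."""
    root = ET.fromstring(document)
    return [loc.text for loc in root.iter(f"{{{SITEMAP_NS}}}loc")]


if __name__ == "__main__":
    doc = build_sitemap([
        "http://www.kansassampler.com/shopdisplayproducts.asp?id=208&cat=Cookies",
    ])
    print(doc.decode("utf-8"))
    print(read_locs(doc))
```

Parsing the output back with `read_locs` is a cheap check that the file is well-formed and that the escaped `&amp;` decodes to the original URL.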
#3
Re: google site map xml file validation issues with URLs

"Check in Google Webmaster Tools to see what pages link to that URL from both the client's site (internal) as well as external links..."

Well, when I go to Webmaster Tools, unfortunately it shows zippo for links, either internal or external. I know the site has internal links: it has over 7,000 pages! Is that because Google is sifting through everything since I just submitted the product URL XML file? Also, why do you suggest this? What is the relevance of that to the issue of getting rid of these URLs from the index? I assume I need to eliminate those links or Google will continue to put them in the index, right?

"I'm picking it up as an internal link from your client's site."

Would you mind providing the query command you are using to find that?

"301 redirects need to be used and not 302 redirects."

Yeah, I figured that, but how do you recommend they do that if they don't have a record of all the 'bad' URLs they have? I mean, clicking manually on each listing in the index and doing it that way would be a bit slow....

"As far as the validation issue, replace the "&" symbol with "&amp;" in the xml code and see if that works. To test, you can do this with a "find/replace" in the hard copy text version. Also xml should be utf-8. Getting rid of those %2s wouldn't be a bad idea either..."

If I understand you, you say to replace those characters in the XML file with their actual ASCII representations, right? You are not talking about rewriting the URLs, are you? They probably won't want to do that...

"Your "errors" seem to come from: http://www.kansassampler.com/shopdis...asp?search=yes"

I am sorry, what do you mean here?

So how do I get rid of all these old URLs that don't work? (1) 404, (2) 301 redirect to a page that doesn't exist, or (3) manually entering each URL in the Google Webmaster URL removal tool (uggghhh, 7,000 URLs to remove!)??

Hey, thanks for your help!
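The question about "rewriting the URLs" mixes two separate layers, which a short sketch may help separate. It assumes the category value from the first post with the stray space from forum line wrapping removed (an assumption), and uses only Python standard library calls. The `%2F` and `%26` sequences are URL percent-encoding and belong to the URL itself; `&amp;` is XML escaping and exists only inside the XML file.

```python
from urllib.parse import unquote_plus
from xml.sax.saxutils import escape, unescape

# Assumption: the query value from the thread, minus the wrapped-in space.
CAT_VALUE = "Gifts+%2FBooks+%26+CDs"
URL = "http://www.kansassampler.com/shopdisplayproducts.asp?id=205&cat=" + CAT_VALUE


def decode_query_value(value: str) -> str:
    """Undo URL percent-encoding: '+' is a space, %2F is '/', %26 is '&'."""
    return unquote_plus(value)


def xml_text_for(url: str) -> str:
    """Escape a URL for use as XML text; percent-encoding is left untouched."""
    return escape(url)


if __name__ == "__main__":
    print(decode_query_value(CAT_VALUE))    # what the server reads as "cat"
    print(xml_text_for(URL))                # what goes into the sitemap file
    print(unescape(xml_text_for(URL)) == URL)
```

Escaping for XML changes only how the URL is written in the file. A parser hands back the original URL unchanged, so nothing on the site itself has to be rewritten for the validation fix.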
#4
Hey, no problem and I hope this is going to work for you!
Here is how to encode an XML sitemap so that Google is able to read the URLs: http://www.google.com/support/webmas...y?answer=35653

When you have a list, you should use it to help remove content from Google's index by using one of the following steps per URL or group of URLs:
1. Ensure requests for the page return an HTTP status code of either 404 or 410.
2. Block the pages using a robots.txt file.
3. Block the pages using a meta noindex tag.

As soon as one of the above has been implemented, you can submit a request for "New Removal" via Google Webmaster Tools.

As for using a 301 or 302 redirect, I'm not sure how you plan to use either one. I was simply saying that a 302 redirect is a temporary redirect. In many cases, URLs with 302 redirects remain in Google's index. A 301 would be one way to perhaps help flush out old URLs. Either way, don't use a redirect or 404 on URLs disallowed via robots.txt.

Last edited by beu : 08-11-2007 at 05:07 PM.
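The three removal conditions listed in this post can be checked per URL before a removal request is filed. This is a minimal sketch under stated assumptions: the robots.txt content and the example.com URLs are hypothetical, the status code and noindex flag are passed in rather than fetched, and the helper names are made up for the illustration. Only `urllib.robotparser` from the Python standard library is used.

```python
from urllib.robotparser import RobotFileParser

# Status codes the post names as qualifying for removal.
REMOVABLE_STATUS = {404, 410}

# Hypothetical robots.txt; the real site's file is not shown in the thread.
ROBOTS_TXT = """\
User-agent: *
Disallow: /old-catalog/
"""


def blocked_by_robots(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Return True if robots_txt disallows the given URL for the agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)


def removal_ready(status: int, robots_blocked: bool, has_noindex: bool) -> bool:
    """True if at least one of the three conditions from the post holds."""
    return status in REMOVABLE_STATUS or robots_blocked or has_noindex


def describe_redirect(status: int) -> str:
    """Label the two redirect codes discussed in the thread."""
    return {301: "permanent redirect", 302: "temporary redirect"}.get(
        status, "not a 301/302 redirect")


if __name__ == "__main__":
    url = "http://www.example.com/old-catalog/item.asp?id=1"  # hypothetical URL
    print(blocked_by_robots(ROBOTS_TXT, url))
    print(removal_ready(302, False, False), describe_redirect(302))
```

A URL answering with a 302, like the dead product URL mentioned in the first post, meets none of the three conditions on its own, which fits the remark that such URLs often stay in the index.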