#1  
Old 08-10-2007
outofbounds
 
google site map xml file validation issues with URLs

Overview of Site Indexing issues
Hello there! I would appreciate any help you could give me on the following issues with a client site: www.kansassampler.com

Background:

Dynamic site, over 8,000 URLs. Unfortunately, until a few weeks ago, their system uploaded a new set of URLs each day. In other words, for the majority of the site’s product and category/subcategory pages, the URL query string would change every day. Why this was done, I have no clue. As of last week, they have resolved the issue and fixed the URLs so they don’t change.
Example of the problem they were/are having: when you find one of their pages in the Google index, one of two things happens when you click on the listing: (1) you get an error page saying ‘id #1234 doesn’t exist’, or (2) you are taken to a page that has nothing to do with the cached page.
For example, run the query site:www.kansassampler.com and view the listing titled “Beverages” (listing #9 on the first page as of today, 8/10/2007). If you view the URL you can see right away there is an issue: the URL is for a category having to do with gifts, books and CDs (http://www.kansassampler.com/shopdisplayproducts.asp?id=205&cat=Gifts+%2FBooks+%26+CDs). If you click on the ‘cached’ link for this listing, Google indicates the following: “This is Google's cache of http://www.kansassampler.com/shopdis...ooks+ %26+CDs as retrieved on Jul 25, 2007 01:01:04 GMT.” But the cached content looks correct, showing various beverages.
Other URLs seem hit or miss. Some look okay, others don’t.

The client is asking how to get rid of the bad URLs in the index. I gave them the following options: (1) return a 404 error page for any URLs that no longer work, or (2) use a 301 redirect. Their problem with that is that they don’t have any record of what the URLs were before they fixed them! Right now, if you check the headers when you request the following old/dead URL (www.kansassampler.com/shopexd.asp?id=6044), you will see that they are doing a 302!
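One way to see what an old URL currently returns is to request it without following redirects and read the status code. The Python sketch below is only illustrative: the helper names `fetch_status` and `describe_status` are made up for this example, and it assumes the server answers HEAD requests (some do not, in which case a GET would be needed).

```python
import http.client
from urllib.parse import urlsplit


def fetch_status(url, timeout=10):
    """Return (status_code, Location header) for url without following redirects."""
    if "://" not in url:
        url = "http://" + url  # assumption: URLs given without a scheme are plain HTTP
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        # http.client never follows redirects, so a 301/302 is reported as-is.
        conn.request("HEAD", path)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def describe_status(status):
    """Map a status code to a short label relevant to index cleanup."""
    if status == 301:
        return "permanent redirect"
    if status == 302:
        return "temporary redirect"
    if status in (404, 410):
        return "not found / gone"
    if status == 200:
        return "live page"
    return "other"
```

Calling `fetch_status("www.kansassampler.com/shopexd.asp?id=6044")` and passing the returned code to `describe_status` would show whether the old URL is still answering with a temporary redirect.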

How we have tried to fix the problem so far:
#1 – fixed the URLs so they don’t change!
#2 – created an xml site map file and submitted to google and yahoo webmaster accounts (having an issue with category xml file validation – see below please)
#3 – will wait to see if all are working then start on seo of the pages (which are a mess to be sure)

Google site map xml validation issues with category URLs
I was able to quickly create an xml site map for the product urls using google’s python tool. Over 6,000 product urls in one xml file. The product url structure was fairly simple: http://www.kansassampler.com/shopexd.asp?id=142219.
However, the category URLs have multiple query parameters and even some bizarre ASCII characters, which I believe keep causing the python script, and any XML validation I run on the script's XML configuration file, to throw an error, something to the effect of: Error: Expected ; after entity name, but got = in unnamed entity at line 61 char 78.

I have attached the python script config file to this post. For those who are interested, line 61 contains the one and only category URL I keep testing:
<url href="http://www.kansassampler.com/shopdisplaycategories.asp?id=2&cat=KU+Collection" />
When I remove the ‘+Collection’ part of the URL, I still get the same error (<url href="http://www.kansassampler.com/shopdisplaycategories.asp?id=2&cat=KU" />).
When I truncate the URL to simply ?id=2, everything works.
I can’t believe I can’t generate valid XML with a complex URL???

What I would love your help with:
  1. Am I taking the right steps to fix the URL indexing problem? I fixed the URLs, then submitted XML site maps. Any other thoughts?
  2. How do I remove the old URLs that are still in the index? Do we just wait? Is my idea of generating 404 errors correct?
Thanks in advance for your help!!!
Attached Files
File Type: txt categoryXMLConfigFile.txt (3.6 KB, 106 views)
  #2  
Old 08-10-2007
beu
 
Re: google site map xml file validation issues with URLs

http://www.kansassampler.com/shopdis...ooks+ %26+CDs

The URL above is a live page, so that is the first problem.

Check in Google Webmaster Tools to see what pages link to that URL from both the client's site (internal) as well as external links. I'm picking it up as an internal link from your client's site.

301 redirects need to be used and not 302 redirects.

As for the validation issue, replace the "&" symbol with "&amp;" in the XML code and see if that works. To test, you can do this with a find/replace in a text copy of the file. Also, the XML should be UTF-8. Getting rid of those %2s wouldn't be a bad idea either.

e.g.
<url>
<loc>http://www.kansassampler.com/shopdisplayproducts.asp?id=207&amp;cat=Coffee+and+Tea</loc>
<priority>0.5</priority>
</url>
<url>
<loc>http://www.kansassampler.com/shopdisplayproducts.asp?id=208&amp;cat=Cookies</loc>
<priority>0.5</priority>
</url>
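The same replacement can be done programmatically instead of by find/replace. A bare "&" in XML starts an entity reference, which is why the parser reports "=" where it expected ";". Below is a minimal Python sketch, assuming the standard library's `xml.sax.saxutils` helpers are acceptable for the job; `is_well_formed` is a hypothetical helper that only exists to show the before and after.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# The category URL from line 61 of the config file in the first post.
raw_url = "http://www.kansassampler.com/shopdisplaycategories.asp?id=2&cat=KU+Collection"


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Unescaped "&" inside the attribute value: the parser rejects this.
broken = '<url href="%s" />' % raw_url

# quoteattr() escapes "&" as "&amp;" and adds the surrounding quotes.
fixed = "<url href=%s />" % quoteattr(raw_url)

# escape() does the same for element text, as in a sitemap <loc> entry.
loc = "<loc>%s</loc>" % escape(raw_url)

print(is_well_formed(broken))
print(is_well_formed(fixed))
print(loc)
```

Only the XML text changes; once parsed, the attribute value is the original URL again, so the site's real URLs are untouched.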


Your "errors" seem to come from: http://www.kansassampler.com/shopdis...asp?search=yes

Please keep us posted! XML can be a pain in the rear!

Last edited by beu : 08-10-2007 at 07:29 PM.
  #3  
Old 08-11-2007
outofbounds
 
Re: google site map xml file validation issues with URLs

"Check in Google Webmaster Tools to see what pages link to that URL from both the client's site (internal) as well as external links..."

well, when i go to the webmaster tools, unfortunately it shows zippo for links, either internal or external. i know the site has internal links - it has over 7,000 pages! is that because google is sifting thru everything since i just submitted the product url xml file? also, why do you suggest this? what is the relevance of that to the issue of getting rid of these urls from the index? i assume i need to eliminate those links or google will continue to put them in the index, right?

"I'm picking it up as an internal link from your client's site."

would you mind providing the query command you are using to find that?

"301 redirects need to be used and not 302 redirects."

yeah, i figured that, but how do you recommend they do that if they don't have a record of all the 'bad' urls? i mean, clicking manually on each listing in the index and doing it that way would be a bit slow....

"As far as the validation issue, replace the "&" symbol with "&amp;" in the xml code and see if that works. To test, you can do this with a "find/replace" in the hard copy text version. Also xml should be utf-8. Getting rid of those %2s wouldn't be a bad idea either..."

if i understand you, you are saying to replace those characters in the xml file with their actual ascii representations, right? you are not talking about rewriting the urls, are you? they probably won't want to do that...

"Your "errors" seem to come from: http://www.kansassampler.com/shopdis...asp?search=yes"

i am sorry. what do you mean here?

so how do i get rid of all these old urls that don't work? (1) 404, (2) 301 redirect to a page that doesn't exist, or (3) manually entering each url in the google webmaster url removal tool (uggghhh. 7,000 urls to remove!)??


hey, thanks for your help!
  #4  
Old 08-11-2007
beu
 
Re: google site map xml file validation issues with URLs

Hey, no problem and I hope this is going to work for you!

Quote:
if i understand you, you are saying to replace those characters in the xml file with their actual ascii representations, right? you are not talking about rewriting the urls, are you? they probably won't want to do that...
- Yes, replace those characters in the code of the XML sitemap only; the URLs on the site itself do not need to change. Chances are this is why you can't see links in your Google Webmaster Tools account.

- Here is how to encode an xml sitemap so that Google is able to read the URLs:
http://www.google.com/support/webmas...y?answer=35653
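For a quick feel of what a correctly encoded sitemap looks like, here is a small Python sketch that lets the standard library do the escaping instead of doing it by hand. It rests on a few assumptions: `build_sitemap` is a made-up helper name, the namespace is the sitemaps.org 0.9 schema (the file may need a different one depending on which schema the sitemap targets), and the `xml_declaration` argument needs Python 3.8 or later.

```python
import xml.etree.ElementTree as ET

# Assumption: the sitemap targets the sitemaps.org 0.9 schema.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls, priority="0.5"):
    """Return a UTF-8 encoded sitemap (bytes) listing the given URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        url_el = ET.SubElement(urlset, "url")
        # ElementTree escapes "&", "<" and ">" in text when serializing.
        ET.SubElement(url_el, "loc").text = url
        ET.SubElement(url_el, "priority").text = priority
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


category_urls = [
    "http://www.kansassampler.com/shopdisplaycategories.asp?id=2&cat=KU+Collection",
]
sitemap_bytes = build_sitemap(category_urls)
print(sitemap_bytes.decode("utf-8"))
```

Because the URLs are passed in unescaped and the serializer writes "&amp;" for each "&", nothing is escaped twice.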

Quote:
also, why do you suggest this? what is the relevance of that to the issue of getting rid of these urls from the index? i assume i need to eliminate those links or google will continue to put them in the index, right?
- As soon as you identify and have a full list of "bad" URLs and/or URLs you want to remove from Google's index you can work on taking actions to remove those links from Google's index. Hopefully you can use Google Webmaster Tools to compile a major portion of that list.

- When you have a list you should use it to help remove content from Google's index by using one of the following steps per URL or group of URLs:
1. Ensure requests for the page return an HTTP status code of either 404 or 410.
2. Block the pages using a robots.txt file.
3. Block the pages using a meta noindex tag.

- As soon as one of the above has been implemented you can submit a request for "New Removal" via Google Webmaster Tools.
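Once a list of bad URLs exists, the three conditions above can be checked per URL before a removal request is submitted. The Python sketch below is a rough illustration only: `blocked_by_robots` and `ready_for_removal` are hypothetical helper names, the robots.txt rule is invented for the example (a real rule would have to match only the dead URLs, not live product pages), and the status code and noindex flag are assumed to have been collected separately.

```python
from urllib.robotparser import RobotFileParser


def blocked_by_robots(robots_txt, url, user_agent="Googlebot"):
    """Return True if the given robots.txt text disallows url for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


def ready_for_removal(status, robots_blocked, has_noindex):
    """True if at least one of the three conditions holds for a URL."""
    return status in (404, 410) or robots_blocked or has_noindex


# Hypothetical robots.txt, for illustration only.
example_robots = "User-agent: *\nDisallow: /shopexd.asp\n"
old_url = "http://www.kansassampler.com/shopexd.asp?id=6044"

print(blocked_by_robots(example_robots, old_url))
print(ready_for_removal(302, False, False))
```

A URL that still answers with a 302 and is neither blocked nor marked noindex fails the check, which matches the situation described for the old product URL.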


Quote:
would you mind providing the query command you are using to find that?
- There is no query, I used a tool to crawl your client's site and then to map your client's site. The tool found the URL in question in your client's site. Also during that process the tool "got hung" on http://www.kansassampler.com/shopdis...asp?search=yes which usually indicates errors.

- As for using a 301 or 302 redirect, I'm not sure how you plan to use either one. I was simply saying that a 302 redirect is a temporary redirect. In many cases, URLs with 302 redirects remain in Google's index. A 301 would be one way to perhaps help flush out old URLs. Either way, don't use a redirect or 404 on URLs disallowed via robots.txt, because Googlebot will not request a disallowed URL and so will never see the redirect or the 404.

Last edited by beu : 08-11-2007 at 05:07 PM.