I am a Yahoo loser...


lsredford
05-16-2005, 05:48 PM
I'm hoping admitting as such will be the first step to recovery. Presently, I (Diesel eBooks, http://www.diesel-ebooks.com) have around 64,000 pages indexed by Google and a paltry 143 by Yahoo. I've received great advice from this board, but any improvement always manifests in my Google results. I've been optimizing pages, pounding the pavement getting links, and trying overall to create a spider-friendly environment, but have failed miserably. For the record, I'm in the Yahoo directory and use their PPC program.

Before you ask, I'm not in DMOZ (sore subject with me). I've given up after eight months of pleading and begging. Even tried to get an editor spot.

Here are two sample URLs:

http://www.diesel-ebooks.com/cgi-bin/csearchpage/400/parent.sql/FIC028000
http://www.diesel-ebooks.com/cgi-bin/item/parent-1588733793

Any and all ideas are welcome and much appreciated.

Marcia
05-17-2005, 12:42 PM
"I'm hoping admitting as such will be the first step to recovery."

Well, let's hope you're on your way to recovery.

First thing, fix this:

http://www.diesel-ebooks.com/robots.txt

robots.txt is the first thing a respectable crawler goes after. There is an expected protocol, and what comes back at that URL looks to me like a duplicate of your homepage. That is not what you want a custom 404 to do, particularly for the robots.txt file.

You might want to check with a higher power (http://www.robotstxt.org/wc/exclusion-admin.html) about this first thing. ;)
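One rough way to tell whether a site is serving real robots.txt rules or an HTML page in their place is to inspect the body that comes back. The sketch below is only a heuristic: the function name `looks_like_robots_txt` and the list of directive names are assumptions made for illustration, not part of any standard tool.

```python
import re

# Directive names commonly seen in robots.txt files (an illustrative list).
DIRECTIVE = re.compile(
    r"^\s*(user-agent|disallow|allow|sitemap|crawl-delay)\s*:", re.IGNORECASE
)


def looks_like_robots_txt(body: str) -> bool:
    """Heuristic: True if the body reads like robots.txt rules, not an HTML page."""
    lowered = body.lower()
    if "<html" in lowered or "<!doctype" in lowered:
        return False
    # Ignore blank lines and comment lines, then require every remaining
    # line to start with a known directive.
    lines = [
        ln for ln in body.splitlines()
        if ln.strip() and not ln.lstrip().startswith("#")
    ]
    # An empty file passes: it is a valid "allow everything" robots.txt.
    return all(DIRECTIVE.match(ln) for ln in lines)
```

Feeding it the text fetched from /robots.txt gives a quick yes or no. A homepage returned in place of the file fails the check.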

lsredford
05-17-2005, 05:06 PM
Thank you for your help, Marcia. Actually, we don't have a robots.txt file so it's just resolving to the home page. I will study your "higher power" link info and try to craft one.

Regards,

Scott

lsredford
05-18-2005, 01:36 PM
Okay, I've been boning up on my robots.txt syntax but I'd like to bounce this off you before I start pasting away. About 8,000 of those 64,000 pages currently indexed are junk. For example, the "bot" follows a link on the book display page to a form to review the book. Nobody cares about a blank form. Also, I have a lot of cart pages indexed where the robot has actually added items to the cart. This is also a waste; however, these pages come from my cgi-bin, which also happens to produce all of my product pages. Is there a way I can stop certain pages from the cgi-bin from being indexed?

For reference these two are samples of product display pages that I DO want indexed:
http://www.diesel-ebooks.com/cgi-bin/item/parent-0425189031
http://www.diesel-ebooks.com/cgi-bin/cbrowsepage/20/FIC022100

These are samples of the review and cart pages that don't need to be indexed:
http://www.diesel-ebooks.com/cgi-bin/writereview.cgi?item=0595764169&pass=0
http://www.diesel-ebooks.com/cgi-bin/Make-a-Store.cgi?item=5551314162

I came across a robots.txt creator via your "higher power" link and created this:

User-agent: *
Disallow: /http://www.diesel-ebooks.com/cgi-bin/writereview.cgi
Disallow: /http://www.diesel-ebooks.com/cgi-bin/Make-a-Store.cgi
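Rules like these can be tried out before uploading by using Python's standard `urllib.robotparser` module. The sketch below compares the rules exactly as posted with a path-only variant, on the assumption that Disallow values are matched against the URL path rather than the full address. The names `AS_POSTED`, `PATH_ONLY`, and `report` are made up for this illustration, and a real crawler may interpret rules differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The rules exactly as posted above.
AS_POSTED = """\
User-agent: *
Disallow: /http://www.diesel-ebooks.com/cgi-bin/writereview.cgi
Disallow: /http://www.diesel-ebooks.com/cgi-bin/Make-a-Store.cgi
"""

# Assumed alternative: the same rules written as paths only.
PATH_ONLY = """\
User-agent: *
Disallow: /cgi-bin/writereview.cgi
Disallow: /cgi-bin/Make-a-Store.cgi
"""

PRODUCT_URL = "http://www.diesel-ebooks.com/cgi-bin/item/parent-0425189031"
REVIEW_URL = "http://www.diesel-ebooks.com/cgi-bin/writereview.cgi?item=0595764169&pass=0"
CART_URL = "http://www.diesel-ebooks.com/cgi-bin/Make-a-Store.cgi?item=5551314162"


def report(rules: str) -> dict:
    """Map each sample URL to True (crawlable) or False (blocked) under the rules."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return {
        url: parser.can_fetch("ExampleBot", url)
        for url in (PRODUCT_URL, REVIEW_URL, CART_URL)
    }
```

Under this parser the rules as posted block nothing, because no path on the site begins with "/http://". The path-only version blocks the review and cart pages and leaves the product page crawlable.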

If not through the robots.txt file, is there a way I can modify my templates so a robot doesn't follow a certain link (e.g., the link to review a book)?

I certainly appreciate your help and have already learned a lot.

Scott

Marcia
05-18-2005, 03:23 PM
Scott, this is beyond a robots.txt issue, though what showed up there does raise a concern about how 404s are being handled, and you definitely don't want a bot pulling up a copy of your homepage when it goes after robots.txt.

This is an issue related to a problem with it being a dynamic site, not a Yahoo search or algo issue, so I've moved it over here to this forum and hopefully someone will come along who is knowledgeable about the more technical aspects of how to handle a dynamic site like yours.

mcanerin
05-18-2005, 03:53 PM
In answer to your first issue, I have a Robots.txt File Generator (http://www.mcanerin.com/search-engine/robots-txt.htm) that may be useful in conjunction with the "higher power" mentioned previously. :)

Regarding the more substantial issue of a dynamic website, you have several options:

1. Sign up for Yahoo's Search Submit service (http://searchmarketing.yahoo.com/srchsb/index.php) which kind of forces a spider to visit your site, and to try hard to see everything, rather than giving up if it has an issue.

Additionally, I once had a very tough Yahoo issue regarding bogons (http://forums.searchenginewatch.com/showthread.php?t=4030&highlight=Bogon) that the information from the search submit service was extremely helpful in fixing, so it can also be used as an analytical tool in a pinch. ;)

2. The page I checked would not validate in the W3C checker due to a non-Latin character in line 686, so I could not check it further than this. I don't think this is the problem (browsers are fairly robust), but I would check several representative pages and see if they validate or, if they do not, that the reasons are technical rather than fatal.

The URL structure itself doesn't look dynamic, even though it obviously is, since you have replaced the usual "?" and "=" with slashes, so I don't *think* it's a URL issue.

I'd have to look more closely to be certain, but in the meantime I would check out Search Submit and validate the pages (or otherwise check for errors).
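For the validation point, a small script can list where non-ASCII characters sit in a template so they can be compared with the line number a validator reports. This is a minimal sketch: the function name `find_non_ascii` is hypothetical, and it treats "non-Latin" loosely as "outside ASCII", which may not be exactly what the W3C checker objected to.

```python
def find_non_ascii(text: str) -> list:
    """Return (line_number, column, character) for each non-ASCII character."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            # Code points above 127 fall outside plain ASCII.
            if ord(ch) > 127:
                hits.append((line_no, col, ch))
    return hits
```

Running it over the saved page source and looking at the entries for the reported line should show which character the validator stumbled on.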

Sorry I could not be more help.

Ian

mcanerin
05-18-2005, 04:02 PM
Wow, I think I just found the problem. If you load this page: http://www.diesel-ebooks.com/cgi-bin/item/parent-0425189031

into the real version of Lynx (http://lynx.browser.org/) (not an online tool), and disallow cookies, it responds with an alert claiming that the 404 page could not be found, followed by a request to set 5 (!!!) cookies. I denied those, since that's what search engines do. It then blinks a few times and ends up showing a completely different page with almost no content on it.

I would start with that behaviour as the most likely suspect.
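A similar check can be scripted for anyone without Lynx to hand. The sketch below uses Python's standard `urllib.request`, whose default opener follows redirects but keeps no cookie jar, so any cookie the server tries to set is never sent back. The function names and the "ExampleBot/0.1" user agent are invented for this illustration, and the result only approximates what a real crawler would see.

```python
import urllib.error
import urllib.request


def summarize(status, final_url, headers, body_bytes):
    """Condense a response into the facts relevant to the cookie question."""
    cookies = [value for name, value in headers if name.lower() == "set-cookie"]
    return {
        "status": status,
        "final_url": final_url,
        "cookies_requested": len(cookies),
        "body_bytes": body_bytes,
    }


def fetch_without_cookies(url, user_agent="ExampleBot/0.1"):
    """Fetch a URL with no cookie jar and report what came back."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        response = urllib.request.urlopen(request, timeout=10)
        status = response.status
    except urllib.error.HTTPError as err:
        # An HTTPError also carries the error page's headers and body.
        response, status = err, err.code
    try:
        body = response.read()
    finally:
        response.close()
    return summarize(status, response.geturl(), response.headers.items(), len(body))
```

Calling `fetch_without_cookies` on a product page URL and comparing the status, final URL, and body size with what a normal browser shows would reveal whether cookie-less visitors are being sent somewhere else.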

Ian

lsredford
05-18-2005, 06:01 PM
Ian, just saw this as I've been reading your bogon articles. Much of it is over my head, but I do understand the first sentence and it has me excited.

If I disable cookies in my Firefox and load the site, I can see it fine. What is the relevance of Lynx to this? Do you mean the spider's behaviour is similar to your Lynx tool, so it helps in identifying a barrier? I have forwarded this to my web host and we will explore the potential barrier you have identified.

I've looked at Yahoo's search submit in the past, but at a $50 "review fee" for each url and then $.15 a click, it doesn't fit my cost model. I have tens of thousands of pages and they are mostly low ticket ebooks. I can handle the click charge, but the review fee is the deal killer.

One takeaway from this is we need to clean up the code in the templates.

Thank you very much.

Scott

lsredford
05-20-2005, 01:54 PM
Ian, we still can't figure out what you mean with the Lynx connection. Can you (or somebody else on this board) please clarify your "just found the problem" post?

Thank you

Scott

Marcia
05-20-2005, 01:58 PM
Lynx is a text-only browser that sees a page like a search engine crawler does, so if Lynx has a problem, so do crawlers.

lsredford
05-20-2005, 05:16 PM
Thanks, Marcia. That helps. We pulled it up in a Linux-based Lynx browser, disabled cookies, and everything looked fine. I wish we could duplicate what he saw. Have you (or anybody else) tried pulling this up in Lynx and had problems?

http://www.diesel-ebooks.com/cgi-bi...rent-0425189031


Thanks
Scott

lsredford
05-23-2005, 04:09 PM
Is anybody having problems pulling this up in a Lynx browser?

http://www.diesel-ebooks.com/cgi-bi...rent-0425189031


Scott