spidering https or secured sites


yuwyma
01-05-2006, 07:37 PM
Can anyone confirm whether the Yahoo spider can index https or secured web sites (no logins)?

Chris_D
01-05-2006, 08:07 PM
Hi and welcome to SEW Forums!

Yahoo! does not index https:// pages (port 443 secure).

There are several ways to prevent our crawler from indexing your site or portions of your site:

- create a "robots.txt" file on your web site to prevent our crawler from indexing your site
- add a "noindex" meta tag to your documents
- remove the original document from your web site
- host the document on a secure section of your web site (HTTPS or login)
http://help.yahoo.com/help/us/ysearch/deletions/deletions-03.html
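As a rough illustration of the first option in that list, here is a minimal Python sketch that checks whether a given crawler is allowed to fetch a URL under a robots.txt file. The robots.txt content, the `/private/` path, the example.com URLs and the `allowed` helper are all made up for the example; "Slurp" is used as the user-agent token for Yahoo!'s crawler. It only shows how the rules are evaluated by Python's standard parser, not how Yahoo! itself processes them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep Slurp out of /private/, allow everyone else.
ROBOTS_TXT = """\
User-agent: Slurp
Disallow: /private/

User-agent: *
Disallow:
"""


def allowed(robots_txt, user_agent, url):
    """Return True if user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("Slurp", "http://www.example.com/private/report.html"),
        ("Slurp", "http://www.example.com/public/index.html"),
        ("Googlebot", "http://www.example.com/private/report.html"),
    ]:
        print(agent, url, allowed(ROBOTS_TXT, agent, url))
```

To test a live site instead of a string, `RobotFileParser` also has `set_url()` and `read()` to download the file first.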

yuwyma
01-05-2006, 08:16 PM
Hi Chris, thanks for the reply. That's what I saw on their help page, but I have received conflicting information. I contacted someone from Yahoo and am still waiting for confirmation. I am also trying to look at the access logs for the Yahoo spider, but unfortunately it will take a few days for me to get access.

Chris_D
01-05-2006, 11:02 PM
I think you'll find that Yahoo! will generally only 'index' a secure page when a request for the non-secure (http) variant returns a 302 (temporarily moved) redirect to the secure variant.

e.g. Try this search:

http://search.yahoo.com/search?p=https%3A%2F%2Fwww.bnz.co.nz%2FInternet_Banking%2F1%2C1184%2C10-144-579%2C00.html

Now try this search:

http://search.yahoo.com/search?p=http%3A%2F%2Fwww.bnz.co.nz%2FInternet_Banking%2F1%2C1184%2C10-144-579%2C00.html

The http version of the page gets indexed. Remember - Yahoo doesn't show http:// in the SERPs.

The http page returns a 302 redirect to the https page (which means: index the requested URL {http://}, but with the content of the 302 target page).

The http page therefore gets indexed with the content of the https:// page. If you click on the link in the SERPs you'll get 302'd to the secure page - making it appear that the secure page itself was indexed.
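If you want to check whether one of your own http URLs behaves this way, something like the Python sketch below should do it. It sends a single HEAD request without following the redirect and reports whether the answer is a 302 pointing at an https:// address. The function names and the example.com URL are placeholders of mine, and some servers answer HEAD differently from GET, so treat the result as a hint rather than proof of what the spider sees.

```python
import http.client
from urllib.parse import urlsplit


def fetch_redirect(url, timeout=10):
    """Send one HEAD request without following redirects.

    Returns (status_code, Location header or None).
    """
    parts = urlsplit(url)
    conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                else http.client.HTTPConnection)
    conn = conn_cls(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


def is_temporary_redirect_to_https(requested_url, status, location):
    """True when an http:// URL answered 302 with an https:// Location."""
    if urlsplit(requested_url).scheme != "http":
        return False
    if status != 302 or not location:
        return False
    return urlsplit(location).scheme == "https"


if __name__ == "__main__":
    url = "http://www.example.com/"  # placeholder: put your own http URL here
    status, location = fetch_redirect(url)
    print(status, location, is_temporary_redirect_to_https(url, status, location))
```

A 301 (moved permanently) would come back False here, which matters because the behaviour described above is specific to the temporary 302.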

PM me the url you are inquiring about & I'll have a look.

Also - many https:// pages use robots.txt or meta robots tags where the owner doesn't want the site indexed.