Old 11-16-2004   #1
orion
Join Date: Jun 2004
Posts: 1,044

Robots.txt & Security Issues

SECURITY RISKS ASSOCIATED WITH ROBOTS.TXT FILES


A search for robots.txt in the SEW Forums returned about 50 threads. In my opinion, the most relevant are

Robots.txt Generator
404 errors and robots.txt file
Let's discuss ROBOTS.TXT
robots.txt page inside root directory
Mod Rewrite for robots.txt
Robots.txt files and excluding variables?

These threads provide excellent posts on the use of robots text files as resource tools in web properties. However, none of them discusses the security risks that can be associated with robots text files. While this subject is nothing new, I feel it won't hurt to revisit the risks associated with these files.

Back in 2001, when Google introduced file-specific searches, savvy users and hackers realized that these types of searches could be used to discover security holes and passwords. This CNET legacy article from those days briefly mentions the security risks associated with robots text files. In a nutshell:

1. A robots.txt file can only succeed with search-compliant bots.
2. Malicious crawlers do not have to follow the robots exclusion protocol. (A ban list in .htaccess performs a bit better against these if you know who your enemy is; see the sketch after this list.)
3. Worse still, the robots.txt file acts as a nice advertisement to hackers that valuable or sensitive information could lie ahead in a particular directory path, for a particular reason.
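
Here is a minimal sketch of the .htaccess ban list mentioned in item 2, assuming an Apache server with mod_setenvif available; the user-agent strings "BadBot" and "EvilCrawler" are placeholders, not real crawler names.

Code:
# Hedged sketch: refuse requests from selected user-agents.
# "BadBot" and "EvilCrawler" are placeholder names for illustration only.
SetEnvIfNoCase User-Agent "BadBot" bad_bot
SetEnvIfNoCase User-Agent "EvilCrawler" bad_bot
Order Allow,Deny
Allow from all
Deny from env=bad_bot

Unlike robots.txt, this works against crawlers that ignore the exclusion protocol, but only for user-agent strings you already know about.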

Savvy visitors with their own personal search tools can try to datamine the discovered directory paths.

Christopher Klaus, founder and chief technology officer of Internet Security Systems was quoted in the above CNET legacy article as saying, “…a robots.txt file could be a flag for intruders to say, this must be interesting if robots are being told not to look at it.”

Very true then and now.

Indeed, when conducting intel, one of the first things someone would check is whether the target web property advertises a robots.txt file and what could lie ahead of the forbidden paths.

Robots.txt files unnecessarily hand hackers and non-hackers valuable information (a hypothetical example follows this list), since

1. anyone with a browser can see them by requesting "http://www.xyz.com/robots.txt", where xyz is the target domain name;
2. hackers and users can get a crude idea of the web property's directory tree;
3. they give away a notion of the size, coverage and extent of the web property;
4. they reveal the multilevel nature of the architecture employed;
5. lazy admins can leave behind unused or old paths that could be mined.
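
For instance, a hypothetical robots.txt along the following lines already sketches part of a directory tree and hints at where the interesting material sits; every path shown here is invented for illustration.

Code:
User-agent: *
Disallow: /intranet/
Disallow: /intranet/reports/
Disallow: /cgi-bin/admin/
Disallow: /beta/new-store/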

There are some free tools that can harvest "forbidden" paths in a snap.

Let's discuss the security risks associated with robots.txt files and how we can overcome them.

Disclaimer: The purpose of this thread is to encourage the discussion of security issues associated with robots.txt files, not to promote malicious practices.



Orion

Last edited by orion : 11-16-2004 at 10:13 PM.
Old 11-16-2004   #2
mcanerin
Join Date: Jun 2004
Location: Calgary, Alberta, Canada
Posts: 1,564
That's a good point, Orion.

I'm going to add it to the info on my own site, and in addition throw out the following ideas and caveats for discussion here.

Naturally, a robots.txt file can only be used for intel gathering IF it's used to exclude certain directories. A robots.txt that allows/refuses all, or only addresses certain well-known directories, is not going to be a security issue.

For example, this:

Code:
User-agent: *
Disallow: /
Tells a visitor nothing about your site (other than you don't want it to be spidered).

But this:

Code:
User-agent: *
Disallow: /secretsauce/
Can indicate that there is a directory called "secretsauce" and that you don't want it spidered. If the web owner was foolish enough not to put in any further security, then you have just issued an invitation to enter to anyone interested.

In essence, it's a site map to things you want hidden from search engines.

A more effective way of spider control is to use an "open" robots.txt and then use the robots and pragma meta tags in the headers of specific pages that you don't want spidered. These override the open invitation of the robots.txt and provide the best of both worlds, at the cost of having to keep track of individual pages.
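
A minimal HTML sketch of that page-level approach; the pragma line is optional and shown only as an example of the cache-control side.

Code:
<head>
  <!-- keep compliant spiders from indexing this page or following its links -->
  <meta name="robots" content="noindex,nofollow">
  <!-- optional: ask caches and proxies not to keep a copy -->
  <meta http-equiv="Pragma" content="no-cache">
</head>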

The best method is to simply use the robots.txt to control spider behaviour for documents that are otherwise open to the public (ie PPC landing pages, duplicate content, etc) and then use REAL security for anything that you actually want hidden.

If you require a password for a directory or page, then a spider won't get in, permission in the robots.txt notwithstanding.
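
As a hedged example of that real security, HTTP Basic authentication on a directory might look like this in an Apache .htaccess file; the realm name and the password-file path are placeholders.

Code:
# Hedged sketch: password-protect the directory regardless of robots.txt.
AuthType Basic
AuthName "Private area"
AuthUserFile /full/path/to/.htpasswd
Require valid-user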

Ian
Old 11-16-2004   #3
orion
Join Date: Jun 2004
Posts: 1,044

Hi, mcanerin

Quote:
Originally Posted by mcanerin
Naturally, a robots.txt file can only be used for intel gathering IF it's used to exclude certain directories.
True, mcanerin. Robots.txt files were meant to prevent robots from indexing specific data, not to protect data.
Chami.com puts the use of robots.txt in the right perspective:

"Do not use the robots.txt file to protect or hide information.
Misunderstanding and misuse of any technology or tool, including paperclips, can be a security risk."



On the other hand, robots.txt files can lead to what I call "secondary risks" or "leading risks". There may be nothing wrong in a path, but the path itself could lead to unnecessary trouble. Often, with large entities that have huge sites (universities, hospitals, non-profit organizations, foreign government sites, etc.), advertising the architecture of a domain and its directory tree can be a real issue.


Orion

Last edited by orion : 11-16-2004 at 10:09 PM.
Old 11-17-2004   #4
orion
Join Date: Jun 2004
Posts: 1,044

On Risks and Hopes

In this post I discuss the risks associated with advertising directory paths and an approach one could try with non-human visitors.


Directory Paths and Search Tools

Xenu is a free crawling tool (not a search engine crawler) designed to, among other things, identify and extract links, email addresses, application files, external files, URL name=value pairs, images, in/out links, etc.

The tool requires human input; i.e., the user must specify an initial seed, which can be a domain address ("http://www.xyz.com") or a directory path ("http://www.xyz.com/abc/…"). Xenu then crawls all links associated with the initial seed.

A user who doesn't know which path to mine only needs to check the robots.txt file of the target web property. If the property has a robots.txt file advertising directory paths, then he/she only needs to pick a seed path, enter it into Xenu, sit back and wait. This can lead to the discovery of new paths not specified in the robots.txt file. Great!


Directory Paths and Synchronized Crawls

Another problem with letting others know about directory paths is that malicious users can synchronize concerted and continuous (recursive) crawls against each path. Can you guess the outcome of this?


Embracing a Partial Hope: Robots.txt Rewrites

A 2002 WMW thread describes a procedure for rewriting robots.txt requests in order to avoid unwanted non-human visitors. The prescribed procedure consists in banning all robots and allowing the good ones with the following script:

Code:
RewriteCond %{HTTP_USER_AGENT} ^(Mozilla|Opera)
RewriteCond %{HTTP_USER_AGENT} !(Slurp|surfsafely)
RewriteRule ^robots\.txt$ /someotherfile [L]
where "someotherfile" could be a fake robots.txt or a blank file.

A line-by-line explanation is given below.

Line 1: IF the User-agent string starts with "Mozilla" or “Opera”, do this rewrite.

Line 2: AND IF the User-agent string does not contain "Slurp" or "surfsafely" (i.e., two bots whose UAs start with "Mozilla"), do this rewrite.

Line 3: THEN do the rewrite of robots.txt to “someotherfile”.

Any visitor matched by both lines 1 and 2 (i.e., one claiming a browser-like user-agent that is not one of the whitelisted bots) will be served the fake file; everything else gets the real robots.txt. Feel free to modify this script to your heart's content.
This solution works with some non-human visitors...
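
One practical note, offered as a hedged aside: if those three lines sit in an .htaccess file, mod_rewrite has to be switched on for them to do anything, roughly like this.

Code:
# Assumed context for the rewrite script above (.htaccess, mod_rewrite available)
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^(Mozilla|Opera)
RewriteCond %{HTTP_USER_AGENT} !(Slurp|surfsafely)
RewriteRule ^robots\.txt$ /someotherfile [L]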


Orion

Last edited by orion : 11-17-2004 at 01:35 PM. Reason: adding last line, typos
Old 11-18-2004   #5
orion
Join Date: Jun 2004
Posts: 1,044

On Leading Security Risks

DIRECTORY PATHS AS LEADING SECURITY RISKS

Hackers long ago realized that there are two types of security risks: (a) inherent and (b) leading. While the former are often obvious, the latter are not. A leading security risk is one that simplifies or enables the commission of malicious intent. Advertising directory paths for no reason often qualifies as a leading or enabling risk.

ENTERPRISE INTRANETS

To illustrate, consider the case of intranets. By definition, an intranet is an isolated network architecture. A cardinal rule is to never, ever give access to an enterprise intranet through the Internet. In my view, once you do this and someone can access it with a browser, it is no longer an intranet.

Surprisingly, too often one can see large web properties belonging to universities and to private and public sites facilitating intranet access through the Internet. Often this is the easiest and simplest network architecture solution. At the same time, it is an invitation to trouble and an accident waiting to happen.

MURPHY'S LAW

Never, ever put online a research intranet or enterprise resources reserved for internal purposes thinking that its directory path will never be found. Murphy's Law rules: if something can go wrong, it will. More often than not one can find "hidden" or "buried" directory paths explicitly exposing transactions or name=value pair information with

User or client ID names
Password tokens
IP and email addresses
Login patterns

And the list goes on.

Thus, poor directory tree configurations or, worse, handing directory trees to the general public is risky business. This is often the case with poorly written

online stores
discussion forums
script resources
online databases
site search tools
cgi mechanisms, etc.

Often, upgrading or installing new resources that follow default parameters or that append new directories to your root directory can expose new security holes, fully accessible through a browser.

Don't Trust "Hidden" Paths

Too often, to access or discover sensitive or recently appended paths, one just needs a seed path found in a robots.txt file: mine the seed, keep looking ahead for new "hidden" paths, and so on. This procedure often leads to the discovery of true or inherent security risks.

Many think that undocumented Web servers or undocumented paths and directory trees can be used free of risk, or that these will never, ever be discovered by someone. Wrong! Lamo's adventures at WorldCom should convince anyone. But for those who are still not convinced, let's revisit this Kevin Poulsen report:

“Web applications are an overlooked chink in many organizations' network security armor, Lamo explains. Sometimes, the weakness is an improperly configured Access Control List (ACL) that allows anyone on the Internet to visit an application that should be restricted. Other times, network administrators deliberately leave secret Web page wide open, counting on nobody stumbling across the URL.”

”Lamo is a master of this unlisted Web. He can direct you to the Web site at Apple Computer that yields a trove of detailed circuit diagrams and schematics, marked "proprietary," but available to anyone with knowledge of the URL. He knows a particular Web address at the prestigious Journal of Commerce (JoC) that routes to an unprotected administrative tool that grants access to the publication's database of online subscribers, their names, email addresses and passwords.”

The main thesis of this thread is not about fueling paranoia or being alarmist, but about promoting security awareness when using robots text files. Often, leading risks are your real enemy. (Remember Murphy's Law.)


Sometimes history teaches us good lessons. Here are some stories:

Hacker finds fault in .Net security. Describes how poorly configured defaults can hurt you.

He Hacks by Day, Squats by Night. A two-page story about Lamo, misconfigured proxies and stubborn executives.

More on Lamo, Poulsen and Mitnick. With pictures.


Orion

Last edited by orion : 11-18-2004 at 12:21 PM. Reason: Adding material
Old 11-20-2004   #6
Chris_D
Join Date: Jun 2004
Location: Sydney Australia
Posts: 1,099
"Security via obsurity" has never been a solid strategy for maintaining a secure location on the web - even less so when you identify the obsure file location!!

Old 11-21-2004   #7
Papadoc
Member
Join Date: Jun 2004
Posts: 79
Interesting... JoC must have been tipped off and done a quick review of some security. Their robots file also disallows the enews folder. From what I've seen elsewhere, this is where they publish their subscriber newsletters.

What they evidently didn't consider is that doing a robots disallow and having a convoluted naming convention doesn't do a thing if your subscribers then post links to this private area from their site.

It seems as though there has been a quick rearrangement of files and folders as every one of these links now comes up 404.
Old 11-21-2004   #8
orion
Join Date: Jun 2004
Posts: 1,044

More Leading Risks

CGI Risks

From time to time, upgrades, new applications, add-ons or other modifications can result in the appending of new paths that web administrators or webmasters may not be aware of.

The following example is taken from the W3C Security FAQ pages
(http://www.w3.org/Security/Faq/wwwsf4.html):

“Consider the following scenario. For convenience's sake, you've decided to identify CGI scripts to the server using the .cgi extension. Later on, you need to make a small change to an interpreted CGI script. You open it up with the Emacs text editor and modify the script. Unfortunately the edit leaves a backup copy of the script source code lying around in the document tree. Although the remote user can't obtain the source code by fetching the script itself, he can now obtain the backup copy by blindly requesting the URL: "http://your-site/a/path/your_script.cgi~"

(This is another good reason to limit CGI scripts to cgi-bin and to make sure that cgi-bin is separate from the document root.)”

End of the quote.

Someone scouting for directory paths could find the appended path, effectively compromising your system.
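
One defensive sketch for this particular leak, assuming an Apache server: refuse requests for editor backup and leftover copies, so a blindly requested "~" URL fails even if the file was forgotten in the document tree. The extension list is only an example.

Code:
# Hedged sketch (Apache): deny requests for editor backup/leftover files,
# e.g. your_script.cgi~ or your_script.cgi.bak left in the document tree.
<FilesMatch "(~|\.bak|\.old)$">
    Order Allow,Deny
    Deny from all
</FilesMatch>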


Possible Solution: Revisit Your Paths

It might be a good idea to do a full check of your root directory and web property tree structure from time to time. With large sites such as universities, government agencies and huge corporations, this is a must.

Start with the most obvious source: your robots.txt file(s). Use Xenu or a similar tool on your own system to revisit/rediscover all the directory paths found. Also try to see if you can grab stuff someone is not supposed to grab.

If you can find an anomaly, hidden paths or paths that compromise sensitive information, chances are that others blindly requesting paths (initially using a robots.txt file or systematically targeting your Web trees) can find them, too.


Orion

Last edited by orion : 11-21-2004 at 11:58 AM. Reason: typos
Old 11-21-2004   #9
orion
Join Date: Jun 2004
Posts: 1,044

Directory Listings and Symbolic Links

In a sense, the following information relates to my previous post on the leading or enabling risk of appending paths to trees. I am taking this from the Security FAQ of the W3C site, http://www.w3.org/Security/Faq/wwwsf3.html, and I quote (emphasis added):

Automatic directory listings

"Knowledge is power and the more the remote hacker can figure out about your system the more chance for him to find loopholes. The automatic directory listings that the CERN, NCSA, Netscape, Apache, and other servers offer are convenient, but have the potential to give the hacker access to sensitive information. This information can include: Emacs backup files containing the source code to CGI scripts, source-code control logs, symbolic links that you once created for your convenience and forgot to remove, directories containing temporary files, etc."

"Of course, turning off automatic directory listings doesn't prevent people from fetching files whose names they guess at. It also doesn't avoid the pitfall of an automatic text keyword search program that inadvertently adds the "hidden" file to its index. To be safe, you should remove unwanted files from your document root entirely."


Symbolic link following

"Some servers allow you to extend the document tree with symbolic links. This is convenient, but can lead to security breaches when someone accidentally creates a link to a sensitive area of the system, for example /etc. A safer way to extend the directory tree is to include an explicit entry in the server's configuration file (this involves a PathAlias directive in NCSA-style servers, and a Pass rule in the CERN server)."

"The NCSA and Apache servers allows you to turn symbolic link following off completely. Another option allows you to enable symbolic link following only if the owner of the link matches the owner of the link's target (i.e. you can compromise the security of a part of the document tree that you own, but not someone else's part)."

End of the quote.
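
As a hedged illustration of the Apache options the FAQ mentions, the directives below disable automatic directory listings and follow a symbolic link only when the link and its target share the same owner; the directory path is a placeholder.

Code:
# Hedged sketch (Apache httpd.conf): no auto-indexes, owner-matched symlinks only.
# "/usr/local/apache/htdocs" stands in for your real document root.
<Directory "/usr/local/apache/htdocs">
    Options -Indexes +SymLinksIfOwnerMatch
</Directory>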

It might also be a good idea to check the directory paths of temporary files.

Orion
Old 11-26-2004   #10
Nacho
Join Date: Jun 2004
Location: La Jolla, CA
Posts: 1,382
This is an excellent thread, Orion. Here is another great resource:

Search Engine Spider Identification (WebmasterWorld Forum 11) >> Updated and Collated Bot List >> latest update in posts #16 & #17

Saludos!
Old 11-27-2004   #11
orion
Join Date: Jun 2004
Posts: 1,044

Thanks, Nacho.

Old but handy resource. I couldn't see government bots listed (Mako and Carnivore's kids).

Orion
Old 02-22-2006   #12
wirehopper
Posts: n/a

robots.txt tester

This is a free utility that spiders a site (home page + 1 page deep) and reads the robots.txt file. It flags robots.txt disallows that aren't referenced in the pages scanned.

http://www.wirehopper.com/robots/ref/index.php

It bundles all the disallows into one set (disregarding the user-agent directives). The goal is to check whether the robots.txt file is divulging paths that aren't referenced through the site anyway.