Protecting Your Websites From Search Engines


by Ben Cortese - 2007-01-26

There are a great number of scenarios in which you should be protecting your websites from search engines. If you've developed a website with a personal administrative section, as many of us do, you may not want that administrative URL showing up in Google or Yahoo search results.

If you are accepting documents from clients as Word or PDF files and storing them in a directory on your web server, those files can be open to the world regardless of whether that section of the website is password protected, simply because they reside under the web server's document root. And thanks to the power of Google, you can determine what Word and PDF documents are exposed on a website with a simple search such as this (with example.com replaced by the site in question):

site:example.com filetype:doc

Experiment with this format in a Google search; the results may surprise you. They surprised me once when I searched for PDF documents on a bank website and found PDFs designated for client eyes only. I wasn't a client, yet I was able to view their documents.
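A couple of variations worth experimenting with, again using example.com as a stand-in for the real domain (the keyword in the second query is only an illustration):

site:example.com filetype:pdf
site:example.com filetype:pdf confidential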

The website itself was protected, and the area used to access the documents was protected by username and password. What wasn't protected was the directory in which the institution was storing the documents. Adding insult to injury, they hadn't followed the common web practice of placing a robots.txt file in the root directory of their website.

There are a number of levels of security that simply must be in place. I'm certainly not a security expert, but I know that you have to protect your website at the network level and at the application level as well. Your network operations team needs to do its part by locking down your directories, installing patches, and configuring and monitoring firewalls.
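To make the directory lockdown concrete, here is a minimal sketch assuming an Apache server, with /client-docs standing in for wherever the documents actually live. It turns off directory listings and denies direct web access to the folder:

# in httpd.conf or an .htaccess file -- the /client-docs path is hypothetical
<Directory "/var/www/html/client-docs">
    Options -Indexes
    Order allow,deny
    Deny from all
</Directory>

Better still, store such documents outside the document root entirely and serve them through a script that checks the user's credentials first.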

And developers need to do their part in securing the applications that require it. A very simple step is to include a robots.txt file in the root directory of the web server. I'm not speaking of the root directory of the application; I'm talking about the root of the web server.

The robots.txt file must reside in the root of the domain and must be named "robots.txt". A robots.txt file located in a subdirectory isn't valid, as bots only check for this file in the root of the domain. For instance, "http://www.example.com/robots.txt" is a valid location, but "http://www.example.com/mysite/robots.txt" is not.
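A quick way to confirm the file is reachable where crawlers expect it, again with example.com standing in for your own domain, is to request it directly:

curl http://www.example.com/robots.txt

If that request returns your rules rather than a 404, crawlers will find them too.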

There are many variations of a robots.txt file, and you have a lot of flexibility in defining which directories or files you want indexed by search engines and which you do not. You do not use a robots.txt file to assure that pages are indexed by search engines; you use it to define the files and directories you do not want indexed. Keep in mind that it is only a request that well-behaved crawlers honor, not a lock, so it complements rather than replaces the directory permissions above. Still, it is one simple but very important way to keep someone from finding the "For Your Eyes Only" documents on your company website.
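For illustration, a minimal robots.txt along these lines would tell compliant crawlers to stay out of a documents directory and an administrative area (both directory names here are hypothetical):

# served from http://www.example.com/robots.txt
User-agent: *
Disallow: /client-docs/
Disallow: /admin/

One trade-off to be aware of: listing a path in robots.txt also advertises its existence to anyone who reads the file, which is another reason the directory itself must be locked down.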



Ben Cortese is a developer and business analyst for the financial industry and enjoys developing websites through MerchantWeb Marketing.

Copyright 2007.


© The article above is copyrighted by its author. You're allowed to distribute this work according to the Creative Commons Attribution-NoDerivs license.
 
