What You Should Know About Search Engine Bots
Back around 3 B.G. (Before Google), AltaVista was the new search engine on the block. In an effort to show off the power of their minicomputers, the AltaVista team at Digital decided to crawl and index the entire web. At the time this was a new concept. Many webmasters didn't relish the idea of a "robot" program accessing every page on their web site, as this would add load to their web servers and increase their bandwidth costs. The Robots Exclusion Standard, first proposed in 1994, was created to address exactly these webmaster concerns.
You can use a simple text file called robots.txt to keep search engines out of a directory. Here is a very simple example that will prevent all search engines (user-agents) from accessing the /images directory.
    User-agent: *
    Disallow: /images
By disallowing /images you are also implicitly disallowing all subdirectories under /images, such as /images/logos and any files beginning with /images such as /images.html.
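If you want to see this prefix matching for yourself, here is a minimal sketch using Python's standard urllib.robotparser module. The crawler name "ExampleBot", the helper is_allowed and the sample paths are made up for illustration, and a real search engine may evaluate rules somewhat differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The same rules as the example above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /images
"""


def is_allowed(robots_txt: str, user_agent: str, path: str) -> bool:
    """Parse robots.txt text and report whether user_agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


# "ExampleBot" is a hypothetical crawler name; the paths are illustrative.
for path in ["/images", "/images/logos/logo.png", "/images.html", "/index.html"]:
    print(path, is_allowed(ROBOTS_TXT, "ExampleBot", path))
```

Only /index.html should come back as allowed, because every other path starts with the string /images.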
Strangely enough, the first version of this standard did not contain an "Allow" directive. It was added later, but there is no guarantee that every search engine supports it. This means that anything not specifically disallowed is fair game for web crawlers.
To disallow access to your entire web site use a robots.txt like this:
    User-agent: *
    Disallow: /
If User-agent is * then the following lines apply to all search engine robots. By specifying the signature of a web crawler as the User-agent you can give specific instructions to that robot.
    User-agent: Googlebot
    Disallow: /google-secrets
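To check that a rule aimed at one crawler leaves the others alone, a short sketch with Python's standard urllib.robotparser can help. "OtherBot" and the file name plan.html are hypothetical, chosen only for this illustration.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "User-agent: Googlebot",
    "Disallow: /google-secrets",
])

# The rule names Googlebot, so only Googlebot is kept out of the directory.
googlebot_ok = parser.can_fetch("Googlebot", "/google-secrets/plan.html")
# "OtherBot" is a made-up crawler; no group in the file applies to it.
otherbot_ok = parser.can_fetch("OtherBot", "/google-secrets/plan.html")

print("Googlebot allowed:", googlebot_ok)
print("OtherBot allowed:", otherbot_ok)
```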
Since the original spec was published, several search engines have extended the protocol. One popular extension is to allow wildcards.
    User-agent: Slurp
    Disallow: /*.gif$
This prevents Yahoo! (whose web crawler is called Slurp) from indexing any files on your site that end with ".gif". Keep in mind that not all search engines support wildcard matching, so put these lines only under the User-agent line of a crawler that does.
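To make the wildcard semantics concrete, here is a rough sketch of how such a pattern could be matched against a URL path. It assumes the common convention that "*" matches any run of characters and a trailing "$" anchors the end of the URL. The function name rule_matches is my own, and this is not the code any particular search engine runs.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a robots.txt path pattern matches the given URL path.

    Assumed semantics: '*' matches any sequence of characters and a
    trailing '$' means the pattern must match through the end of the path.
    Without '$' the pattern is a plain prefix match.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces, then rejoin them with ".*" for each "*".
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored:
        return re.fullmatch(regex, path) is not None
    # re.match only anchors at the start, which gives prefix matching.
    return re.match(regex, path) is not None


print(rule_matches("/*.gif$", "/photos/cat.gif"))
print(rule_matches("/*.gif$", "/photos/cat.gif?size=large"))
print(rule_matches("/images", "/images.html"))
```

Note that the "$" matters: without it, /*.gif would also match a URL such as /photos/cat.gif?size=large.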
You can combine several of the above techniques in one robots.txt file. Here's a theoretical example.
    User-agent: *
    Disallow: /bar

    User-agent: Googlebot
    Allow: /foo
    Disallow: /bar
    Disallow: /*.gif$
    Disallow: /
Computer applications work great when it comes to following well-defined instructions. The human brain, however, is less efficient at these functions, so the best advice is to keep things simple.
For us mortals there is a robots.txt analysis tool in Google's webmaster tools. Highly recommended. Another good resource for more information on the Robots Exclusion Standard is www.robotstxt.org.
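For a quick local sanity check of the combined example above, here is a sketch that feeds it to Python's standard urllib.robotparser. Two caveats: as far as I know this parser does not implement the wildcard extension, so it reads the /*.gif$ line literally, and it applies the first rule that matches, which is not necessarily how Googlebot resolves conflicts. "OtherBot" and the page names are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# The combined robots.txt from the example above.
COMBINED = """\
User-agent: *
Disallow: /bar

User-agent: Googlebot
Allow: /foo
Disallow: /bar
Disallow: /*.gif$
Disallow: /
"""

parser = RobotFileParser()
parser.parse(COMBINED.splitlines())

# (crawler, path) pairs to try; "OtherBot" stands in for any other crawler.
checks = [
    ("Googlebot", "/foo/page.html"),
    ("Googlebot", "/bar/page.html"),
    ("Googlebot", "/index.html"),
    ("OtherBot", "/index.html"),
    ("OtherBot", "/bar/page.html"),
]
for agent, path in checks:
    print(agent, path, parser.can_fetch(agent, path))
```

Googlebot follows only its own group, so it ends up limited to /foo, while every other crawler falls back to the * group and is kept out of /bar only.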
Today, when companies are spending a lot of money to be included in search engine listings, the idea of excluding your content may seem quaint. But from a security perspective there are many valid reasons for limiting what a search engine indexes on your site. See my Digital Security Report for more information.
Read more of Nick Dalton's Internet security articles on his blog for Internet business owners at TipsTricksToolsTechniques.com.
Published November 8th, 2007




