Robots.txt

Search engine optimization (SEO) is an important consideration for any site in this day of information technology. Most search engines use site crawlers, spiders, or simply 'bots' to make sure their searches return the proper results. However, not all bots are beneficial, and even the ones that are can sometimes cause issues on your site.

This article will explain how a robots.txt file within your account can help you manage the bots that crawl your site. The first things to understand are the options you can put into your robots.txt file:

robots.txt directives

User-agent

User-agent:

This specifies the bot you want to pass instructions to. It should be the first line of each group of rules in your robots.txt file. You can either specify all bots using the wildcard symbol * or name them individually (Googlebot, Bingbot, etc.).
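
For example, the following line begins a group of rules that applies only to Google's crawler; other bots would skip ahead to the group that starts with their own User-agent line (or the * group):

User-agent: Googlebot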

Disallow

Disallow:

This is the line that most commonly follows the User-agent line. After it, you designate a folder or file that you do not want to be 'crawled'. For example, you can specify simply / to prevent any of your files from being crawled, or you can get more specific and designate something like /images/image.jpg to stop that one image file from being crawled.
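
As a quick sketch, the following lines ask every bot to skip an assumed /private/ folder and one specific image (both paths are just placeholders):

User-agent: *
Disallow: /private/
Disallow: /images/image.jpg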

Allow

Allow:

This line specifies any exceptions to a Disallow rule that you want bots to follow. Note that only the major search bots, such as Googlebot and Bingbot, honor the Allow override.
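
For example, this sketch blocks an assumed /images/ folder but still lets bots crawl a single file inside it (the paths are placeholders):

User-agent: *
Disallow: /images/
Allow: /images/logo.jpg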

Crawl-delay

Crawl-delay:

This is the most commonly used option in a robots.txt file, as it slows down the rate at which bots hit your site. This prevents bots from consuming all of the resources available to your site while they index it. After the option, you place the number of seconds that you want the bot to wait between requests. Note that Googlebot ignores Crawl-delay; its crawl rate is managed through Google Webmaster Tools instead (see below).
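
For example, the following lines ask all bots that honor Crawl-delay to wait 10 seconds between requests:

User-agent: *
Crawl-delay: 10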

Example robots.txt

So, let's say we configured a robots.txt file with the following options:

User-agent: *
Allow: /
Disallow: /administrator
Disallow: /cgi-bin

That robots.txt file would allow all search engines to crawl and index every folder except the /administrator and /cgi-bin folders.

These options will help you to control the traffic of bots that access your site.

Googlebot's crawl rate is also controlled through "Google Webmaster Tools", which you can sign up for or access here:

http://www.google.com/webmasters/tools/

It is also important to know that malicious bots do not adhere to the robots.txt file, so you will need to block them another way. If you have access to cPanel, you can block a malicious bot's IP(s) from accessing your site using cPanel's 'IP Deny Manager'. You can view our knowledgebase article on Blocking an IP from your Website for more details.
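
The IP Deny Manager generally accomplishes this by adding deny rules to your site's .htaccess file. If you prefer to add them by hand, a minimal sketch looks like the following (the address is a placeholder from a documentation range, and Apache 2.2-style syntax is assumed):

# Block a single abusive IP address (example address only)
Order Allow,Deny
Allow from all
Deny from 203.0.113.45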

You can view more information on robots.txt files here:

http://en.wikipedia.org/wiki/Robots_exclusion_standard