Robots.txt
Latest revision as of 12:23, 27 December 2012
Search engine optimization (SEO) matters for any site that depends on search traffic. Most search engines use site crawlers, also called spiders or simply bots, to discover and index pages so that searches return the proper results. However, not all bots are beneficial, and even the legitimate ones can sometimes cause issues on your site.
This article explains how a robots.txt file within your account can help you manage the bots that crawl your site. Start with the options you can put into your robots.txt file:
robots.txt directives
User-agent
<syntaxhighlight lang="bash">User-agent:</syntaxhighlight>
This specifies the bot that the instructions after it apply to. It should be the first line of each group of rules in your robots.txt file. You can either address all bots using the wildcard symbol * or name a bot individually by its user-agent token (for example, Googlebot or bingbot).
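To see how User-agent groups are selected, here is a minimal sketch using Python's standard urllib.robotparser module. The rules, the paths, and the bot name examplebot are made up for illustration. It assumes the usual behavior that a bot which finds a group naming it follows that group instead of the * group.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: one group for bingbot, one fallback group for everyone else.
RULES = """\
User-agent: bingbot
Disallow: /drafts/

User-agent: *
Disallow: /private/
"""


def can_crawl(agent: str, path: str) -> bool:
    """Return True if `agent` may fetch `path` under RULES."""
    parser = RobotFileParser()
    parser.parse(RULES.splitlines())
    return parser.can_fetch(agent, path)


# bingbot uses its own group; any other bot falls back to the * group.
for agent in ("bingbot", "examplebot"):
    for path in ("/drafts/a.html", "/private/a.html"):
        print(agent, path, can_crawl(agent, path))
```

In this sketch bingbot is kept out of /drafts/ only, while every other bot is kept out of /private/ only.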
Disallow
<syntaxhighlight lang="bash">Disallow:</syntaxhighlight>
This is the line that most commonly follows User-agent. After it, you designate a folder or file that you do not want to be crawled. For example, you can specify simply / to prevent any of your files from being crawled, or you can be more specific and designate something like /images/image.jpg to stop only that image file from being crawled.
Allow
<syntaxhighlight lang="bash">Allow:</syntaxhighlight>
This line specifies any exceptions to the Disallow rule that you want certain bots to follow. Note that only the major search bots, such as those from Bing and Yahoo, follow the Allow override.
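As an illustration of Disallow with an Allow exception, the sketch below blocks a folder but leaves one file in it crawlable. The paths and the bot name examplebot are hypothetical. The Allow line is listed first because Python's urllib.robotparser applies the first rule that matches; other parsers may resolve overlaps differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: block the images folder except for one file.
RULES = """\
User-agent: *
Allow: /images/logo.png
Disallow: /images/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# Collect the verdict for a few sample paths.
results = {
    path: parser.can_fetch("examplebot", path)
    for path in ("/images/logo.png", "/images/photo.jpg", "/index.html")
}

for path, allowed in results.items():
    print(path, "allowed" if allowed else "blocked")
```

Only /images/photo.jpg is blocked here; the logo is covered by the exception and /index.html is not matched by any rule.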
Crawl-delay
<syntaxhighlight lang="bash">Crawl-delay:</syntaxhighlight>
This is a commonly used option in a robots.txt file because it slows down the rate at which bots hit your site. This helps prevent bots from consuming all the resources available for your site as they continue to index it. After the option, you place the number of seconds that you want bots to wait between requests. Not every crawler honors this option; Googlebot, for instance, ignores Crawl-delay.
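A well-behaved crawler reads the delay and waits that long between requests. The following is a small sketch of reading the value with Python's urllib.robotparser; the 10-second value and the bot name examplebot are assumptions chosen for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: ask every bot to wait 10 seconds between requests.
RULES = """\
User-agent: *
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# crawl_delay() returns the delay in seconds, or None if no delay applies.
delay = parser.crawl_delay("examplebot")
print("Seconds to wait between requests:", delay)
```

A crawler that respects the file would then pause for that many seconds (for example with time.sleep) before each new request.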
Example robots.txt
So, let's say we configured a robots.txt file with the following options:
<syntaxhighlight lang="bash">
User-Agent: *
Allow: /
Disallow: /administrator
Disallow: /cgi-bin
</syntaxhighlight>
That robots.txt file would allow all folders but the administrator and cgi-bin folders to be crawled and indexed by all search engines.
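The example can be checked with a short sketch. Parsers differ in how they resolve an Allow and a Disallow that both match a path, so this sketch does not use a library; it assumes the longest-match precedence that the major search engines document, with Allow winning a tie. The helper names load_rules and is_allowed are made up for this illustration, and the sketch ignores which bot each rule targets.

```python
EXAMPLE = """\
User-Agent: *
Allow: /
Disallow: /administrator
Disallow: /cgi-bin
"""


def load_rules(text: str) -> list[tuple[bool, str]]:
    """Collect (is_allow, path_prefix) pairs from Allow and Disallow lines."""
    rules = []
    for line in text.splitlines():
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field in ("allow", "disallow") and value:
            rules.append((field == "allow", value))
    return rules


def is_allowed(rules: list[tuple[bool, str]], path: str) -> bool:
    """Apply the longest matching prefix; allow the path when nothing matches."""
    matches = [(len(prefix), allow) for allow, prefix in rules if path.startswith(prefix)]
    if not matches:
        return True
    # max() prefers the longest prefix and, on equal length, Allow over Disallow.
    return max(matches)[1]


rules = load_rules(EXAMPLE)
for path in ("/index.html", "/administrator/index.php", "/cgi-bin/test.cgi"):
    print(path, "allowed" if is_allowed(rules, path) else "blocked")
```

Under that assumption, /index.html is allowed while anything under /administrator or /cgi-bin is blocked, which matches the description above.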
These options will help you to control the traffic of bots that access your site.
Googlebot can also be controlled through "Google Webmaster Tools", which you can sign up for or access here:
http://www.google.com/webmasters/tools/
It is also important to know that malicious bots do not adhere to the robots.txt file, so you need to block their access directly. If you have access to cPanel, you can block a malicious bot's IP address(es) from accessing your site using cPanel's 'IP Deny Manager'. You can view our knowledgebase article on Blocking an IP from your Website.
You can view more information on robots.txt files here: