What are web robots and robots.txt files?
A web robot, also known as a web crawler or a spider, is a program that crawls web sites and their pages automatically. Search engines, including Google, use web robots to traverse the web and index the content. A robots.txt file is created by website owners to give directions to web robots about the website, specifically indicating which parts or URLs of the website they do not want the web robot to crawl.
In other words, the robots.txt file is used to control the traffic of web robots in cases where the website owners don't want the crawler to crawl through pages on the website.
Why is a robots.txt file important?
Instruct the Search Engine Robots
The robots.txt file serves only as a set of instructions to web crawlers; it does not force a crawler to act accordingly. Most search engine robots follow the directives, but some web crawlers may ignore them.
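Because compliance is voluntary, the check happens inside the crawler itself. The following is a minimal sketch, using Python's standard urllib.robotparser module, of how a well-behaved crawler might consult the rules before requesting a page. The bot name ExampleBot, the example.com URLs and the /private/ rule are illustrative assumptions, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a rule set the crawler can query."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def should_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules permit this user agent to crawl the URL.

    Nothing enforces this check: a crawler that skips it can still
    request the page, which is why robots.txt is only advisory.
    """
    return parser.can_fetch(user_agent, url)


rules = build_parser(ROBOTS_TXT)
print(should_fetch(rules, "ExampleBot", "https://www.example.com/index.html"))           # True
print(should_fetch(rules, "ExampleBot", "https://www.example.com/private/report.html"))  # False
```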
Search engines like Google will generally not index web pages that have been blocked by a robots.txt file. However, a disallowed URL may still be discovered elsewhere on the web, for example when it is referenced from other websites. As a result, even though a URL has been blocked by robots.txt, it may still appear in the search engine’s results. To keep such content out of the results, you will have to employ other blocking methods, such as password protection or a noindex directive.
You need to exercise caution while blocking pages from Google’s crawler because it could result in the crawler not being able to correctly analyze your website and its pages, affecting your website’s ranking in the search results.
How to create a robots.txt file
You can create a robots.txt file and place it in the top-most directory of the web server on which your website runs. The robots.txt file is a plain text file and can contain either a single record or multiple records. A website owner can use it to block unimportant resources, such as style files or images.
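Since the file lives in the top-most directory, its address can be derived from any page URL on the same site. A small sketch of that derivation follows; the function name robots_txt_url and the example.com addresses are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Build the robots.txt address for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://www.example.com/blog/post.html?id=3"))
# https://www.example.com/robots.txt
```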
Below are a few examples of robots.txt files.
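A record with a User-agent of "*" and a Disallow of "/" blocks all web robots from accessing all the content on the website.
A record with a User-agent of "*" and an empty Disallow value gives all web robots access to all the content contained on the website.
A record with a User-agent of Googlebot and a Disallow of "/sampleURL.html" blocks only the Google web crawler from accessing a specific web page called sampleURL.html.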
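The three cases above can be written out as robots.txt text and checked with Python's standard urllib.robotparser module. This is only a sketch: the helper name allowed, the example.com host and the Bingbot comparison are assumptions added for illustration.

```python
from urllib.robotparser import RobotFileParser

# Blocks every web robot from the whole site.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""

# Gives every web robot access to the whole site.
ALLOW_ALL = """\
User-agent: *
Disallow:
"""

# Blocks only Googlebot, and only from one page.
BLOCK_GOOGLEBOT_PAGE = """\
User-agent: Googlebot
Disallow: /sampleURL.html
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt lets user_agent crawl url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


PAGE = "https://www.example.com/sampleURL.html"

print(allowed(BLOCK_ALL, "Googlebot", PAGE))             # False
print(allowed(ALLOW_ALL, "Googlebot", PAGE))             # True
print(allowed(BLOCK_GOOGLEBOT_PAGE, "Googlebot", PAGE))  # False
print(allowed(BLOCK_GOOGLEBOT_PAGE, "Bingbot", PAGE))    # True
```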
Important parameters of a robots.txt file
The User-agent field defines the web robot you are instructing. To give a directive to all web crawlers, use "*" as the value of the User-agent field.
The Disallow parameter defines the URL(s) you do not wish the web crawler to crawl. Each URL you want to block needs its own Disallow parameter on a new line.
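As a rough illustration of one record that addresses every crawler and blocks several URLs, each on its own line, consider the sketch below. The paths, the bot name ExampleBot and the helper name check_paths are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# "*" addresses every crawler; each blocked URL gets its own Disallow line.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /tmp/
Disallow: /drafts/notes.html
"""


def check_paths(robots_txt: str, user_agent: str, paths: list[str]) -> dict[str, bool]:
    """Map each path to whether user_agent may crawl it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    base = "https://www.example.com"  # hypothetical host
    return {path: parser.can_fetch(user_agent, base + path) for path in paths}


results = check_paths(
    ROBOTS_TXT,
    "ExampleBot",
    ["/private/a.html", "/tmp/cache.txt", "/drafts/notes.html", "/about.html"],
)
for path, ok in results.items():
    print(path, ok)
```

Only /about.html is reported as crawlable, since it matches none of the three Disallow lines.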
The Allow instruction, which Googlebot supports, tells the bot that even if a parent folder has been blocked, it is allowed to crawl a particular URL contained within that parent folder.
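A sketch of a blocked parent folder with one permitted URL inside it is shown below. The /archive/ folder and its file names are hypothetical. The permitting line is placed first because Python's standard parser, used here for the check, applies the first rule that matches; it is an assumption of this sketch that the same ordering is wanted in the file itself.

```python
from urllib.robotparser import RobotFileParser

# The parent folder is blocked, but one URL inside it is explicitly permitted.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /archive/public.html
Disallow: /archive/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

permitted = parser.can_fetch("Googlebot", "https://www.example.com/archive/public.html")
blocked = parser.can_fetch("Googlebot", "https://www.example.com/archive/other.html")

print(permitted)  # True
print(blocked)    # False
```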
The Crawl-delay parameter defines the amount of time (typically interpreted in seconds) that the web crawler should wait between successive requests to the website.
Note: Googlebot does not follow this instruction.
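For crawlers that do honor the delay, reading it could look like the sketch below, which relies on the crawl_delay method of Python's standard urllib.robotparser module. The value 10, the bot name ExampleBot and the helper wait_between_requests are arbitrary choices for illustration.

```python
import time
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# crawl_delay returns None when no delay applies to this user agent.
delay = parser.crawl_delay("ExampleBot")
print(delay)  # 10


def wait_between_requests(delay_value) -> None:
    """Pause before the next request if the site asked for a delay."""
    if delay_value:
        time.sleep(delay_value)
```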
The Sitemap instruction is supported by web crawlers from Google, Bing, Yahoo and Ask. It indicates the location of the XML sitemaps associated with the website.
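A short sketch of how a sitemap location might be declared and read back follows. It assumes Python 3.8 or later, where the standard parser offers site_maps, and the sitemap URL is a placeholder.

```python
from urllib.robotparser import RobotFileParser

# The Sitemap line stands apart from any User-agent record.
ROBOTS_TXT = """\
Sitemap: https://www.example.com/sitemap.xml

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps returns a list of sitemap URLs, or None if the file lists none.
sitemaps = parser.site_maps()
print(sitemaps)  # ['https://www.example.com/sitemap.xml']
```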