There are several reasons and scenarios that call for controlling the access of web robots (also called web crawlers or simply spiders) to our website. Just as Googlebot (the Google spider) visits our website, spam bots will visit too, usually to collect private information. And whenever a robot crawls our website, it also consumes a considerable amount of the website's bandwidth. It is easy to control robots by disallowing their access to our website through a simple 'robots.txt' file.
Creating a robots.txt:
Open a new file in any text editor, like Notepad. Save it as 'robots.txt' in the root directory of the website; robots look for the file only at the top level (e.g. example.com/robots.txt).
The rules in the robots.txt file are entered as 'field: value' pairs:
- field, which for access rules can have one of two possible values, Allow or Disallow (a User-agent field names the robot the rules apply to)
- value, the URL or URI path to which the rule applies
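For example, a complete rule set looks like this (the directory name is only illustrative):

    # Applies to all robots (* is a wildcard)
    User-agent: *
    # Blocks one directory
    Disallow: /example-dir/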
If we want to exclude all search engine robots from indexing our entire website, we enter the following into the robots.txt file:
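    # Block every robot from the whole site
    User-agent: *
    Disallow: /

Here * matches every robot, and Disallow: / covers every path on the site.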
If we want to exclude all the bots from a certain directory within our website, we would write the following:
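    User-agent: *
    Disallow: /private/

(/private/ is just an illustrative name; the trailing slash keeps the rule scoped to that directory.)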
For multiple directories, we add further Disallow lines under the same User-agent:
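    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /tmp/
    Disallow: /private/

(Again, the directory names are illustrative.)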
Access to specific documents can also be specified:
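For example, to block a single file (the path here is hypothetical):

    User-agent: *
    Disallow: /private/secret-report.html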
If we want to disallow a specific search engine bot from indexing our website, we name it in the User-agent field:
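For example, to turn away only Google's crawler (Googlebot is its published User-agent token):

    User-agent: Googlebot
    Disallow: /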
Advantages of Using Robots.txt:
- Avoids Wasting Server Resources
- Saves Bandwidth
- Removes Clutter and Complexity from Web Statistics, for Smoother Analytics
- Refuses Specific Robots
Common Errors and Mistakes in robots.txt:
- It is not Guaranteed to Work
robots.txt is purely advisory: well-behaved crawlers honor it, but a malicious bot can simply ignore it. To actually restrict access, use an .htaccess file in combination with .htpasswd (see the sketch after this list).
- It is not a Method to Protect Secret Directories
Any bot or agent can read robots.txt, so listing a secret directory in it actually advertises its location. Don't put any secret directories or files in it.
- Only One Directory/File per Disallow Line
The rules are singular: each field takes exactly one value, so blocking several paths requires several Disallow lines, as in the example below.
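For instance:

    # Wrong: two paths on one Disallow line
    Disallow: /cgi-bin/ /tmp/

    # Right: one path per line
    Disallow: /cgi-bin/
    Disallow: /tmp/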
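And since robots.txt offers no real protection, a directory that must stay private should be locked down at the server level instead. A minimal sketch for Apache, assuming a password file at /home/example/.htpasswd and a user named alice (the path and user name are illustrative):

    # .htaccess placed inside the directory to protect
    AuthType Basic
    AuthName "Restricted Area"
    # Path to the password file (adjust to your own server layout)
    AuthUserFile /home/example/.htpasswd
    Require valid-user

The password file itself can be created with Apache's htpasswd utility, e.g. htpasswd -c /home/example/.htpasswd alice.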