The robots.txt file explained

Posted: 20/06/2017 in Geek Stuff, Google

What is a robots.txt file?

Search engines use programs called robots (also known as spiders) to visit web pages automatically and collect their content.
You can create a plain text file named robots.txt on your website to declare which parts of the site robots should not access. This lets you keep part or all of the site’s content out of search engines, or specify that a search engine should include only particular content.

Where is the robots.txt file?

The robots.txt file should be placed in the root directory of the site. For example, when a robot visits a website (such as https://www.linkedin.com ), it first checks whether the file https://www.linkedin.com/robots.txt exists. If the robot finds this file, it uses the file’s contents to determine which parts of the site it may access.
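
As a rough illustration of the lookup described above, here is a minimal Python sketch that derives the robots.txt location from any page URL on a site. The function name robots_url is made up for this example, and the sketch assumes the file always lives at the root of the same scheme and host as the page.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that hosts page_url.

    Hypothetical helper: keeps the scheme and host, replaces the path
    with /robots.txt, and drops any query string or fragment.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    print(robots_url("https://www.linkedin.com/help/index.html"))
```

A robot would fetch this URL before requesting any other page on the same site.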

The format of the robots.txt file

The “robots.txt” file contains one or more records, separated by one or more blank lines (a line may end with CR, CR/NL, or NL). Each record is made up of lines in the following format:

"<field>:<optionalspace><value><optionalspace>"

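To make the line format concrete, the following Python sketch splits one line into its field and value. The name parse_line is hypothetical. The sketch splits on the first colon only and trims the optional spaces around the value. It also lowercases the field name, on the assumption that field names are matched case-insensitively.

```python
from typing import Optional, Tuple


def parse_line(line: str) -> Optional[Tuple[str, str]]:
    """Split '<field>:<optionalspace><value><optionalspace>' into (field, value).

    Returns None for lines that contain no colon (for example blank lines).
    """
    if ":" not in line:
        return None
    # Split on the first colon only, so values that contain ':' stay intact.
    field, value = line.split(":", 1)
    # Assumption: field names are case-insensitive, so normalise to lowercase.
    return field.strip().lower(), value.strip()


if __name__ == "__main__":
    print(parse_line("Disallow: /help "))
    print(parse_line("User-agent:*"))
```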
 

Comments can be added to the file with #, following the same convention as UNIX: everything from the # to the end of the line is ignored. A record usually begins with one or more User-agent lines, followed by a number of Disallow lines, as follows:

  • User-agent:
    The value of this field is the name of the search engine robot that the record applies to. If a record contains more than one User-agent line, the same rules apply to each of the named robots. Each record needs at least one User-agent line. If the value is *, the record applies to any robot that is not matched by another record. The “robots.txt” file may contain only one “User-agent: *” record.

  • Disallow:
    The value of this field is a URL path that you do not want robots to visit. It can be a complete path or just a prefix; any URL that starts with the value will not be accessed by the robot. For example, “Disallow: /help” blocks both /help.html and /help/index.html, whereas “Disallow: /help/” blocks /help/index.html but still allows /help.html. An empty Disallow value means that all parts of the site may be accessed. Each record needs at least one Disallow line. If “/robots.txt” is an empty file, the site is open to all search engine robots.
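
The prefix behaviour of Disallow can be tried out with Python's standard urllib.robotparser module. The sketch below is only an illustration: the helper is_allowed, the robot name ExampleBot, and the three sample files are made up for this example.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, path: str, agent: str = "ExampleBot") -> bool:
    """Return True if `agent` may fetch `path` under the given robots.txt text.

    Hypothetical helper: parses the text in memory, no network access.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


# Sample files, each with a single record that applies to every robot.
WITHOUT_SLASH = "User-agent: *\nDisallow: /help\n"
WITH_SLASH = "User-agent: *\nDisallow: /help/\n"
EMPTY_DISALLOW = "User-agent: *\nDisallow:\n"

if __name__ == "__main__":
    for name, rules in [("/help", WITHOUT_SLASH), ("/help/", WITH_SLASH)]:
        for path in ("/help.html", "/help/index.html"):
            print(f"Disallow: {name:7} {path:18} allowed={is_allowed(rules, path)}")
    print("empty Disallow value:", is_allowed(EMPTY_DISALLOW, "/help.html"))
    print("empty file:", is_allowed("", "/help.html"))
```

With “Disallow: /help” both paths should be reported as blocked, while “Disallow: /help/” should block only /help/index.html. The empty Disallow value and the empty file should allow everything.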