

What is a robots.txt file?

Search engines use programs called robots (also known as spiders) that automatically visit pages on the Internet and collect their content.
You can create a plain text file named robots.txt on your website to declare which parts of the site robots should not access. This lets you keep part or all of the site’s content out of the search engine, or specify that the search engine should include only certain content.

Where is the robots.txt file?

The robots.txt file should be placed in the root directory of the site. For example, when a robot visits a website (such as https://www.linkedin.com ), it first checks whether the file https://www.linkedin.com/robots.txt exists. If the robot finds this file, it uses the file’s contents to determine the scope of its access.
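As a minimal sketch of that lookup, the snippet below derives the robots.txt location from any page URL on a site. The helper name `robots_txt_url` is an assumption made for illustration; it only builds the URL and does not fetch anything.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host: robots.txt lives at the root of the
    # site, whatever path, query, or fragment the page URL carries.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://www.linkedin.com"))
print(robots_txt_url("https://www.linkedin.com/help/index.html?x=1"))
```

Both calls should print the same address, since only the scheme and host are kept.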

The format of the robots.txt file

The “robots.txt” file contains one or more records separated by blank lines (each line ends with CR, CR/NL, or NL). Each line in a record has the following format:

"<field>:<optionalspace><value><optionalspace>"

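The line format above can be sketched as a small parser. This is an illustrative reading of the format, not a complete implementation: the names `LINE_RE` and `parse_records` are assumptions, and lines that do not match the pattern are simply skipped.

```python
import re

# <field>:<optionalspace><value><optionalspace>
LINE_RE = re.compile(r"^(?P<field>[^:\s]+):[ \t]*(?P<value>.*?)[ \t]*$")


def parse_records(text: str) -> list[list[tuple[str, str]]]:
    """Split robots.txt text into records of (field, value) pairs."""
    records, current = [], []
    # splitlines() accepts CR, CR/NL, and NL line endings.
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current record, if there is one.
            if current:
                records.append(current)
                current = []
            continue
        match = LINE_RE.match(line)
        if match:
            current.append((match.group("field"), match.group("value")))
    if current:
        records.append(current)
    return records


sample = "User-agent: *\r\nDisallow: /help \r\n\r\nUser-agent: examplebot\nDisallow:\n"
for record in parse_records(sample):
    print(record)
```

The sample text mixes line endings and trailing spaces on purpose, and `examplebot` is a made-up robot name.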
 

Comments can be added to the file with #, following the same convention as in UNIX. A record usually begins with one or more User-agent lines, followed by a number of Disallow lines, as follows:

  • User-agent:
    The value of this field is the name of the search engine robot that the record applies to. If a record contains multiple User-agent lines, the same rules apply to each of the named robots. Every record must contain at least one User-agent line. If the value is set to *, the record applies to any robot. The “robots.txt” file may contain only one “User-agent: *” record.

 

  • Disallow:
    The value of this field is a URL that robots should not visit. It can be a complete path or a prefix: robots will not access any URL that begins with the Disallow value. For example, “Disallow: /help” blocks access to both /help.html and /help/index.html, while “Disallow: /help/” allows robots to access /help.html but not /help/index.html. An empty Disallow value indicates that all parts of the site may be accessed. Each record in the “/robots.txt” file must contain at least one Disallow line. If “/robots.txt” is an empty file, the site is open to all search engine robots.
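The prefix behaviour of Disallow can be checked with Python's standard `urllib.robotparser` module. The sketch below is only an illustration: the helper `allowed` and the robot name `examplebot` are assumptions, and the rules are parsed from in-memory strings rather than fetched from a real site.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, path: str, agent: str = "examplebot") -> bool:
    """Report whether agent may fetch path under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


without_slash = "User-agent: *\nDisallow: /help\n"
with_slash = "User-agent: *\nDisallow: /help/\n"

for name, rules in (("Disallow: /help", without_slash), ("Disallow: /help/", with_slash)):
    for path in ("/help.html", "/help/index.html"):
        print(name, path, allowed(rules, path))

# An empty Disallow value, or an empty file, leaves the whole site open.
print(allowed("User-agent: *\nDisallow:\n", "/help/index.html"))
print(allowed("", "/help/index.html"))
```

Only “Disallow: /help/” combined with /help.html should come out as allowed among the first four checks, matching the description above.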

As you know, most webmasters upload a file called robots.txt to their servers to tell crawlers such as Google, Yahoo, and Bing which pages must not be indexed.

Why would a webmaster want to hide some URLs? Checking these files is one of the first things a hacker can do. A hacker can gather a lot of valuable information by locating the data, scripts, and other resources that the webmaster wants to keep hidden…

Sometimes Google indexes the robots.txt file itself, giving hackers the opportunity to find words in it through Google searches.

For example, a hacker who wants to find user installations could use the robots.txt files indexed by Google to locate them and then try to exploit them.

inurl:”.kh/robots.txt” + “Disallow: /user/”

The hackers could locate WordPress installations by using…

inurl:”.com/robots.txt” + “Disallow: /wp-admin/”

The hackers could locate Joomla installations by using…

inurl:”/robots.txt” + “Disallow: joomla”

The hackers could locate Plesk Statistics installations by using…

inurl:”/robots.txt” + “Disallow:  plesk-stat”


The hackers could locate Drupal installations by using…

inurl:”.com/robots.txt” + “Disallow: ?q=admin”

The hackers could locate TinyMCE installations to gather information about the plugins installed on these servers and then try to exploit them…

inurl:”.com/robots.txt” + “Disallow: tinymce”

Is someone trying to hide their passwords?

inurl:”/robots.txt” + “Disallow: passwords.txt”

Be careful when you write your robots.txt: if someone checks it, or someone with imagination searches Google with these types of queries, you could become a hacker’s target…
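One way to act on this advice is to review your own robots.txt for Disallow values that give away more than they need to. The sketch below is a rough illustration only: the function name `revealing_disallows` and the keyword list `REVEALING_HINTS` are assumptions loosely based on the queries above, not an established checklist.

```python
# Hypothetical keywords, loosely based on the search queries shown above.
REVEALING_HINTS = ("admin", "password", "passwd", "backup", "user", "stat", "tinymce")


def revealing_disallows(robots_txt: str, hints=REVEALING_HINTS) -> list[str]:
    """Return Disallow values that contain one of the hint keywords."""
    findings = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        field, sep, value = line.partition(":")
        if not sep or field.strip().lower() != "disallow":
            continue
        path = value.strip()
        if any(hint in path.lower() for hint in hints):
            findings.append(path)
    return findings


sample = (
    "User-agent: *\n"
    "Disallow: /wp-admin/\n"
    "Disallow: /images/\n"
    "Disallow: /passwords.txt  # listed here for illustration\n"
)
print(revealing_disallows(sample))
```

With the sample above, the /wp-admin/ and /passwords.txt entries are flagged while /images/ is not. Any path the function reports is worth a second look before publishing the file.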