How to Read a Robots.txt File
Most of us in security know robots.txt as a way to find files or directories that webmasters don’t want found by search engines (and mistakenly believe are therefore hidden from humans as well). This post covers how robots.txt files are read by search bots.
A Brief Overview of Robots.txt
Robots.txt is a standard (formally known as the robots exclusion standard) used by websites to communicate with web crawlers. Typically, these crawlers or bots are trying to populate search engine content, determining which pages exist on the internet, what other pages they link to, and so on. To avoid having irrelevant or unwanted content show up in search results, website owners use robots.txt files to tell web crawlers what to ignore. For example, login pages typically aren’t good search result candidates, so those might be listed in a robots.txt file.
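For instance, a site might ask all crawlers to skip its login page with an entry like the following (the /login/ path is just an illustration; the syntax is explained in the next section):
User-agent: *
Disallow: /login/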
Robots.txt is now a focus of SEO (search engine optimization) practitioners, not just DevOps. The goal is to keep low-relevancy pages out of search results, allegedly raising the search engine’s overall view of the website.
You should also be aware that some web CMSes autogenerate robots.txt files, so the presence of robots.txt rules might be the result of manual entries, SEO tactics, and/or autogenerated output.
How Robots.txt Rules Work
The basic format of a robots.txt rule is as follows:
User-agent: [user-agent string]
Disallow: [URL string of file or directory not to be crawled]
For example:
User-agent: Googlebot
Disallow: /secretDirectory/
The Disallow directives can be stacked to disallow multiple files or directories for a given bot or crawler:
User-agent: Googlebot
Disallow: /secretDirectory/
Disallow: /moreSecrets/
Disallow: /passwords.txt
The same is true of user-agents:
User-agent: Googlebot
User-agent: Otherbot
Disallow: /noBotsAllowed/
A robots.txt file can also include an Allow directive. For example, the following rule would allow Googlebot to crawl the whole site:
User-agent: Googlebot
Allow: /
A rule group can also include an optional Crawl-delay directive, which specifies in seconds how long a bot should wait between successive requests. Note that this directive is non-standard, and some crawlers (Googlebot among them) ignore it.
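For example, the following group asks a crawler that honors the directive to wait ten seconds between requests (the ten-second value is arbitrary):
User-agent: Bingbot
Crawl-delay: 10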
A robots.txt file can include as many groups of these rules as desired, and may also reference a sitemap XML file via a Sitemap directive.
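Putting these pieces together, a complete file might look like the following (the paths and sitemap URL are made up for illustration):
User-agent: Googlebot
Disallow: /secretDirectory/

User-agent: *
Disallow: /login/
Crawl-delay: 10

Sitemap: https://www.example.com/sitemap.xml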
Robots.txt and Security
Many beginner CTF challenges use robots.txt as a challenge idea, or as a stepping stone to finding another part of the website. This is because webmasters often mistake robots.txt for a security feature.
Not only might humans see the listed files and directories and check them out, but some bots are programmed to ignore these listings, or even to focus specifically on robots.txt rules to find sensitive website content and scan it for vulnerabilities.
One way around this (for website owners) is to hide sensitive files within a directory, turn directory listing off, and then list only that directory in the robots.txt file (not the individual files).
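For instance, rather than listing /private/passwords.txt and /private/keys.txt individually (filenames hypothetical), the owner would disallow only the parent directory:
User-agent: *
Disallow: /private/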
Many scanners, including Nikto, check robots.txt for interesting values.
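To illustrate that recon step, here is a minimal Python sketch (standard library only) that fetches a site’s robots.txt and prints the disallowed paths; the base URL is a placeholder to replace with a site you are authorized to test:
import urllib.request
from urllib.parse import urljoin

# Placeholder target; replace with a site you have permission to assess.
base_url = "https://www.example.com"

# Fetch the robots.txt file from the site root.
with urllib.request.urlopen(urljoin(base_url, "/robots.txt")) as response:
    body = response.read().decode("utf-8", errors="replace")

# Print every Disallow entry; these are often the interesting paths.
for line in body.splitlines():
    line = line.strip()
    if line.lower().startswith("disallow:"):
        path = line.split(":", 1)[1].strip()
        if path:
            print(urljoin(base_url, path))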
Conclusion
If you are a security practitioner, always check robots.txt to help identify interesting pages and directories, and to gather clues about the site’s tech stack (if the file is autogenerated).
If you are a developer or work in DevOps, do not mistake robots.txt for security, as both humans and bots can ignore your wishes! Any sensitive files will need additional security in the form of access control or other restrictions.