How to Read a Robots.txt File
Most of us in security know robots.txt as a way to find files or directories that webmasters don’t want found by search engines (and mistakenly believe are therefore hidden from humans as well). This post covers how robots.txt files are read by search bots.
A Brief Overview of Robots.txt
Robots.txt is a standard (formally known as the robots exclusion standard) used by websites to communicate with web crawlers. Typically, these crawlers or bots are trying to populate search engine content, determining which pages exist on the internet, what other pages they link to, and so on. To avoid having irrelevant or unwanted content show up in search results, website owners use robots.txt files to tell web crawlers what to ignore. For example, login pages typically aren’t good search result candidates, so those might be listed in a robots.txt file.
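For instance, a site might ask all crawlers to skip its login page with an entry like the following (the /login/ path is just an illustration; the syntax is explained in the next section):
User-agent: *
Disallow: /login/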
Robots.txt is now a focus of SEO (search engine optimization) practitioners, not just DevOps. The goal is to keep low-relevancy pages out of search results, allegedly raising the search engine’s overall view of the website.
You should also be aware that some web CMSes autogenerate robots.txt files, so the presence of robots.txt rules might be the result of manual entries, SEO tactics, and/or autogenerated output.
How Robots.txt Rules Work
The basic format of a robots.txt rule is as follows:
User-agent: [user-agent string]
Disallow: [URL string of file or directory not to be crawled]
For example:
User-agent: Googlebot
Disallow: /secretDirectory/
The Disallow directives can be stacked to disallow multiple files or directories for a given bot or crawler:
User-agent: Googlebot
Disallow: /secretDirectory/
Disallow: /moreSecrets/
Disallow: /passwords.txt
The same is true of user-agents:
User-agent: Googlebot
User-agent: Otherbot
Disallow: /noBotsAllowed/
A robots.txt file can also include an Allow directive. For example, the following rule would allow Googlebot to crawl the whole site:
User-agent: Googlebot
Allow: /
A rule group can also include an optional Crawl-delay directive, which specifies in seconds how long a bot should wait between successive requests. Note that this directive is non-standard, and some crawlers (Googlebot among them) ignore it.
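For example, the following group asks a crawler that honors the directive to wait ten seconds between requests (the ten-second value is arbitrary):
User-agent: Bingbot
Crawl-delay: 10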
A robots.txt file can include as many groups of these rules as desired, and may also reference a sitemap XML file via a Sitemap directive.
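Putting these pieces together, a complete file might look like the following (the paths and sitemap URL are made up for illustration):
User-agent: Googlebot
Disallow: /secretDirectory/

User-agent: *
Disallow: /login/
Crawl-delay: 10

Sitemap: https://www.example.com/sitemap.xml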
Robots.txt and Security
Many beginner CTF challenges use robots.txt as a challenge idea, or as a stepping stone to finding another part of the website. This is because webmasters often mistake robots.txt for a security feature.
Not only might humans see the listed files and directories and check them out, but some bots are programmed to ignore these listings, or even to focus specifically on robots.txt rules to find sensitive website content and scan it for vulnerabilities.
One way around this (for website owners) is to hide sensitive files within a directory, turn directory listing off, and then list only that directory in the robots.txt file (not the individual files).
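For instance, rather than listing /private/passwords.txt and /private/keys.txt individually (filenames hypothetical), the owner would disallow only the parent directory:
User-agent: *
Disallow: /private/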
Many scanners, including Nikto, check robots.txt for interesting values.
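To illustrate that recon step, here is a minimal Python sketch (standard library only) that fetches a site’s robots.txt and prints the disallowed paths; the base URL is a placeholder to replace with a site you are authorized to test:
import urllib.request
from urllib.parse import urljoin

# Placeholder target; replace with a site you have permission to assess.
base_url = "https://www.example.com"

# Fetch the robots.txt file from the site root.
with urllib.request.urlopen(urljoin(base_url, "/robots.txt")) as response:
    body = response.read().decode("utf-8", errors="replace")

# Print every Disallow entry; these are often the interesting paths.
for line in body.splitlines():
    line = line.strip()
    if line.lower().startswith("disallow:"):
        path = line.split(":", 1)[1].strip()
        if path:
            print(urljoin(base_url, path))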
Conclusion
If you are a security practitioner, always check robots.txt to help identify interesting pages and directories, and to gather clues about the site’s tech stack (if the file is autogenerated).
If you are a developer or work in DevOps, do not mistake robots.txt for security, as both humans and bots can ignore your wishes! Any sensitive files will need additional security in the form of access control or other restrictions.