
Google Open Sources Its robots.txt Parser After 20 Years

The Robots Exclusion Protocol (REP), better known as robots.txt, is a standard that websites use to tell automated crawlers which parts of a site they may or may not crawl.

However, REP has never been adopted as an official web standard, which has led to differing interpretations. In a bid to make it one, Google has open-sourced its robots.txt parser, the C++ library it first created roughly 20 years ago. The code is available on GitHub.

REP was conceived back in 1994 by Dutch software engineer Martijn Koster, and today it is the de facto standard used by websites to instruct crawlers.

Google’s Googlebot crawler reads a site’s robots.txt file for instructions on which parts of the website it should ignore. If there is no robots.txt file, the bot assumes it may crawl the entire site.
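For readers who want to try the open-sourced library, the sketch below shows roughly how a crawler could check whether a URL is allowed under a site’s rules. It is only an illustration, assuming the googlebot::RobotsMatcher class and its OneAgentAllowedByRobots() method as exposed by the project’s robots.h header; the robots.txt body, URL, and user agent here are made up for the example, so consult the GitHub repository for the exact API.

```cpp
// Minimal sketch: check a URL against robots.txt rules with Google's
// open-sourced parser (github.com/google/robotstxt). Assumes the
// googlebot::RobotsMatcher class from the project's robots.h.
#include <iostream>
#include <string>

#include "robots.h"  // from google/robotstxt

int main() {
  // A small example robots.txt body: block /private/ for every crawler.
  const std::string robots_body =
      "User-agent: *\n"
      "Disallow: /private/\n";

  const std::string url = "https://example.com/private/report.html";
  const std::string user_agent = "Googlebot";

  googlebot::RobotsMatcher matcher;
  const bool allowed =
      matcher.OneAgentAllowedByRobots(robots_body, user_agent, url);

  std::cout << (allowed ? "allowed" : "disallowed") << std::endl;
  return 0;
}
```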


However, developers have interpreted the protocol “somewhat differently over the years,” Google says, leading to ambiguity and making it difficult to “write the rules correctly.”

For instance, what happens when a “text editor includes BOM characters in their robots.txt files”? And for crawler and tool developers, there is the open question of “how should they deal with robots.txt files that are hundreds of megabytes large?”

This is why Google wants REP to be officially adopted as an internet standard with fixed rules for all. The company says it has documented exactly how REP should be used and submitted its proposal to the Internet Engineering Task Force (IETF).

While we cannot say with certainty that REP will become an official standard, formal adoption would help web users and website owners alike, delivering more consistent search results while respecting each site’s wishes.
