
Google Open Sources Its robots.txt Parser After 20 Years

The Robots Exclusion Protocol (REP), better known as robots.txt, is a standard that websites use to tell automated crawlers which parts of a site they may or may not crawl.

However, REP has never been adopted as an official web standard, which has led to differing interpretations. In a bid to make it one, Google has open-sourced its robots.txt parser, the C++ library it first created roughly 20 years ago. The code is available on GitHub.

REP was conceived back in 1994 by Dutch software engineer Martijn Koster, and today it is the de facto standard used by websites to instruct crawlers.

Google’s Googlebot crawler reads a site’s robots.txt file for instructions on which parts of the website it should ignore. If there is no robots.txt file, the bot assumes it may crawl the entire site.
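For readers who want to try the open-sourced library, the sketch below shows roughly how a crawler could check whether a URL is allowed under a site’s rules. It is only an illustration, assuming the googlebot::RobotsMatcher class and its OneAgentAllowedByRobots() method as exposed by the project’s robots.h header; the robots.txt body, URL, and user agent here are made up for the example, so consult the GitHub repository for the exact API.

```cpp
// Minimal sketch: check a URL against robots.txt rules with Google's
// open-sourced parser (github.com/google/robotstxt). Assumes the
// googlebot::RobotsMatcher class from the project's robots.h.
#include <iostream>
#include <string>

#include "robots.h"  // from google/robotstxt

int main() {
  // A small example robots.txt body: block /private/ for every crawler.
  const std::string robots_body =
      "User-agent: *\n"
      "Disallow: /private/\n";

  const std::string url = "https://example.com/private/report.html";
  const std::string user_agent = "Googlebot";

  googlebot::RobotsMatcher matcher;
  const bool allowed =
      matcher.OneAgentAllowedByRobots(robots_body, user_agent, url);

  std::cout << (allowed ? "allowed" : "disallowed") << std::endl;
  return 0;
}
```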


However, developers have interpreted the protocol “somewhat differently over the years,” Google says, leading to ambiguity and making it difficult to “write the rules correctly.”

For instance, what happens when a “text editor includes BOM characters in their robots.txt files”? And for crawler and tool developers, there is the open question of “how should they deal with robots.txt files that are hundreds of megabytes large?”

This is why Google wants REP to be officially adopted as an internet standard with fixed rules for all. The company says it has documented exactly how REP should be used and submitted its proposal to the Internet Engineering Task Force (IETF).

While we cannot say with certainty that REP will become an official standard, formal adoption would help web users and website owners alike, delivering more consistent search results while respecting each site’s wishes.
