AI web-crawling bots, often seen as the cockroaches of the internet, have sparked frustration among many software developers. These bots can wreak havoc by overloading websites, sometimes bringing them down entirely. Open-source developers are disproportionately affected by these relentless bots, which ignore web protocols and disrupt their projects, writes Niccolò Venerandi, developer of the Plasma Linux desktop and owner of the blog LibreNews.
Free and open-source software (FOSS) projects are particularly vulnerable because of their open infrastructure and limited resources. Unlike commercial platforms, these sites are built to share code freely, which makes them an easy target for AI bots that don’t respect the Robots Exclusion Protocol, implemented through a site’s robots.txt file. Originally designed for search engine crawlers, robots.txt tells bots which parts of a website they may crawl and which they should leave alone.
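For context, robots.txt is just a plain-text file served at a site’s root, and honoring it is entirely voluntary. A generic example (with placeholder paths and bot names, not taken from any site mentioned here) looks like this:

```
# Served at https://example.com/robots.txt
User-agent: *          # rules for every crawler
Disallow: /private/    # please do not crawl anything under /private/

User-agent: ExampleBot # rules for one specific crawler (placeholder name)
Disallow: /            # asks this bot to stay away from the whole site
```

Nothing enforces these directives; a crawler that ignores the file faces no technical barrier at all, which is exactly the problem developers describe below.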
In January, developer Xe Iaso shared a “cry for help” on his blog, recounting how AmazonBot relentlessly bombarded a Git server site he was managing, causing Distributed Denial of Service (DDoS)-level outages that crashed the server. Iaso explained that the bot ignored his robots.txt file, concealed its identity behind proxy IP addresses, and pretended to be other users.
“It’s futile to block AI crawler bots because they lie, change their user agents, use residential IP addresses as proxies, and more,” Iaso lamented. “They will scrape your site until it falls over, and then they will scrape it some more. They will click every link on every link, viewing the same pages over and over again.”
In response, Iaso created a clever solution: Anubis. The tool, named after the Egyptian god who guides souls to the afterlife, is a reverse proxy that imposes a proof-of-work check: before a request is allowed to reach the server, it must pass a challenge. This strategy blocks bots while letting human visitors interact with the site. For added humor, the tool displays an anime drawing of Anubis when a human passes the challenge; if the visitor is a bot, the request is denied.
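The article doesn’t go into Anubis’s internals, but the general shape of a proof-of-work check is easy to sketch: the proxy hands the visitor a random nonce, the visitor’s browser burns a small amount of CPU finding an answer, and the proxy verifies that answer cheaply before forwarding the request. The cost is negligible for one human page view but adds up quickly for a crawler firing millions of requests. The Python sketch below illustrates that general scheme only; the function names, nonce format, and difficulty are illustrative assumptions, not Anubis’s actual implementation.

```python
import hashlib
import itertools
import secrets

DIFFICULTY_BITS = 16  # illustrative difficulty; a real deployment would tune this

def issue_challenge() -> str:
    """Proxy side: hand the visitor a random nonce to work on."""
    return secrets.token_hex(16)

def solve_challenge(nonce: str) -> int:
    """Visitor side: brute-force a counter whose hash has enough leading zero bits."""
    target = 1 << (256 - DIFFICULTY_BITS)
    for counter in itertools.count():
        digest = hashlib.sha256(f"{nonce}:{counter}".encode()).digest()
        if int.from_bytes(digest, "big") < target:
            return counter

def verify(nonce: str, counter: int) -> bool:
    """Proxy side: verification is a single hash, so the check stays cheap to run."""
    digest = hashlib.sha256(f"{nonce}:{counter}".encode()).digest()
    return int.from_bytes(digest, "big") < (1 << (256 - DIFFICULTY_BITS))

if __name__ == "__main__":
    nonce = issue_challenge()
    answer = solve_challenge(nonce)
    print(verify(nonce, answer))  # True: the request would be forwarded to the origin
```

The asymmetry is the point: solving takes thousands of hash attempts, while checking takes just one.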
Since its release on GitHub in mid-March, Anubis has gained rapid popularity within the FOSS community. Within a few days, it amassed 2,000 stars, 20 contributors, and 39 forks. Iaso’s witty approach struck a chord with developers frustrated by the aggressive tactics of web crawlers.
But Iaso’s story is far from unique. Venerandi shared numerous anecdotes of similar struggles faced by others in the open-source world. Drew DeVault, CEO of SourceHut, reported spending up to 100% of his time each week combating hyper-aggressive AI crawlers. His efforts were often futile, as bots caused multiple brief outages on his site.
Even industry veterans like Jonathan Corbet, who runs the Linux news site LWN, have felt the consequences of AI scraper bots. Corbet described how his site had been bogged down by DDoS-level traffic from bots. Kevin Fenzi, sysadmin for the Fedora Linux project, found the bots so aggressive that he had to block entire countries, including Brazil, to mitigate the damage.
Venerandi pointed out that some developers have gone so far as to temporarily ban entire countries, such as China, in their attempts to stop the relentless AI crawlers. “Let that sink in for a moment,” Venerandi said: developers are resorting to banning entire countries just to fend off bots that ignore robots.txt files.
While tools like Anubis provide some relief, other developers are exploring creative ways to outwit AI crawlers. In a recent Hacker News thread, user xyzal suggested filling pages disallowed by robots.txt with nonsensical, offensive, or otherwise unappealing content, so that crawlers get nothing of value and are discouraged from coming back. One such proposal involved loading these pages with articles promoting the benefits of drinking bleach or the positive effects of catching measles.
This idea echoes Nepenthes, a tool launched in January by a creator known only as “Aaron.” Nepenthes traps bots in an endless maze of fake content, a tactic that some consider aggressive, if not outright malicious. Similarly, Cloudflare recently released a tool called AI Labyrinth, designed to slow down, confuse, and waste the resources of AI crawlers that disregard website directives.
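The common trick behind these “maze” tools is to serve procedurally generated pages whose links lead only to more generated pages, so a crawler that ignores robots.txt wanders indefinitely without ever touching real content. The sketch below is a hypothetical, standard-library-only illustration of that idea, not code from Nepenthes or AI Labyrinth; the paths, delay, and page contents are invented.

```python
# Minimal illustration of a crawler tarpit: every request gets a slow,
# procedurally generated page whose links only lead deeper into the maze.
import hashlib
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

class TarpitHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Derive deterministic but meaningless child links from the current path.
        seed = hashlib.sha256(self.path.encode()).hexdigest()
        links = "".join(
            f'<li><a href="/{seed[i:i+8]}">section {seed[i:i+8]}</a></li>'
            for i in range(0, 40, 8)
        )
        body = f"<html><body><p>Lorem ipsum…</p><ul>{links}</ul></body></html>"
        time.sleep(2)  # deliberately slow response to waste the crawler's time
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(body.encode())

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), TarpitHandler).serve_forever()
```

Deliberately slow responses and deterministic fake links mean the bot spends time and bandwidth while the origin server does almost no work.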
Despite the array of tools and tactics now available to combat AI crawlers, many developers like DeVault find Anubis to be the most effective solution for their needs. However, DeVault also voiced a public plea for more substantial change: “Please stop legitimizing LLMs or AI image generators… just stop.”
While such a shift seems unlikely, developers in the FOSS community continue to fight back with wit, ingenuity, and a touch of humor.