Cloudflare is observing stealth crawling behaviour from Perplexity, an AI-powered answer engine. Although Perplexity initially crawls from their declared user agent, when they are presented with a network block, they appear to obscure their crawling identity in an attempt to circumvent the website’s preferences.
Cloudflare sees continued evidence that Perplexity is repeatedly modifying their user agent and changing their source ASNs to hide their crawling activity, as well as ignoring (or sometimes failing to even fetch) robots.txt files. The Internet as it has existed for the past three decades is rapidly changing, but one thing remains constant: it is built on trust. There are clear preferences that crawlers should be transparent, serve a clear purpose, perform a specific activity, and, most importantly, follow website directives and preferences. Based on Perplexity’s observed behaviour, which is incompatible with those preferences, Cloudflare has de-listed them as a verified bot and added heuristics to its managed rules that block this stealth crawling.
How well-meaning bot operators respect website preferences
The Internet has expressed clear preferences on how good crawlers should behave. All well-intentioned crawlers acting in good faith should:
- Be transparent. Identify themselves honestly, using a unique user-agent, a declared list of IP ranges or Web Bot Auth integration, and provide contact information if something goes wrong.
- Be well-behaved netizens. Don’t flood sites with excessive traffic, scrape sensitive data, or use stealth tactics to try and dodge detection.
- Serve a clear purpose. Whether it’s powering a voice assistant, checking product prices, or making a website more accessible, every bot has a reason to be there. The purpose should be clearly and precisely defined and easy for site owners to look up publicly.
- Separate bots for separate activities. Perform each activity from a unique bot. This makes it easy for site owners to decide which activities they want to allow. Don’t force site owners to make an all-or-nothing decision.
- Follow the rules. That means checking for and respecting website signals like robots.txt, staying within rate limits, and never bypassing security protections.
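To make the "be transparent" and "follow the rules" points concrete, here is a minimal sketch of how a crawler might consult robots.txt before fetching a page, using Python's standard `urllib.robotparser` module. The bot name `ExampleBot`, its contact URL, and the sample robots.txt rules are all hypothetical and chosen only for illustration; this is not how any particular crawler is implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical bot identity: a unique, honest user agent with a contact URL.
USER_AGENT = "ExampleBot/1.0 (+https://example.com/bot-info)"

# Hypothetical robots.txt content. A real crawler would download this from
# the target site (for example with RobotFileParser.set_url() and .read()).
SAMPLE_ROBOTS = """\
User-agent: ExampleBot
Disallow: /private/
Crawl-delay: 5

User-agent: *
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a parser that can answer fetch queries."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, url: str, user_agent: str = USER_AGENT) -> bool:
    """Return True only if the site's rules allow this user agent to fetch url."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(SAMPLE_ROBOTS)

# Paths outside /private/ are permitted for ExampleBot in this sample.
allowed_public = may_fetch(parser, "https://example.com/articles/1")

# /private/ is explicitly disallowed for ExampleBot.
allowed_private = may_fetch(parser, "https://example.com/private/data")

# A bot with no dedicated section falls back to the "*" rules (disallow all here).
allowed_other = may_fetch(parser, "https://example.com/articles/1", "OtherBot/2.0")

# Requested delay between requests, in seconds, for staying within rate limits.
delay = parser.crawl_delay(USER_AGENT)
```

In this sketch a well-behaved crawler would skip any URL for which `may_fetch` returns `False` and wait at least `delay` seconds between requests. Crucially, it would keep sending the same declared user agent rather than switching identities when it is blocked.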