The AI crawler landscape in 2026: who is crawling, how much, and what changed
The AI crawler ecosystem in 2026 looks nothing like it did in 2024. Two years ago, GPTBot and CCBot were the only AI crawlers most site owners had heard of. Today, more than a dozen distinct AI crawlers operate at scale, and their collective traffic has become a measurable share of total bot activity on the web.
This article provides a snapshot of the current landscape: who is active, how crawl volumes compare, what compliance looks like in practice, and what site owners should be watching.
The major crawlers in 2026
The first tier of AI crawlers — those operated by companies with the largest language models — includes GPTBot (OpenAI), ClaudeBot (Anthropic), Google-Extended (Google), and Bytespider (ByteDance). These four account for the majority of AI crawl volume on most sites.
GPTBot remains the most commonly discussed and most commonly blocked AI crawler. Its crawl patterns are moderate compared to some competitors: it respects crawl-delay signals (though this is not a formal part of the robots.txt specification), and it identifies itself clearly.
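Since crawl-delay is informal, a robots.txt entry for it is only a hint, not an enforceable limit. A minimal sketch (the delay value of 10 seconds is illustrative, not a recommendation from any operator):

```
# Ask GPTBot to pause between requests.
# Crawl-delay is not part of the formal robots.txt specification
# (RFC 9309); support varies by crawler, so treat this as a hint.
User-agent: GPTBot
Crawl-delay: 10
Allow: /
```

Crawlers that ignore crawl-delay must be throttled at the server level instead.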
ClaudeBot crawls at lower volumes than GPTBot on most sites, consistent with Anthropic's more conservative approach to web data collection. The introduction of ClaudeBot-User as a separate agent for real-time retrieval was a positive step toward distinguishing training crawls from live queries.
Google-Extended is unique because it is operated by Google alongside Googlebot. Blocking Google-Extended prevents your content from being used in Gemini training while keeping Google Search indexing intact. This separation was a direct response to publisher demand for granular control.
Bytespider continues to generate the highest raw volume of requests on many sites. Its aggressive crawling, high request rates with minimal pausing between them, makes it one of the crawlers most commonly blocked by server administrators even before content policy considerations enter the picture.
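For a crawler that ignores robots.txt hints, blocking has to happen in the web server itself. A sketch for nginx, assuming the user-agent string contains the token "Bytespider" (verify the exact token in your own access logs first):

```nginx
# nginx sketch: refuse Bytespider at the server level.
# The User-Agent substring match is an assumption; confirm the
# token against real log entries before deploying.
map $http_user_agent $is_bytespider {
    default      0;
    ~*Bytespider 1;
}

server {
    # ... existing listen / server_name / root directives ...
    if ($is_bytespider) {
        return 403;
    }
}
```

The map-plus-return pattern is cheap to evaluate and easy to extend with additional agent tokens as new aggressive crawlers appear.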
The second tier
A growing number of AI crawlers operate below the volume of the big four but are increasingly visible in server logs:
Meta's crawlers (FacebookBot and Meta-ExternalAgent) support both content preview and AI training for Meta's Llama models. Applebot-Extended separates Apple Intelligence training from Siri and Spotlight indexing. PerplexityBot supports Perplexity AI's search product. Amazonbot feeds both Alexa and Amazon's AI product recommendations. Cohere and AI2 (the Allen Institute for AI) operate research-focused crawlers that contribute to academic and commercial model training.
Each of these crawlers has different disclosure standards, different documentation quality, and different compliance postures. The fragmentation makes per-agent configuration in robots.txt increasingly important.
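A per-agent robots.txt handles this fragmentation by stating a policy for each crawler individually. A sketch using agents named in this article (confirm current tokens against each operator's documentation, since they do change):

```
# Per-agent policy sketch: block some AI crawlers, allow others.
# Agent tokens are those discussed in this article; verify them
# against each operator's published documentation.
User-agent: GPTBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: PerplexityBot
Allow: /

# Everyone else, including conventional search crawlers
User-agent: *
Allow: /
```

Note that most crawlers match only their own most specific group: an agent with its own `User-agent` block ignores the `*` rules entirely, so each named agent needs a complete policy of its own.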
What changed from 2024 to 2026
Several trends shaped the evolution of the landscape:
Agent differentiation accelerated. Google, Anthropic, and Apple all introduced separate user agents for training versus retrieval. This was the most important structural change because it gives site owners the ability to allow one use while blocking the other — a capability that did not exist in the binary allow/deny world of 2024.
Crawl volumes grew. The total volume of AI-related crawl requests increased across the board, reflecting both the expansion of training data pipelines and the growth of retrieval-augmented generation (RAG) systems that fetch pages in real time.
Compliance remained uneven. Major crawlers from well-known companies generally respect robots.txt. But undeclared or poorly documented crawlers continue to appear, particularly from smaller AI startups and research labs that do not invest in crawler transparency.
Regulatory pressure increased. The EU AI Act's transparency requirements and similar frameworks in Canada, Brazil, and Japan began influencing how AI companies document their crawlers. This pressure has not yet produced universal compliance, but it has improved disclosure from the largest operators.
What site owners should watch
The practical advice for 2026 is straightforward but requires more effort than in previous years:
Audit your server logs regularly. New AI crawlers appear without announcement. A quarterly review of unique user agents in your logs will reveal crawlers you may not have accounted for in your robots.txt.
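A quarterly audit can be scripted. The sketch below, assuming a combined-format access log (where the user agent is the last quoted field), counts requests per known AI agent; the `AI_AGENTS` list is illustrative, not exhaustive, and should be extended as new crawlers appear:

```python
import re
from collections import Counter

# Illustrative token list, not exhaustive; extend it as new
# crawlers show up in your logs.
AI_AGENTS = ["GPTBot", "ClaudeBot", "Google-Extended", "Bytespider",
             "PerplexityBot", "Applebot-Extended", "Amazonbot"]

# In the combined log format, the user agent is the last quoted field.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')

def audit_user_agents(log_lines):
    """Return a Counter of request counts per known AI agent token."""
    counts = Counter()
    for line in log_lines:
        match = UA_PATTERN.search(line)
        if not match:
            continue
        ua = match.group(1)
        for agent in AI_AGENTS:
            if agent in ua:
                counts[agent] += 1
    return counts
```

Feeding the output into the same script each quarter also gives you the volume trend data discussed below: a per-agent count that jumps between audits is worth investigating.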
Update your robots.txt for new agents. The list of known AI user agents is not static. When a new crawler appears in your logs, research its operator and decide whether to allow or block it. Better Robots.txt updates its crawler database with each release.
Monitor crawl volume trends. A sudden spike in bot traffic from a specific agent may indicate a new training run or a change in crawl behavior. Volume changes can also signal a crawler that is not respecting rate limits.
Distinguish training from retrieval. Where possible, use the differentiated user agents that AI companies have introduced. Allowing retrieval (which can send traffic) while blocking training (which extracts value without return) is the most balanced policy for most sites.
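In robots.txt terms, this policy means blocking the training agent while allowing its retrieval counterpart. A sketch using the differentiated agents mentioned above (verify the exact tokens against each operator's documentation before deploying):

```
# Sketch: block training crawls, allow retrieval and search.
# Agent names follow those discussed in this article.

# Anthropic: block training, allow real-time retrieval
User-agent: ClaudeBot
Disallow: /

User-agent: ClaudeBot-User
Allow: /

# Google: block Gemini training; Googlebot and Search
# indexing are unaffected by this token
User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
```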
Do not ignore the second tier. Smaller crawlers individually contribute less traffic, but collectively they can match or exceed the major ones. A robots.txt that only addresses GPTBot, ClaudeBot, and Google-Extended is incomplete.
The AI crawler landscape will continue to evolve. The sites best positioned to manage it are those that treat their robots.txt as a living policy document, reviewed and updated regularly, rather than a file written once and forgotten.