GPTBot, ClaudeBot, CCBot: who are the AI crawlers and what do they actually do

The web crawler ecosystem changed fundamentally between 2023 and 2025. For two decades, the dominant crawlers belonged to search engines: Googlebot, Bingbot, and a handful of others. Today, a new generation of bots operated by AI companies accounts for a significant share of crawler traffic — and their intentions, behaviors, and respect for site owner preferences vary widely.

Understanding who these crawlers are is the first step toward deciding how your site should respond to them.

GPTBot — OpenAI

GPTBot is OpenAI's web crawler, identified by the user-agent string GPTBot. It fetches publicly accessible web pages for two declared purposes: training future AI models and powering web browsing features in ChatGPT.

OpenAI published documentation stating that GPTBot respects robots.txt disallow rules. It also filters out paywalled content, content requiring login, and content that violates OpenAI's usage policies.

From a site owner's perspective, the key distinction is that GPTBot may use your content for model training, not just for answering user queries in real time. Blocking GPTBot in robots.txt prevents both uses.
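Blocking takes a single group in robots.txt. A minimal example using the GPTBot token from OpenAI's documentation (the /private/ path is illustrative):

```
# Block GPTBot from the whole site
User-agent: GPTBot
Disallow: /

# Or, to block only part of the site, scope the rule instead:
# User-agent: GPTBot
# Disallow: /private/
```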

ClaudeBot — Anthropic

ClaudeBot is operated by Anthropic, the company behind the Claude family of AI models. It crawls the web for training data collection and operates under the user-agent ClaudeBot (previously anthropic-ai).

Anthropic's documentation states that ClaudeBot respects robots.txt and supports targeted blocking. Like GPTBot, its primary use case is training data collection. Anthropic also publishes a separate user-agent, Claude-User, for real-time retrieval when a Claude user asks to fetch a specific page, though this distinction is not yet widely reflected in robots.txt configurations.
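In practice this means you can block the training crawler without writing any rules for the user-initiated agents; a robots.txt group only applies to the user-agent it names. A minimal sketch, assuming the ClaudeBot token Anthropic documents:

```
# Block Anthropic's training crawler only
User-agent: ClaudeBot
Disallow: /
```

Agents without a matching group fall back to any `User-agent: *` rules in the file.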

CCBot — Common Crawl

CCBot is the crawler behind Common Crawl, a nonprofit organization that maintains a massive, publicly available archive of the web. Common Crawl data has been used as training material by virtually every major language model, including GPT, Claude, LLaMA, and many open-source models.

CCBot's user-agent is CCBot. It respects robots.txt, but with an important nuance: even if you block CCBot today, your content may already exist in older Common Crawl snapshots that were collected before you added the block. Those historical snapshots are freely downloadable by anyone.

This makes CCBot a uniquely complex case. Blocking it is still good practice for forward-looking protection, but it does not retroactively remove your content from existing datasets.
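The forward-looking block itself is a one-line rule in robots.txt:

```
# Stop future Common Crawl snapshots from including this site
User-agent: CCBot
Disallow: /
```

Content captured in earlier snapshots remains in those archives regardless of this rule.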

Bytespider — ByteDance

Bytespider is operated by ByteDance, the parent company of TikTok. It is one of the most aggressive crawlers on the web by volume. Bytespider has been observed making extremely high numbers of requests, often without meaningful pauses between them, which can create noticeable load on smaller hosting environments.

ByteDance has not published clear public documentation about Bytespider's purpose, but it is widely understood to collect data for AI model training and for ByteDance's own search products. Many site owners block Bytespider preemptively based on its crawl aggressiveness alone.
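Because robots.txt compliance is voluntary, site owners who want a hard stop on an aggressive crawler often block it at the server level instead. A minimal nginx sketch (the case-insensitive substring match on the documented Bytespider token is the assumption here; place it inside the relevant server block):

```nginx
# Refuse any request whose User-Agent contains "bytespider"
if ($http_user_agent ~* "bytespider") {
    return 403;
}
```

The equivalent can be done in Apache with mod_rewrite or at a CDN/WAF layer, which also avoids spending origin resources on the blocked requests.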

Other notable AI crawlers

The landscape includes several other bots that site owners should be aware of:

  • Google-Extended: Google's dedicated AI training crawler, separate from Googlebot. Blocking Google-Extended prevents your content from being used in Gemini training while keeping your Google Search indexing intact.
  • FacebookBot: Used by Meta for AI training and content preview. The user-agent has existed for years but its role expanded with Meta's AI initiatives.
  • Amazonbot: Amazon's crawler for AI and Alexa-related services.
  • Applebot-Extended: Apple's AI training crawler, distinct from the standard Applebot used for Siri and Spotlight.
  • PerplexityBot: Operated by Perplexity AI, a search-focused AI product.
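Under the Robots Exclusion Protocol (RFC 9309), several User-agent lines can share one group of rules, so the crawlers above can be blocked together without repeating the Disallow line. A sketch using the tokens from the bullets above:

```
# One group covering the AI crawlers listed above; Googlebot is unaffected
User-agent: Google-Extended
User-agent: FacebookBot
User-agent: Amazonbot
User-agent: Applebot-Extended
User-agent: PerplexityBot
Disallow: /
```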

How AI crawlers differ from search engine crawlers

The fundamental difference is intent. Googlebot fetches your content to build a search index that sends traffic back to your site. There is a reciprocal relationship: you provide content, Google provides visitors.

AI crawlers break this reciprocity. They fetch your content to train models or generate answers — often without sending any traffic back to your site. When ChatGPT or Claude answers a user's question using knowledge derived from your content, the user has no reason to visit your site. The value extraction is one-directional.

This does not make AI crawlers inherently harmful. Some AI-powered search products (like Perplexity or Google's AI Overviews) do provide attribution and sometimes traffic. But the default dynamic is extractive, not reciprocal.

What you can do

The robots.txt file remains the primary mechanism for communicating your preferences to AI crawlers. Most major AI crawlers respect it, though compliance is declarative, not enforceable. No technical mechanism prevents a bot from ignoring your robots.txt — it is a social contract, not a firewall.

For WordPress site owners, the practical steps are:

  1. Know which AI crawlers are visiting your site (check your server logs or use a plugin that reports bot traffic).
  2. Decide which crawlers you want to allow and which you want to block, based on your content strategy and business model.
  3. Implement specific User-agent blocks for each category rather than a blanket allow or deny.
  4. Review these rules regularly, because new AI crawlers appear frequently and existing ones change their behavior.
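Step 1 can be done with a short script rather than a plugin. A sketch in Python that counts hits per AI crawler in combined-format access log lines (the bot token list is illustrative, drawn from the crawlers discussed above):

```python
import re
from collections import Counter

# User-agent tokens for the AI crawlers discussed above (illustrative list)
AI_BOTS = ["GPTBot", "ClaudeBot", "CCBot", "Bytespider",
           "Google-Extended", "FacebookBot", "Amazonbot",
           "Applebot-Extended", "PerplexityBot"]

def count_ai_bot_hits(log_lines):
    """Count requests per AI crawler token found in access log lines."""
    pattern = re.compile("|".join(re.escape(b) for b in AI_BOTS), re.IGNORECASE)
    counts = Counter()
    for line in log_lines:
        match = pattern.search(line)
        if match:
            counts[match.group(0)] += 1
    return counts

sample = [
    '1.2.3.4 - - [10/May/2025:10:00:00 +0000] "GET / HTTP/1.1" 200 1234 "-" '
    '"Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"',
    '5.6.7.8 - - [10/May/2025:10:00:01 +0000] "GET /post HTTP/1.1" 200 4321 "-" '
    '"Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '9.9.9.9 - - [10/May/2025:10:00:02 +0000] "GET /about HTTP/1.1" 200 100 "-" '
    '"Mozilla/5.0 (Windows NT 10.0)"',
]
print(count_ai_bot_hits(sample))
```

Run it over a slice of your real access log (for example, the output of `tail -n 10000 access.log`) to see which of these bots are actually visiting before deciding what to block.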

Better Robots.txt simplifies this by organizing AI crawlers into a dedicated governance module where you can set policies per bot or per category, with a preview of the final output before anything changes.