
Bot taxonomy

Better Robots.txt works best when machine visitors are separated by role, not only by brand or user-agent string.

That is the only stable way to publish policy without collapsing very different access patterns into one vague "AI bot" category.

Why the taxonomy must change

A flat taxonomy was good enough when most sites only had to distinguish:

  • search bots;
  • everything else.

That is no longer enough.

A practical taxonomy now needs to distinguish not only who is visiting, but also:

  • why they are visiting;
  • whether the traffic is automatic or user-triggered;
  • what control surface actually governs the decision.

Without that, teams make category mistakes.

They block Search when they only meant to refuse training.
They publish robots.txt rules for traffic that actually requires edge controls.
They treat answer-retrieval and model training as if they were the same use case.
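
As an illustration of the first mistake, here is a minimal robots.txt sketch that refuses training while leaving Search crawl open. The tokens shown (GPTBot, Google-Extended, Googlebot) are examples rather than a complete or current list; check each vendor's documentation before relying on them.

  # Refuse training-oriented crawlers and control tokens.
  User-agent: GPTBot
  Disallow: /

  User-agent: Google-Extended
  Disallow: /

  # Search crawl stays open.
  User-agent: Googlebot
  Allow: /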

Primary categories

Search crawlers

These bots support discovery, indexing, and refresh in search products.

Typical policy question: do you want to remain discoverable where those search systems can still send traffic back?

Training crawlers or training tokens

These surfaces exist to support future model development rather than direct search visibility.

Typical policy question: do you want your content used for training or improving future generative systems?

Answer or retrieval systems

These systems support answer generation, retrieval, grounding, or search-answer quality.

Typical policy question: do you want your content pulled into answer pipelines or retrieval workflows?

User-triggered fetchers

These requests exist because an end user asked a product to visit, fetch, or act on a URL.

Typical policy question: is this still a normal crawl decision, or should it be governed differently because the request is user-initiated?

Signed or verified agents

These are agents whose identity may be verified or allowlisted at the CDN, WAF, or infrastructure layer.

Typical policy question: is the real control problem now identity, verification, and runtime permissions rather than public crawl policy?

Archive bots

These bots capture or replay content for preservation or historical access.

Typical policy question: is archive capture acceptable, neutral, or undesirable for this site profile?

SEO tool bots

These bots crawl for SEO research, monitoring, and competitive intelligence.

Typical policy question: does the value justify the crawl load and extraction exposure?

Low-value or abusive bots

These bots create cost, extraction pressure, or noise without meaningful value to the site.

Typical policy question: should this be handled as a crawl policy issue, or as an infrastructure abuse issue?
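
Where the bots involved actually honor robots.txt, the crawl-policy answer can be as simple as the sketch below; the tool tokens (AhrefsBot, SemrushBot) are illustrative examples. Traffic that ignores robots.txt is an infrastructure problem (rate limiting, WAF rules, blocklists), not a crawl-policy problem.

  # SEO tool bots that honor robots.txt can be handled as crawl policy.
  User-agent: AhrefsBot
  Disallow: /

  User-agent: SemrushBot
  Disallow: /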

Secondary decision dimensions

A strong taxonomy also needs the following decision dimensions.

Discovery value

Does this machine visitor help the site appear where the site wants to be found?

Reuse or extraction risk

Does it increase low-return extraction or downstream reuse risk?

Trigger mode

Is the access automatic, user-triggered, mixed, or unclear?

Primary control surface

Is the decision mainly governed by:

  • robots.txt;
  • usage signals;
  • llms.txt;
  • page-level Search controls;
  • infrastructure or edge controls?
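
To make the surface distinction concrete, here is a hedged sketch of how a restriction on the same content looks on two of the surfaces above; the path is hypothetical, and the two directives answer different questions (crawling versus indexing).

  # robots.txt: controls whether compliant bots crawl the path at all.
  User-agent: *
  Disallow: /internal-reports/

  # Page-level Search control: an HTTP response header on a crawlable page,
  # governing indexing rather than crawling.
  X-Robots-Tag: noindex

They are not interchangeable: a page disallowed in robots.txt may never be fetched, so an on-page or header-level directive on it may never be seen.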

Verifiability

Can the request be verified in a trustworthy way, or is it only a claimed user-agent string?
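
One common way to answer this for major search crawlers is forward-confirmed reverse DNS: resolve the requesting IP to a hostname, check the hostname against the vendor's published domains, then resolve that hostname back to confirm it matches the original IP. A minimal Python sketch, assuming a Googlebot claim; production use would need timeouts, caching, and the vendor's current hostname suffixes or published IP ranges.

  import socket

  def is_verified_googlebot(ip: str) -> bool:
      """Forward-confirmed reverse DNS check for an IP claiming to be Googlebot."""
      try:
          # Reverse lookup: the IP should resolve to a Google-operated hostname.
          host = socket.gethostbyaddr(ip)[0]
          if not host.endswith((".googlebot.com", ".google.com")):
              return False
          # Forward confirmation: that hostname should resolve back to the same IP.
          return ip in socket.gethostbyname_ex(host)[2]
      except OSError:
          return False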

Infrastructure cost

Does the traffic create meaningful server, crawl, or operational cost?

Reversibility

Can the site safely change the decision later without major visibility damage?

One vendor can span several categories

This is the most important operational rule in the taxonomy.

A vendor should never be assumed to map to one category only.

Examples:

  • Google can appear as Search crawl, downstream-use control, and user-triggered agent traffic.
  • OpenAI can appear as Search crawl, training crawl, user-triggered fetching, and signed-agent traffic.
  • Anthropic can appear as training crawl, search-optimization crawl, and user-directed retrieval.
  • Apple can appear as Search crawl and downstream data-usage control.

That is why the taxonomy classifies roles first and names second.
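
In robots.txt terms, role-first classification means one vendor can expose several user-agent tokens, each of which deserves its own decision. A sketch using OpenAI's published tokens as an example; verify the names against current vendor documentation before publishing.

  # Same vendor, three roles, three separate decisions.

  # Search and discovery:
  User-agent: OAI-SearchBot
  Allow: /

  # Training:
  User-agent: GPTBot
  Disallow: /

  # User-triggered fetching:
  User-agent: ChatGPT-User
  Allow: /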

What this taxonomy is for

This taxonomy helps teams:

  • separate policy decisions before publication;
  • avoid mixing Search, training, retrieval, and user-triggered traffic;
  • choose the right control surface;
  • reduce contradictions across governance outputs.

What this taxonomy does not prove

This taxonomy does not prove:

  • crawler compliance;
  • technical enforcement;
  • legal enforceability by itself;
  • live runtime state;
  • whether a claimed user agent is authentic.

It is a governance classification layer. That is its role.