Bot taxonomy
Better Robots.txt works best when machine visitors are separated by role, not only by brand or user-agent string.
That is the only stable way to publish policy without collapsing very different access patterns into one vague "AI bot" category.
Why the taxonomy must change
A flat taxonomy was good enough when most sites only had to distinguish:
- search bots;
- everything else.
That is no longer enough.
A practical taxonomy now needs to distinguish not only who is visiting, but also:
- why they are visiting;
- whether the traffic is automatic or user-triggered;
- what control surface actually governs the decision.
Without that, teams make category mistakes.
They block Search when they only meant to refuse training.
They publish robots.txt rules for traffic that actually requires edge controls.
They treat answer-retrieval and model training as if they were the same use case.
Primary categories
Search crawlers
These bots support discovery, indexing, and refresh in search products.
Typical policy question: do you want to remain discoverable where those search systems can still send traffic back?
Training crawlers or training tokens
These surfaces exist to support future model development rather than direct search visibility.
Typical policy question: do you want your content used for training or improving future generative systems?
Answer or retrieval systems
These systems support answer generation, retrieval, grounding, or search-answer quality.
Typical policy question: do you want your content pulled into answer pipelines or retrieval workflows?
User-triggered fetchers
These requests exist because an end user asked a product to visit, fetch, or act on a URL.
Typical policy question: is this still a normal crawl decision, or should it be governed differently because the request is user-initiated?
Signed or verified agents
These are agents whose identity may be verified or allowlisted at the CDN, WAF, or infrastructure layer.
Typical policy question: is the real control problem now identity, verification, and runtime permissions rather than public crawl policy?
Archive bots
These bots capture or replay content for preservation or historical access.
Typical policy question: is archive capture acceptable, neutral, or undesirable for this site profile?
SEO tool bots
These bots crawl for SEO research, monitoring, and competitive intelligence.
Typical policy question: does the value justify the crawl load and extraction exposure?
Low-value or abusive bots
These bots create cost, extraction pressure, or noise without meaningful value to the site.
Typical policy question: should this be handled as a crawl policy issue, or as an infrastructure abuse issue?
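As a sketch of how role-based separation can look when published, the fragment below allows a search crawler while refusing training crawlers. The user-agent tokens shown (Googlebot, GPTBot, CCBot) are illustrative of the role split, not a complete or authoritative list; verify each vendor's currently documented tokens before publishing.

```
# Search crawl: keep discovery value
User-agent: Googlebot
Allow: /

# Training crawl: refuse use for future model development
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
```

Note that this only expresses crawl policy for bots that honor robots.txt; user-triggered fetchers and abusive traffic still need other control surfaces.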
Secondary decision dimensions
A strong taxonomy also needs the following decision dimensions.
Discovery value
Does this machine visitor help the site appear where the site wants to be found?
Reuse or extraction risk
Does it increase low-return extraction or downstream reuse risk?
Trigger mode
Is the access automatic, user-triggered, mixed, or unclear?
Primary control surface
Is the decision mainly governed by:
- robots.txt;
- usage signals;
- llms.txt;
- page-level Search controls;
- infrastructure or edge controls.
Verifiability
Can the request be verified in a trustworthy way, or is it only a claimed user-agent string?
Infrastructure cost
Does the traffic create meaningful server, crawl, or operational cost?
Reversibility
Can the site safely change the decision later without major visibility damage?
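The secondary dimensions above can be captured as one record per machine visitor, which makes the control-surface decision explicit instead of implicit. This is a minimal sketch; the field names and category values are illustrative, not part of any published schema.

```python
from dataclasses import dataclass


@dataclass
class BotProfile:
    """One machine visitor, scored on the secondary decision dimensions."""
    name: str
    discovery_value: bool   # helps the site appear where it wants to be found
    extraction_risk: bool   # raises low-return reuse or extraction exposure
    trigger_mode: str       # "automatic", "user-triggered", "mixed", "unclear"
    verifiable: bool        # identity can be checked, not just a claimed UA
    reversible: bool        # the decision can be changed later safely


def needs_edge_controls(bot: BotProfile) -> bool:
    """robots.txt is advisory. Traffic that cannot be verified, or that is
    not a plain automatic crawl, usually has to be governed at the
    CDN/WAF/infrastructure layer instead."""
    return (not bot.verifiable) or bot.trigger_mode in ("user-triggered", "unclear")


# Example: a claimed-but-unverified scraper is not a robots.txt problem alone.
scraper = BotProfile("unknown-scraper", discovery_value=False,
                     extraction_risk=True, trigger_mode="unclear",
                     verifiable=False, reversible=True)
print(needs_edge_controls(scraper))  # True
```

The point of the sketch is the separation itself: once trigger mode and verifiability are recorded per visitor, the "which control surface?" question answers itself more often than not.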
One vendor can span several categories
This is the most important operational rule in the taxonomy.
A vendor should never be assumed to map to one category only.
Examples:
- Google can appear as Search crawl, downstream-use control, and user-triggered agent traffic.
- OpenAI can appear as search bot, training bot, user-triggered visitor, and signed agent.
- Anthropic can appear as training, search optimization, and user-directed retrieval.
- Apple can appear as Search crawl and downstream data-usage control.
That is why the taxonomy classifies roles first and names second.
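The roles-first rule can be made concrete as a role map per vendor. The role assignments below mirror the examples above and are illustrative only; real per-vendor token lists and role coverage must come from each vendor's own documentation.

```python
# Roles are classified first; vendor names second.
# Illustrative role map (from the examples above, not a verified token list).
VENDOR_ROLES = {
    "Google":    {"search", "training-control", "user-triggered"},
    "OpenAI":    {"search", "training", "user-triggered", "signed-agent"},
    "Anthropic": {"training", "search", "user-directed-retrieval"},
}


def roles_for(vendor: str) -> set[str]:
    """One vendor decision is really one decision per role it occupies."""
    return VENDOR_ROLES.get(vendor, set())


# A single "block OpenAI" decision collapses four distinct policy questions:
print(len(roles_for("OpenAI")))  # 4
```

An unknown vendor maps to the empty set, which is itself a useful signal: with no classified role, there is no basis yet for a policy decision.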
What this taxonomy is for
This taxonomy helps teams:
- separate policy decisions before publication;
- avoid mixing Search, training, retrieval, and user-triggered traffic;
- choose the right control surface;
- reduce contradictions across governance outputs.
What this taxonomy does not prove
This taxonomy does not prove:
- crawler compliance;
- technical enforcement;
- legal enforceability by itself;
- live runtime state;
- whether a claimed user-agent string is authentic.
It is a governance classification layer. That is its role.