Why robots.txt is not enough for user-triggered AI agents
The modern web is no longer visited only by classic crawlers that discover pages for search.
Some machine visitors build search indexes. Some collect data for future model training. Some fetch pages to support answer-generation or grounding at query time. Some act only when a human asks them to. Some are verified and allowlisted at the CDN or WAF layer.
That changes the core question.
The question is no longer just "which bots should I block in robots.txt?" The better question is "which control surface fits which type of machine access?"
That distinction matters because robots.txt is still foundational, but it was designed for crawl access. It was not designed to cover every modern agentic workflow on its own.
robots.txt was built for crawl, not for every machine use
robots.txt remains the base layer of public crawl governance.
It is still the first file most automated crawlers check. It still gives site owners the most portable way to allow or disallow crawl by path and by agent. It is still the safest place to express category-level crawl intent.
But robots.txt only answers a specific question:
Can this crawler fetch this URL?
It does not, by itself, answer all of the other questions that now matter:
- how search snippets or previews should behave;
- whether content may be reused for training;
- whether content may be pulled into answer-generation at query time;
- whether a user-triggered fetch should be treated like automatic crawl;
- whether a signed agent should be allowed through runtime infrastructure controls.
So the problem in 2026 is not that robots.txt became useless.
The problem is that many teams now expect it to solve jobs that belong to other layers.
The 4 machine-access families you need to separate
A practical governance model now starts by separating at least four families of machine access.
1. Search crawlers
These crawlers support search discovery, re-crawl, indexing, and refresh.
Typical examples include classic search bots such as Googlebot and Applebot, and search-oriented AI discovery bots such as OAI-SearchBot.
The main business question is usually visibility.
Do you want this site to remain discoverable in search products that can still send traffic back?
2. Training crawlers
These crawlers or product tokens are about future model development, not search discovery.
Typical examples include GPTBot, ClaudeBot, Google-Extended, and Applebot-Extended.
The main question is not indexing. It is downstream model use.
Do you want your content used to train or improve future generative systems?
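As an illustration, a robots.txt sketch that keeps search crawl open while refusing training collection could look like the following; the product tokens are the real ones named above, and the blanket Disallow rules are placeholders for whatever your actual policy is.

```
# Search crawlers (Googlebot, Applebot, OAI-SearchBot) are left unrestricted here.

# Training-oriented collection is refused site-wide.
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /
```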
3. Answer or retrieval systems
These systems exist to support answers, grounding, search optimization, or retrieval at or near query time.
They may work like crawlers, like retrieval bots, or like hybrid systems that mix indexing and on-demand fetching.
The main question is not only visibility. It is how your content is consumed in answers.
Do you want link-and-snippet discovery, or live model input, or neither?
4. User-triggered or signed agents
These are the systems that most often break the old mental model.
Some requests are initiated by users. Some are authenticated or cryptographically verified by infrastructure providers. Some are designed to navigate the web and perform actions, not only fetch pages for indexing.
This is where robots.txt may stop being the primary lever.
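You can still publish a preference for some of these tokens where the vendor documents robots.txt support for them (ChatGPT-User and Claude-User, from the table below, are examples), but treat it as a stated preference rather than runtime enforcement; the path in this sketch is a placeholder.

```
# Stated preference for user-triggered fetchers; verified or signed agent
# traffic still needs infrastructure-level handling, not robots.txt alone.
User-agent: ChatGPT-User
Disallow: /members/

User-agent: Claude-User
Disallow: /members/
```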
Same vendor, different control surfaces
One of the biggest governance mistakes is assuming one vendor equals one crawler and one control point.
That is no longer true.
| Vendor | Surface | Main role | Primary control question |
|---|---|---|---|
| Google | Googlebot | Search crawl and search access | Do I want Google Search crawl and indexing? |
| Google | Google-Extended | Training and some other Google grounding uses | Do I allow this content to be reused outside Search in those systems? |
| Google | Google-Agent | User-triggered agent traffic | Is this traffic governed only by crawl policy, or does it require infrastructure handling? |
| OpenAI | OAI-SearchBot | Search visibility in ChatGPT search features | Do I want appearance in OpenAI search results? |
| OpenAI | GPTBot | Training collection | Do I allow training reuse? |
| OpenAI | ChatGPT-User | User-initiated visits | Should user-triggered retrieval be allowed here? |
| OpenAI | ChatGPT agent | Signed, allowlistable agent traffic | Do I need edge-level allowlisting and verification? |
| Anthropic | ClaudeBot | Training collection | Do I allow training collection? |
| Anthropic | Claude-SearchBot | Search optimization | Do I want my content indexed for Claude search quality? |
| Anthropic | Claude-User | User-directed retrieval | Do I allow user-triggered fetches? |
| Apple | Applebot | Search crawl for Apple surfaces | Do I want Apple search visibility? |
| Apple | Applebot-Extended | Data-usage control only | Do I allow Apple model-training use of Applebot-crawled data? |
| Bing / Microsoft | bingbot plus meta controls | Search crawl plus downstream Bing Chat / Copilot controls | Do I need crawl rules, preview rules, or both? |
The key lesson is simple:
A vendor name is not a governance strategy.
You need a per-surface strategy.
Use the right control surface for the right job
Once the access family is clear, the correct control layer becomes much easier to choose.
Use robots.txt for crawl access
Use robots.txt when the real question is whether a crawler should be allowed to fetch certain URLs at all.
This is the right surface for:
- search crawl access;
- many training crawler opt-outs;
- per-agent path rules;
- low-value path reduction;
- category-level crawl hygiene.
This is still the most important base layer.
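A minimal sketch of that kind of path-level crawl hygiene, with placeholder paths standing in for whatever is genuinely low-value on your own site:

```
User-agent: *
Disallow: /cart/
Disallow: /internal-search/
Allow: /

Sitemap: https://www.example.com/sitemap.xml
```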
Use meta robots or X-Robots-Tag for preview and indexing behavior
When the question is no longer crawl access but what may appear in search or preview output, page-level controls matter.
That is where meta robots and X-Robots-Tag belong.
They are the right layer for questions such as:
- should this page be indexed;
- should snippets be shown;
- should previews be reduced;
- should a resource stay crawlable but not indexable?
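For example, the crawlable-but-not-indexable case is a page-level directive, and preview reduction uses the same directive family; the values below are standard robots directives, shown only as one possible posture.

```
<meta name="robots" content="noindex">
```

The same directives can travel as an HTTP response header, which is useful for PDFs and other resources that cannot carry a meta tag:

```
X-Robots-Tag: max-snippet:50, max-image-preview:none
```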
This matters especially when a search ecosystem states that search behavior is still governed by its search crawler, not by a separate AI-training token.
Use AI usage signals when the question is downstream use
If your question is no longer "can it fetch?" but "what kinds of reuse do I allow?", then you are in the usage layer.
That is where AI usage signals and Content-Signal style distinctions become useful.
Typical distinctions include:
- search;
- ai-input;
- ai-train.
These are not universal enforcement mechanisms. But they express a more precise usage posture than binary crawl access alone.
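One illustrative shape, borrowing the Content-Signal style of stating those distinctions alongside crawl rules; verify the exact syntax of whichever signal specification you adopt before publishing, since this is a sketch rather than a universal standard.

```
# Usage posture: searchable, but not answer-time input or training material.
Content-Signal: search=yes, ai-input=no, ai-train=no

User-agent: *
Allow: /
```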
Use llms.txt and supporting governance files for attention and interpretation
llms.txt does not replace robots.txt. It does a different job.
It helps models and machine readers understand what matters on the site, what should be prioritized, and which reading path is safest.
The same is true for the rest of the supporting governance file stack published alongside it.
These surfaces reduce misreading. They do not act as a firewall.
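A small llms.txt sketch in the commonly proposed markdown shape; the site name, page titles, and URLs are placeholders.

```
# Example Site

> One-paragraph, plain-language summary of what this site covers and who it is for.

## Key pages

- [Product overview](https://www.example.com/overview): what the product does
- [Documentation](https://www.example.com/docs): the canonical reference content
```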
Use edge or infrastructure controls for signed and verified agents
Once you enter the world of signed agents, allowlisting, verification, or runtime permissions, you are no longer only in the crawl-policy layer.
That is infrastructure territory.
This can include:
- CDN or WAF allowlisting;
- bot verification at the edge;
- request validation;
- runtime rate controls;
- session and permission boundaries.
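As one concrete illustration of edge-level verification: major vendors document ways to confirm that a claimed crawler really comes from them, typically through published IP ranges or a reverse-then-forward DNS check. A manual spot check of a hit claiming to be Googlebot might look like this, with the IP taken from your own logs and the output described only as the expected shape.

```
host 66.249.66.1
# expected: a pointer to a hostname ending in googlebot.com

host crawl-66-249-66-1.googlebot.com
# expected: the same IP address you started from
```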
This is also the layer where the plugin's role should not be overclaimed.
What Better Robots.txt actually does
Better Robots.txt helps with the part that belongs to the publishing and governance layer.
It helps site owners:
- publish a structured robots.txt policy;
- separate crawler categories before policy is generated;
- publish AI usage signals and supporting governance surfaces;
- publish llms.txt and related machine-readable guidance;
- keep the site's published policy more coherent across outputs.
In other words, it helps you choose and publish a clearer policy.
It does not transform a policy file into runtime enforcement.
What belongs to infrastructure, not to the plugin
Some controls clearly live outside the plugin’s core scope.
That includes:
- signed-agent allowlisting;
- cryptographic verification;
- CDN and WAF rules;
- rate-limiting and abuse controls;
- authentication and session permissions;
- runtime enforcement of access decisions.
That is why Better Robots.txt should be described as a governance and publishing layer, not as a security product and not as a guaranteed enforcement layer.
Quick answers
Is Google-Agent the same thing as Google-Extended?
No.
Google-Extended is a product token that governs downstream use of Google-crawled content for future Gemini training and some other grounding uses. Google-Agent identifies user-triggered agent traffic from agents hosted on Google infrastructure. Those are different surfaces and different control problems.
Does blocking training automatically block discoverability?
No.
In many ecosystems, you can still stay visible in search while refusing some training-related reuse. That is the point of separating search crawlers from training crawlers or tokens.
Does llms.txt replace robots.txt?
No.
robots.txt remains the crawl-access layer. llms.txt is an attention and guidance layer. They solve different problems.
When do I need to move from plugin settings to edge controls?
When the problem requires verified identity, signed traffic, allowlisting, runtime permissions, WAF logic, or abuse control. At that point, the publishing layer is still useful, but it is no longer enough on its own.
Related
- Google-Extended vs Googlebot
- ChatGPT-User vs GPTBot vs OAI-SearchBot
- Claude-User vs ClaudeBot vs Claude-SearchBot
- Applebot vs Applebot-Extended
- Bing, noarchive, nocache, and Copilot
- Robots.txt vs signed agent allowlisting
- Search vs ai-input vs ai-train
- The AI crawler landscape in 2026
- The machine governance file stack
- Bot taxonomy
- AI & LLM governance settings
- llms.txt settings