Why robots.txt is not enough for user-triggered AI agents
The modern web is no longer visited only by classic crawlers that discover pages for search.
Some machine visitors build search indexes. Some collect data for future model training. Some fetch pages to support answer-generation or grounding at query time. Some act only when a human asks them to. Some are verified and allowlisted at the CDN or WAF layer.
That changes the core question.
The question is no longer just "which bots should I block in robots.txt?" The better question is "which control surface fits which type of machine access?"
That distinction matters because robots.txt is still foundational, but it was designed for crawl access. It was not designed to cover every modern agentic workflow on its own.
robots.txt was built for crawl, not for every machine use
robots.txt remains the base layer of public crawl governance.
It is still the first file most automated crawlers check. It still gives site owners the most portable way to allow or disallow crawl by path and by agent. It is still the safest place to express category-level crawl intent.
But robots.txt only answers a specific question:
Can this crawler fetch this URL?
It does not, by itself, answer all of the other questions that now matter:
- how search snippets or previews should behave;
- whether content may be reused for training;
- whether content may be pulled into answer-generation at query time;
- whether a user-triggered fetch should be treated like automatic crawl;
- whether a signed agent should be allowed through runtime infrastructure controls.
So the problem in 2026 is not that robots.txt became useless.
The problem is that many teams now expect it to solve jobs that belong to other layers.
The 4 machine-access families you need to separate
A practical governance model now starts by separating at least four families of machine access.
1. Search crawlers
These crawlers support search discovery, re-crawl, indexing, and refresh.
Typical examples include classic search bots such as Googlebot and Applebot, and search-oriented AI discovery bots such as OAI-SearchBot.
The main business question is usually visibility.
Do you want this site to remain discoverable in search products that can still send traffic back?
2. Training crawlers
These crawlers or product tokens are about future model development, not search discovery.
Typical examples include GPTBot, ClaudeBot, Google-Extended, and Applebot-Extended.
The main question is not indexing. It is downstream model use.
Do you want your content used to train or improve future generative systems?
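As an illustration, a robots.txt sketch that keeps search crawl open while refusing training collection could look like the following; the product tokens are the real ones named above, and the blanket Disallow rules are placeholders for whatever your actual policy is.

```
# Search crawlers (Googlebot, Applebot, OAI-SearchBot) are left unrestricted here.

# Training-oriented collection is refused site-wide.
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /
```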
3. Answer or retrieval systems
These systems exist to support answers, grounding, search optimization, or retrieval at or near query time.
They may work like crawlers, like retrieval bots, or like hybrid systems that mix indexing and on-demand fetching.
The main question is not only visibility. It is how your content is consumed in answers.
Do you want link-and-snippet discovery, or live model input, or neither?
4. User-triggered or signed agents
These are the systems that most often break the old mental model.
Some requests are initiated by users. Some are authenticated or cryptographically verified by infrastructure providers. Some are designed to navigate the web and perform actions, not only fetch pages for indexing.
This is where robots.txt may stop being the primary lever.
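You can still publish a preference for some of these tokens where the vendor documents robots.txt support for them (ChatGPT-User and Claude-User, from the table below, are examples), but treat it as a stated preference rather than runtime enforcement; the path in this sketch is a placeholder.

```
# Stated preference for user-triggered fetchers; verified or signed agent
# traffic still needs infrastructure-level handling, not robots.txt alone.
User-agent: ChatGPT-User
Disallow: /members/

User-agent: Claude-User
Disallow: /members/
```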
Same vendor, different control surfaces
One of the biggest governance mistakes is assuming one vendor equals one crawler and one control point.
That is no longer true.
| Vendor | Surface | Main role | Primary control question |
|---|---|---|---|
| Google | Googlebot | Search crawl and search access | Do I want Google Search crawl and indexing? |
| Google | Google-Extended | Training and some other Google grounding uses | Do I allow this content to be reused outside Search in those systems? |
| Google | Google-Agent | User-triggered agent traffic | Is this traffic governed only by crawl policy, or does it require infrastructure handling? |
| OpenAI | OAI-SearchBot | Search visibility in ChatGPT search features | Do I want appearance in OpenAI search results? |
| OpenAI | GPTBot | Training collection | Do I allow training reuse? |
| OpenAI | ChatGPT-User | User-initiated visits | Should user-triggered retrieval be allowed here? |
| OpenAI | ChatGPT agent | Signed, allowlistable agent traffic | Do I need edge-level allowlisting and verification? |
| Anthropic | ClaudeBot | Training collection | Do I allow training collection? |
| Anthropic | Claude-SearchBot | Search optimization | Do I want my content indexed for Claude search quality? |
| Anthropic | Claude-User | User-directed retrieval | Do I allow user-triggered fetches? |
| Apple | Applebot | Search crawl for Apple surfaces | Do I want Apple search visibility? |
| Apple | Applebot-Extended | Data-usage control only | Do I allow Apple model-training use of Applebot-crawled data? |
| Bing / Microsoft | bingbot plus meta controls | Search crawl plus downstream Bing Chat / Copilot controls | Do I need crawl rules, preview rules, or both? |
The key lesson is simple:
A vendor name is not a governance strategy.
You need a per-surface strategy.
Use the right control surface for the right job
Once the access family is clear, the correct control layer becomes much easier to choose.
Use robots.txt for crawl access
Use robots.txt when the real question is whether a crawler should be allowed to fetch certain URLs at all.
This is the right surface for:
- search crawl access;
- many training crawler opt-outs;
- per-agent path rules;
- low-value path reduction;
- category-level crawl hygiene.
This is still the most important base layer.
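A minimal sketch of that kind of path-level crawl hygiene, with placeholder paths standing in for whatever is genuinely low-value on your own site:

```
User-agent: *
Disallow: /cart/
Disallow: /internal-search/
Allow: /

Sitemap: https://www.example.com/sitemap.xml
```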
Use meta robots or X-Robots-Tag for preview and indexing behavior
When the question is no longer crawl access but what may appear in search or preview output, page-level controls matter.
That is where meta robots and X-Robots-Tag belong.
They are the right layer for questions such as:
- should this page be indexed;
- should snippets be shown;
- should previews be reduced;
- should a resource stay crawlable but not indexable?
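For example, the crawlable-but-not-indexable case is a page-level directive, and preview reduction uses the same directive family; the values below are standard robots directives, shown only as one possible posture.

```
<meta name="robots" content="noindex">
```

The same directives can travel as an HTTP response header, which is useful for PDFs and other resources that cannot carry a meta tag:

```
X-Robots-Tag: max-snippet:50, max-image-preview:none
```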
This matters especially when a search ecosystem states that search behavior is still governed by its search crawler, not by a separate AI-training token.
Use AI usage signals when the question is downstream use
If your question is no longer "can it fetch?" but "what kinds of reuse do I allow?", then you are in the usage layer.
That is where AI usage signals and Content-Signal style distinctions become useful.
Typical distinctions include:
- search;
- ai-input;
- ai-train.
These are not universal enforcement mechanisms. But they express a more precise usage posture than binary crawl access alone.
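One illustrative shape, borrowing the Content-Signal style of stating those distinctions alongside crawl rules; verify the exact syntax of whichever signal specification you adopt before publishing, since this is a sketch rather than a universal standard.

```
# Usage posture: searchable, but not answer-time input or training material.
Content-Signal: search=yes, ai-input=no, ai-train=no

User-agent: *
Allow: /
```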
Use llms.txt and supporting governance files for attention and interpretation
llms.txt does not replace robots.txt. It does a different job.
It helps models and machine readers understand what matters on the site, what should be prioritized, and which reading path is safest.
The same is true for the rest of the supporting governance file stack published alongside it.
These surfaces reduce misreading. They do not act as a firewall.
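A small llms.txt sketch in the commonly proposed markdown shape; the site name, page titles, and URLs are placeholders.

```
# Example Site

> One-paragraph, plain-language summary of what this site covers and who it is for.

## Key pages

- [Product overview](https://www.example.com/overview): what the product does
- [Documentation](https://www.example.com/docs): the canonical reference content
```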
Use edge or infrastructure controls for signed and verified agents
Once you enter the world of signed agents, allowlisting, verification, or runtime permissions, you are no longer only in the crawl-policy layer.
That is infrastructure territory.
This can include:
- CDN or WAF allowlisting;
- bot verification at the edge;
- request validation;
- runtime rate controls;
- session and permission boundaries.
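As one concrete illustration of edge-level verification: major vendors document ways to confirm that a claimed crawler really comes from them, typically through published IP ranges or a reverse-then-forward DNS check. A manual spot check of a hit claiming to be Googlebot might look like this, with the IP taken from your own logs and the output described only as the expected shape.

```
host 66.249.66.1
# expected: a pointer to a hostname ending in googlebot.com

host crawl-66-249-66-1.googlebot.com
# expected: the same IP address you started from
```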
This is also the layer where the plugin's role should not be overclaimed.
What Better Robots.txt actually does
Better Robots.txt helps with the part that belongs to the publishing and governance layer.
It helps site owners:
- publish a structured robots.txt policy;
- separate crawler categories before policy is generated;
- publish AI usage signals and supporting governance surfaces;
- publish llms.txt and related machine-readable guidance;
- keep the site's published policy more coherent across outputs.
In other words, it helps you choose and publish a clearer policy.
It does not transform a policy file into runtime enforcement.
What belongs to infrastructure, not to the plugin
Some controls clearly live outside the plugin’s core scope.
That includes:
- signed-agent allowlisting;
- cryptographic verification;
- CDN and WAF rules;
- rate-limiting and abuse controls;
- authentication and session permissions;
- runtime enforcement of access decisions.
That is why Better Robots.txt should be described as a governance and publishing layer, not as a security product and not as a guaranteed enforcement layer.
Quick answers
Is Google-Agent the same thing as Google-Extended?
No.
Google-Extended is a product token that governs downstream use of Google-crawled content for future Gemini training and some other grounding uses. Google-Agent identifies user-triggered agent traffic from agents hosted on Google infrastructure. Those are different surfaces and different control problems.
Does blocking training automatically block discoverability?
No.
In many ecosystems, you can still stay visible in search while refusing some training-related reuse. That is the point of separating search crawlers from training crawlers or tokens.
Does llms.txt replace robots.txt?
No.
robots.txt remains the crawl-access layer. llms.txt is an attention and guidance layer. They solve different problems.
When do I need to move from plugin settings to edge controls?
When the problem requires verified identity, signed traffic, allowlisting, runtime permissions, WAF logic, or abuse control. At that point, the publishing layer is still useful, but it is no longer enough on its own.
Related
- Google-Extended vs Googlebot
- ChatGPT-User vs GPTBot vs OAI-SearchBot
- Claude-User vs ClaudeBot vs Claude-SearchBot
- Applebot vs Applebot-Extended
- Bing, noarchive, nocache, and Copilot
- Robots.txt vs signed agent allowlisting
- Search vs ai-input vs ai-train
- The AI crawler landscape in 2026
- The machine governance file stack
- Bot taxonomy
- AI & LLM governance settings
- llms.txt settings